Beware Big Data Technology Zealotry

Undoubtedly you’ve heard it all before: “Hadoop is the next big thing, why waste your time with a relational database?” or “Hadoop is really only good for the following things” or “Our NoSQL database scales, other solutions don’t.” Invariably, there are hundreds of additional arguments proffered by big data vendors and technology zealots inhabiting organizations just like yours. However, there are few crisp binary choices in technology decision making, especially in today’s heterogeneous big data environments.

Courtesy of Flickr. Creative Commons. By Eden, Janine, and Jim.

Teradata CTO Stephen Brobst has a great story about a Stanford technology conference he attended. Apparently, one session featured “shouting matches” between relational database and Hadoop fanatics over which technology would better serve customers going forward. Mr. Brobst wasn’t amused, concluding: “As an engineer, my view is that when you see this kind of religious zealotry on either side, both sides are wrong. A good engineer is happy to use good ideas wherever they come from.”

Weighing various technology choices for your particular organization is a multi-faceted decision-making process. For example, suppose you are investigating a new application and/or database for a mission-critical job. Let’s also suppose your existing solution is working “good enough.” However, industry pundits, bloggers, and analysts are hyping the next big thing in technology and luring you toward it. At this point, alarm bells should be ringing. Let’s explore why.

First, for companies that are not start-ups, the idea of ripping out and replacing an existing, working solution should give every CIO and CTO pause. The use cases enabled by the new technology must stand out significantly from what the incumbent already delivers.

Second, unless your existing solution is fully depreciated (for on-premises, hardware-based solutions), you’re going to have a tough time getting past your CFO. Regardless of your situation, you’ll need compelling calculations for total cost of ownership (TCO), internal rate of return (IRR), and return on investment (ROI).
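
To make the CFO conversation concrete, here is a minimal sketch of those calculations in Python. The migration cost and yearly benefits are entirely hypothetical figures, and the bisection-based IRR is a simplification; treat this as an illustration of the arithmetic, not a model of any real platform.

```python
def npv(rate, cashflows):
    """Net present value of yearly cash flows (year 0 first)."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows))

def irr(cashflows, lo=-0.99, hi=10.0, tol=1e-6):
    """Internal rate of return via bisection: the rate at which NPV hits zero."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(mid, cashflows) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical figures: year 0 is the migration cost; years 1-4 are the
# net benefit over the incumbent "good enough" solution.
cashflows = [-1_200_000, 250_000, 400_000, 450_000, 500_000]

gain = sum(cashflows[1:])
print(f"ROI: {(gain + cashflows[0]) / -cashflows[0]:.1%}")  # simple return on investment
print(f"NPV @ 8%: ${npv(0.08, cashflows):,.0f}")            # value in today's dollars
print(f"IRR: {irr(cashflows):.1%}")                         # compare against your hurdle rate
```

If the IRR can’t clear your company’s hurdle rate with honest inputs, the “next big thing” conversation should probably end there.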

Third, you will need to investigate whether your company has the skill sets to develop and operate this new environment, or whether they are readily available from outside vendors.

Fourth, consider your risk tolerance, or appetite for failure: if this new IT project fails, will it be considered a “drop in the bucket,” or could it take down the entire company?

Finally, consider whether you’re succumbing to technology zealotry pitched by your favorite vendor or internal technologist. Oftentimes in technology decision making, the better choice is “and,” not “either/or.”

For example, more companies are adopting a heterogeneous technology environment for unified information, where multiple technologies and approaches work together to meet various needs for reporting, dashboards, visualization, ad hoc queries, operational applications, predictive analytics, and more. In essence, think about synergies and interoperability, not isolated technologies and processes.

As a counterpoint, some will argue that technology capabilities increasingly overlap, and that with a heterogeneous approach companies might be paying for some features twice. It is true that the lines are blurring: some of today’s relational databases can accept and process JSON (previously the purview of NoSQL databases), queries and BI reports can run on Hadoop, and “discovery work” can be completed on multiple platforms. However, considering the maturity and design of the various competing big data solutions, it does not appear, at least for the immediate future, that one size will fit all.
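
To see how blurred those lines have become, here is a small sketch of relational SQL reaching into JSON documents. It assumes a SQLite build that includes the JSON1 extension (enabled by default in most modern distributions); the table and documents are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, doc TEXT)")
conn.execute(
    "INSERT INTO events (doc) VALUES (?)",
    ('{"user": "alice", "action": "login", "device": {"os": "ios"}}',),
)

# Plain relational SQL querying inside a JSON document -- a capability once
# considered the exclusive purview of NoSQL stores.
rows = conn.execute(
    "SELECT json_extract(doc, '$.user'), json_extract(doc, '$.device.os') "
    "FROM events WHERE json_extract(doc, '$.action') = 'login'"
).fetchall()
print(rows)  # [('alice', 'ios')]
```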

When it comes to selecting big data technologies, objectivity and flexibility are paramount. You’ll have to settle on technologies based on your unique business and use cases, risk tolerance, financial situation, analytic readiness and more.

If your big data vendor or favorite company technologist is missing a toolbox or multi-faceted perspective and instead seems to employ a “to a hammer, everything looks like a nail” approach, you might want to look elsewhere for a competing point of view.

When Ideology Reigns Over Data

Increasingly, the mantra of “let the data speak for themselves” is falling by the wayside while ideology promotion zooms down the fast lane. There are dangers to reputations, companies, and global economies when researchers and statisticians either see what they want to see despite the data or, worse, gently massage data to get “the right results.”

Courtesy of Flickr. By Windell Oskay

Economist Thomas Piketty is in the news. After publishing his treatise “Capital in the Twenty-First Century,” Mr. Piketty was lauded by world leaders, fellow economists, and political commentators for bringing data and analysis to the perceived problem of growing income inequality.

In his book, Mr. Piketty posits that while wealth and income were grossly unequally distributed through the Industrial Revolution era, the advent of World Wars I and II changed the wealth dynamic, as tax increases helped pay for war recovery and social safety nets. Then, Piketty claims, his data show that after the early 1970s the top 1–10% of earners once again take more than their fair share. In Capital, Piketty’s prescriptions for remedying wealth inequality include an annual tax on capital and harsh taxation of up to 80% for the highest earners.

In this age of sharing and transparency, Mr. Piketty received acclaim for publishing his data sets and Excel spreadsheets for the entire world to see. However, this bold move could also prove to be his downfall.

The Financial Times, in a series of recent articles, claims that Piketty’s data and Excel spreadsheets don’t exactly line up with his conclusions. “The FT found mistakes and unexplained entries in his spreadsheet,” the paper reports. The articles also mention that a host of “transcription errors,” “incorrect formulas,” and “cherry-picked” data mars an otherwise serious body of work.

Once all the above errors are corrected, the FT concludes: “There is little evidence in Professor Piketty’s original sources to bear out the thesis that an increasing share of total wealth is held by the richest few.” In other words, ouch!

Here’s part of the problem: while income data are somewhat hard to piece together, wealth data for the past 100 years are even harder to find because of data quality and collection issues. As such, the data are bound to be of dubious quality and/or incomplete. In addition, it appears that Piketty could have used some friends to check and double-check his spreadsheet calculations, to spare him the Kenneth Rogoff/Carmen Reinhart treatment.
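
For what it’s worth, part of the “friends who double-check” role can be automated. Below is a minimal sketch of such a sanity check using pandas; every country, year, share, and “published” figure here is a made-up placeholder, not Piketty’s actual data.

```python
import pandas as pd

# Hypothetical source data: top-10% wealth shares by country and year.
raw = pd.DataFrame({
    "country":     ["US", "UK", "FR", "US", "UK", "FR"],
    "year":        [1970, 1970, 1970, 2010, 2010, 2010],
    "top10_share": [0.64, 0.60, 0.61, 0.74, 0.70, 0.66],
})

recomputed = raw.groupby("year")["top10_share"].mean()
published = pd.Series({1970: 0.62, 2010: 0.74})  # figures as claimed in a hypothetical text

# Flag any year where the published figure drifts from the underlying data.
diff = (published - recomputed).abs()
print(diff[diff > 0.01])  # anything printed here deserves a second look
```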

In working with data, errors come with the territory, and hopefully they are minimal. There is a more serious issue for any data worker, however: seeing what you want to see, even when the evidence says otherwise.

For example, Nicolas Baverez, a French economist, raised issues with Piketty’s data collection approach and “biased interpretation” of those data long before the FT report. Furthermore, Baverez thinks that Piketty had a conclusion in mind before he analyzed the data. In the magazine Le Point, Baverez writes: “Thomas Piketty has chosen to place himself under the shadow of (Karl Marx), placing unlimited accumulation of capital in the center of his thinking.”

The point of this particular article is not to knock down Mr. Piketty, nor his lengthy and well-researched tome. Indeed, we should not be dismissive of Mr. Piketty’s larger message: that there appears to be a growing gap between haves and have-nots, especially in terms of exorbitant CEO pay, stagnant middle-class wages, and a reduced safety net for the poorest Western citizens.

But Piketty appears to have had a solution in mind before he found a problem. He readily admits: “I am in favor of wealth taxation.” When ideology drives any data-driven approach, it becomes just a little easier to discard the data, observations, and evidence that don’t line up with what you’re trying to prove.

In 1977, statistician John W. Tukey said: “The greatest value of a picture is when it forces us to notice what we never expected to see.” Good science is the search for causes and explanations, free of dogma, with a willingness to accept outcomes contrary to our initial hypotheses. If we want true knowledge discovery, there can be no other way.

 

Be Wary of the Science of Hiring

Like it or not, “people analytics” are here to stay. But that doesn’t mean companies should put all their eggs in one basket and turn hiring and people management over to the algorithms. In fact, while reliance on experience/intuition to hire “the right person” is rife with biases, there’s also danger in over-reliance on HR analytics to find and cultivate the ultimate workforce.

Courtesy of Flickr. By coryccreamer

The human workforce appears ripe with promise for analytics. After all, if companies can figure out a better way to measure the potential “fit” of employees to various roles and responsibilities, the subsequent productivity improvements could be worth millions of dollars. In this vein, HR analytics is the latest rage: algorithms comb through mountains of workforce data to identify the best candidates and predict which ones will have lasting success.

According to an article in The Atlantic, efforts to quantify and measure the right factors in hiring and development have existed since the 1950s. Employers administered tests for IQ, math, vocabulary, vocational interest, and personality to find key criteria that would help them acquire and maintain a vibrant workforce. However, with the Civil Rights Act of 1964, some of those practices were pushed aside due to possible bias in test formulation and administration.

Enter “Big Data.” Today, data scarcity is no longer the norm. In actuality, there’s an abundance of data on candidates, who are either eager to supply it or ignorant of the digital footprint they’ve left since leaving elementary school. And while personality tests are no longer in vogue, new types of applicant “tests” have emerged, in which applicants are encouraged to play games that watch and measure how they solve problems and navigate obstacles in online dungeons or fictitious dining establishments.

Capturing “Big Data” seems to be the least of the challenges in workforce analytics. The larger issues are identifying key criteria for what makes a successful employee, and discerning how those criteria relate to and interplay with each other. For example, let’s say you’ve stumbled upon nirvana and found two key criteria for employee longevity. Hire for those criteria and you may have more loyal employees, but you still need to account and screen for “aptitude, skills, personal history, psychological stability, discretion,” work ethic, and more. And how does one weight these criteria in a hiring model?
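
One common answer, offered here as a hedged sketch rather than a recommendation, is to let a model learn the weights from past outcomes instead of guessing them. Everything below, the features, data, and retention label, is fabricated for illustration, and any real system would need the kind of bias auditing the Civil Rights Act history above demands.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical candidate features: aptitude, skills score, prior tenure, game score.
X = rng.random((500, 4))
# Hypothetical outcome: did the hire stay two-plus years? (synthetic rule plus noise)
y = (X @ np.array([0.8, 1.2, 0.5, 0.3]) + rng.normal(0, 0.3, 500)) > 1.4

model = LogisticRegression().fit(X, y)
for name, w in zip(["aptitude", "skills", "tenure", "game_score"], model.coef_[0]):
    print(f"{name:>10}: weight {w:+.2f}")  # the learned weighting of each criterion
```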

Next, presuming you’ve developed a reliable analytic model, it’s important to determine under which circumstances the model works. In other words, does a model that works for hiring hamburger flippers in New York also work for the same role in Wichita, Kansas? Does seasonality play a role? Does weather? Does the size of the company matter, or the prestige of its brand? Does the model work in economic recessions and expansions? As you can see, discovering all the relevant attributes for “hiring the right person” in a given industry, much less a given role, and then weighting them appropriately is a challenge for the ages.
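
Continuing the hypothetical sketch above, one way to probe the “does it travel?” question is to validate with whole markets held out, so each score reflects performance on a city the model never trained on. The city labels here are, again, invented.

```python
from sklearn.model_selection import GroupKFold, cross_val_score

# Assign each hypothetical candidate to one of five markets (e.g., cities).
cities = rng.integers(0, 5, size=len(y))
scores = cross_val_score(model, X, y, groups=cities, cv=GroupKFold(n_splits=5))
print(scores)  # wide variation across folds suggests the model may not generalize
```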

Worse, once your company has a working analytic model for human resource management, it’s important not to substitute it completely for subjective judgment. For example, in The Atlantic article, a high-tech recruiting manager lamented: “Some of our hiring managers don’t even want to interview anymore, they just want to hire the people with the highest scores.” It probably goes without saying, but this is surely a recipe for hiring disaster.

While HR analytics seems to have room to run, there’s still the outstanding question of whether “the numbers” matter at all in hiring the right person. For instance, Philadelphia Eagles coach Chip Kelly was recently asked why he hired his current defensive coordinator, who had less-than-stellar numbers in his last stint with the Arizona Cardinals.

Chip Kelly responded: “I think people get so caught up in statistics that sometimes it’s baffling to me. You may look at a guy and say, ‘Well, they were in the bottom of the league defensively.’ Well, they had 13 starters out. They should be at the bottom of the league defensively.”

He continued: “I hired [former Oregon offensive coordinator and current Oregon head coach] Mark Helfrich as our offensive coordinator when I was at the University of Oregon. Their numbers were not great at Colorado. But you sit down and talk football with Helf for about 10 minutes. He’s a pretty sharp guy and really brought a lot to the table, and he’s done an outstanding job.”

Efficient data capture, data quality, proper algorithmic development, and spurious correlations lurking in so much big data are just a few of the problems yet to be solved in HR analytics. That won’t stop the data scientists from trying, however. Ultimately, the best hires won’t come exclusively from HR analytics; the models will be paired with executive (subjective) judgment to find the ideal candidate for a given role. In the meantime, buckle your seatbelt for much more use of HR analytics. It’s going to be a bumpy ride.

 

When Big Data Loses to the Anecdote

It’s highly possible that in your next business meeting you may have the data, a solid analysis, and even your best recommendations based on a given set of facts, and still lose out to a competing presentation littered with personal anecdotes. That’s because while business cultures like to profess “In God we trust; all others must bring data,” the reality is that human beings still like a gripping narrative, and emotional stories can sometimes override what seems like the best decision on paper.

Courtesy of Flickr. By fotosterona

Try this scenario on for size: as a marketing professional, you must convince your CEO that your product needs a radical overhaul. You have tabulated numerical results from surveys, you’ve captured customer comments from online forums, and you’re ready to fire off “fact” after “fact” to make your case.

Your opponent is the best salesperson in the company. While you present to the CEO and make your case, “Fred” sits there with amusement, not only because he doesn’t support your premise, but also because he’s armed with powerful anecdotes. And when you’ve completed your “business case,” chock-full of facts and figures for a new product direction and approach, Fred calmly relates three personal customer stories on why your ideas will never work, then adds: “All the other sales reps in this company feel the same way.” Question: which argument do you think the CEO adopts?

If you’re honest, you’ll admit that the art of storytelling “wins” over “the facts” in most business cultures. One reason is that we’re often inclined to believe the personal anecdote over the data. Mathematics professor John Allen Paulos says: “In listening to stories we tend to suspend disbelief in order to be entertained, whereas in evaluating statistics we generally have an opposite inclination to suspend belief in order not to be beguiled.” It’s as if we turn our brains “on” for a story and “off” when numbers come up on the screen.

The second reason is that facts and figures tend to be abstract, whereas the personal anecdote seems “more real.” Take, for instance, Lucy Kellaway’s Financial Times column on Ryanair. The airline describes itself as “cheapo air”; to keep costs down, Ryanair invests more in infrastructure and operations than in customer service. Surely, Ryanair’s CEO has more online chatter, tweets, and phone calls than he’d care to collect, read, or listen to. Ryanair is swimming in data.

However, Lucy Kellaway relates that Ryanair’s CEO recently did an about-face on some of his most notorious airline policies. Why? Because he was tired of being accosted by angry customers when he dined out. While the online Twitterati complained, it was the personal, and often angry, anecdote delivered in person that caused this particular CEO to change his mind.

So it appears that the best strategy for making your next business case is a powerful narrative (go Malcolm Gladwell if you’re able), supported by a statistical underpinning where appropriate. Simply presenting numbers for numbers’ sake will only draw glassy eyes and blank stares from those you are trying to persuade. Keep in mind that while it’s tempting to believe we “must bring the data,” and lots of it, in order to persuade, sometimes it’s the personal anecdotes that end up making the final sale.

Science Needs Less Certainty

A disturbing trend is afoot in which key topics in science are increasingly considered beyond debate, or in other words, settled. However, good science isn’t without question, discovery, and even a bit of “humility,” something that scientists of all stripes (chemists, mathematicians, physicists, and yes, even data scientists) should remember.

Courtesy of Flickr. By epSos.de

Recently, the online site for Popular Science discontinued its comments for certain topics. The reasoning for the policy was clear, according to an editor: “A politically motivated, decades-long war on expertise has eroded the popular consensus on a wide variety of scientifically validated topics. Everything, from evolution to the origins of climate change, is mistakenly up for grabs again. Scientific certainty is just another thing for two people to “debate” on television.”

The message, then, was that because the science behind a smattering of topics was settled, there was no need for further debate. Instead, the magazine promised to open comments on “select articles that lend themselves to vigorous and intelligent discussion.”

Now, one can hardly blame Popular Science. Online commenting has been out of hand for some time, especially when denizens of the internet choose character assassination and cheap shots to prove a point. And to be sure, instead of enlightened discussion, comment sections sometimes devolve into least-common-denominator thinking.

That said, Popular Science couldn’t be more wrong. Last I checked, good science was all about hypothesizing, testing, discovery, and repeatability. It was about debate on fresh and ancient ideas alike, with an understanding that there is little certitude and many probabilities in play, especially because the world around us is constantly changing. We’re learning more, discovering more, and changing our theories to reflect the latest evidence. We’re testing ideas, failing fast, and moving on to the next experiment. And things we believe to be true today are sometimes proven less true, or completely false, tomorrow.

However, it disturbs me to see debate cut off on any topic because “we know the facts and the numbers prove them true.” Facts change, as Christopher Columbus would attest were he alive today. Worse, we have scientists who disparage others because 97% of “the collective” agree on a given topic, as if consensus determined what is true.

The blogger Epicurean Dealmaker laments on the same topic: “The undeniable strength of science as a domain of human thought is that it embeds skepticism…science is not science if it does not consist of theorems and hypotheses which are only—always and forever more—taken as potentially true until they are proven otherwise. And science itself declares its ambition to constantly test and retest its theories and assumptions for completeness, accuracy, and truth, even if this happens more often in theory than in fact.”

As we travel down the path of the next big thing, the transformation of multiple disciplines (business, medicine, artificial intelligence, and more) with “Big Data,” let us not forget that in a complex world, the analysis and numbers that prove one thing today may be woefully inadequate for tomorrow’s challenges.

So let’s encourage debate, discussion, the testing and re-testing of theories, and experimentation with data and analytic platforms to learn more about our customers, our companies, and ourselves. And don’t shut off debate because everyone agrees; chances are they do not. The old adage “conflict creates” holds true, whether in the chemistry lab or the data lab. The future of our companies, economies, and societies depends on it.

Societal Remedies for Algorithms Behaving Badly

In a world where computer programs are responsible for wild market swings, advertising fraud and more, it is incumbent upon society to develop rules and possibly laws to keep algorithms—and programmers who write them—from behaving badly.

Courtesy of Flickr. By 710928003

In the news, it’s hard to miss cases of algorithms running amok. Take, for example, the “Keep Calm and Carry On” debacle, in which the Solid Gold Bomb Company offered t-shirts bearing variations on the WWII “Keep Calm” propaganda phrase, such as “Keep Calm and Choke Her” or “Keep Calm and Punch Her.” No person in their right mind would sell, much less buy, such an item. However, the combinations were made possible by an algorithm that generated random phrases and appended them to the “Keep Calm” moniker.
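
The guardrail that was missing is cheap to picture. Here is a minimal sketch: a generator that mass-produces slogan variants, and a filter that refuses to publish any variant containing a term from a human-reviewed blocklist. The word lists are invented for illustration, not the vendor’s actual ones.

```python
import itertools

VERBS = ["carry", "dream", "code", "punch", "choke"]
OBJECTS = ["on", "big", "more", "her"]
BLOCKLIST = {"punch", "choke", "her"}  # a real list would be far larger and curated

def safe_slogans():
    for verb, obj in itertools.product(VERBS, OBJECTS):
        if {verb, obj} & BLOCKLIST:
            continue  # never auto-publish a flagged combination
        yield f"Keep Calm and {verb.title()} {obj.title()}"

for slogan in safe_slogans():
    print(slogan)
```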

In another instance, advertising agencies are buying online ads across hundreds of thousands of web properties every day. But according to a Financial Times article, hackers are deploying “botnet” algorithms, running on compromised PCs, to click on advertisements and run up advertiser costs. This click fraud is estimated to cost advertisers more than $6 million a month.

Worse, the “hash crash” of April 23, 2013, trimmed 145 points off the Dow Jones index in a matter of minutes. In this case, the Associated Press Twitter account was hacked by the Syrian Electronic Army, and a post went up mentioning “Two Explosions in the White House…with Barack Obama injured.” With trading computers reading the news, it took just a few seconds for algorithms to shed positions in stock markets, without any understanding of whether the AP tweet was genuine.
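
One possible safeguard, sketched here with a hypothetical feed format, terms, and thresholds, is to refuse to act on a single source: a market-moving headline must be corroborated by independent feeds within a short window before any position is shed.

```python
import time

PANIC_TERMS = {"explosion", "attack", "white house"}

def corroborated(headline, feeds, window_seconds=60, min_sources=2):
    """Act only if independent sources report the same story within a window."""
    matches = [
        f for f in feeds
        if any(term in f["text"].lower() for term in PANIC_TERMS)
        and abs(f["timestamp"] - headline["timestamp"]) < window_seconds
    ]
    return len({f["source"] for f in matches}) >= min_sources

tweet = {"source": "ap_twitter", "timestamp": time.time(),
         "text": "Two Explosions in the White House..."}
feeds = [tweet]  # no independent confirmation yet

if corroborated(tweet, feeds):
    print("sell")  # fires only once a second source confirms
else:
    print("hold: single-source report, awaiting corroboration")
```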

In the wake of the “Keep Calm” and “hash crash” fiascos, the companies involved quickly trotted out apologies and excuses for algorithms behaving badly. Yet while admissions of guilt and promises to “do better” are appropriate, society can and should demand better outcomes.

First, it is possible to program algorithms to behave more honorably. For example, IBM’s Watson team noticed, in preparation for Watson’s televised Jeopardy! appearance, that the machine would sometimes curse. This was simply a programming issue: Watson would scour its data sources for the most likely answer to a question, and sometimes those answers contained profanities. Realizing that a machine cursing on national television wouldn’t go over very well, the programmers gave Watson a “swear filter” to screen out offensive words.

Second, public opprobrium is a valuable tool. The “Keep Calm” algorithm nightmare was written up in numerous online and mainstream publications such as the New York Times. Companies that don’t program algorithms in an intelligent manner could find their brands highlighted in case studies of “what not to do” for decades to come.

Third, algorithms that perform reckless behavior could (and in the instance of advertising fraud, should) get a company into legal hot water. That’s the suggestion of Scott O’Malia, Commissioner of the Commodity Futures Trading Commission. According to a Financial Times article, O’Malia says that in stock trading, “reckless behavior” might be “replacing market manipulation” as the standard for prosecuting misbehavior. What constitutes “reckless” might be up for debate, but it’s clear that more financial companies are trading based on real-time news feeds. Therefore, wherever possible, Wall Street quants should be careful to program algorithms not to perform actions that could wipe out the financial holdings of others.

Algorithms, by themselves, don’t actually behave badly; after all, they are simply coded to perform actions when a specific set of conditions occurs.

Programmers must realize that in today’s world, with 24-hour news cycles, variables are increasingly correlated: when one participant moves, a cascade effect is likely to follow. Brands can be damaged in the blink of an eye when poorly coded algorithms run wild. With this in mind, programmers, and the companies that employ them, need to be more responsible with their algorithmic development and use scenario thinking to ensure a cautious approach.

Preserving Big Data to Live Forever

If anyone knows how to preserve data and information for long-term value, it’s the programmers at the Internet Archive, based in San Francisco, CA. In fact, the Internet Archive is attempting to capture every webpage, video, television show, MP3 file, and DVD published anywhere in the world. If the Internet Archive is seeking to keep and preserve data for centuries, what can we learn from this non-profit about architecting a solution to keep our own data safeguarded and accessible for the long term?

Long term horizon by Irargerich. Courtesy of Flickr.

There’s a fascinating 13-minute documentary on the work of data curators at the Internet Archive, whose mission is “universal access to all knowledge.” In their efforts to crawl every webpage, scan every book, and make information available to any citizen of the world, the Internet Archive team has designed a system that is resilient, redundant, and highly available.

Preserving knowledge for generations is no easy task. Key components of this massive undertaking include decisions in technology, architecture, data storage, and data accessibility.

First, just about every technology used by the Internet Archive is either open-source software or commodity hardware. For web crawling and adding content to its digital archives, the Internet Archive developed Heritrix. To enable full-text search on the Internet Archive’s website, Nutch runs on the Hadoop file system to “allow Google-style full-text search of web content, including the same content as it changes over time.” Some sources also mention that HBase may be in the mix as a database technology.

Second, the concepts of redundancy and disaster planning are baked into the overall Internet Archive architecture. The non-profit has servers located in San Francisco, but in keeping with its multi-century vision, the Internet Archive mirrors data in Amsterdam and Egypt to weather the volatility of historical events.

Third, many companies struggle to decide which data they should use, archive, or throw away. However, with the plummeting cost of hard disk storage and open-source Hadoop, capturing and storing all data in perpetuity is more feasible than ever. At the Internet Archive, all data are captured and nothing is thrown away.

Finally, it’s one thing to capture and store data, and another to make it accessible. The Internet Archive aims to make the world’s knowledge base available to everyone. On the Internet Archive site, users can search and browse ancient documents, view recorded video from years past, and listen to music from artists who no longer walk planet earth. Brewster Kahle, founder of the Internet Archive, says that with a simple internet connection, “A poor kid in Kenya or Kansas can have access to…great works no matter where they are, or when they were (composed).”
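
That accessibility is easy to demonstrate: the Wayback Machine exposes a public availability endpoint that returns the closest archived snapshot of a given URL, which a few lines of Python can query (endpoint as publicly documented at the time of writing).

```python
import json
from urllib.request import urlopen

url = "https://archive.org/wayback/available?url=example.com&timestamp=2006"
with urlopen(url) as resp:
    data = json.load(resp)

# The response nests the nearest snapshot under archived_snapshots/closest.
snapshot = data.get("archived_snapshots", {}).get("closest", {})
print(snapshot.get("timestamp"), snapshot.get("url"))
```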

Capturing a mountain of multi-structured data (currently 10 petabytes and growing) is an admirable feat; however, the real magic lies in the Internet Archive’s multi-century vision of making sure the world’s best and most useful knowledge is preserved. Political systems come and go, but with the Internet Archive’s Big Data preservation approach, the treasures of the world’s digital content will hopefully exist for centuries to come.
