Beware Big Data Technology Zealotry

Undoubtedly you’ve heard it all before: “Hadoop is the next big thing, why waste your time with a relational database?” or “Hadoop is really only good for the following things” or “Our NoSQL database scales, other solutions don’t.” Invariably, there are hundreds of additional arguments proffered by big data vendors and technology zealots inhabiting organizations just like yours. However, there are few crisp binary choices in technology decision making, especially in today’s heterogeneous big data environments.

Courtesy of Flickr. Creative Commons. By Eden, Janine, and Jim.

Teradata CTO Stephen Brobst has a great story regarding a Stanford technology conference he attended. Apparently in one session there were “shouting matches” between relational database and Hadoop fanatics as to which technology better served customers going forward. Mr. Brobst wasn’t amused, concluding: “As an engineer, my view is that when you see this kind of religious zealotry on either side, both sides are wrong. A good engineer is happy to use good ideas wherever they come from.”

Considering various technology choices for your particular organization is a multi-faceted decision making process. For example, suppose you are investigating a new application and/or database for a mission critical job. Let’s also suppose your existing solution is working “good enough”. However, the industry pundits, bloggers and analysts are hyping and luring you towards the next big thing in technology. At this point, alarm bells should be ringing. Let’s explore why.

First, for companies that are not start-ups, the idea of ripping and replacing an existing and working solution should give every CIO and CTO pause. The use cases enabled by this new technology must significantly stand out.

Second, unless your existing solution is fully depreciated (for on-premises, hardware based solutions), you’re going to have a tough time getting past your CFO. Regardless of your situation, you’ll need compelling calculations for TCO, IRR and ROI.
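To make the CFO conversation concrete, here is a minimal sketch of the three calculations named above. The function names, the bisection search bounds and all dollar figures are assumptions for illustration only; a real business case would use your finance team's own model.

```python
def tco(purchase_cost: float, annual_operating_cost: float, years: int) -> float:
    """Total cost of ownership: up-front cost plus operating costs over the period."""
    return purchase_cost + annual_operating_cost * years


def roi(total_gain: float, total_cost: float) -> float:
    """Simple return on investment: net gain divided by cost."""
    return (total_gain - total_cost) / total_cost


def npv(rate: float, cash_flows: list[float]) -> float:
    """Net present value; cash_flows[0] is the up-front (usually negative) outlay."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))


def irr(cash_flows: list[float], lo: float = -0.99, hi: float = 10.0,
        tol: float = 1e-9) -> float:
    """Internal rate of return via bisection: the rate at which NPV is zero."""
    if npv(lo, cash_flows) * npv(hi, cash_flows) > 0:
        raise ValueError("NPV does not change sign on the search interval")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(lo, cash_flows) * npv(mid, cash_flows) <= 0:
            hi = mid  # the root lies in the lower half
        else:
            lo = mid  # the root lies in the upper half
    return (lo + hi) / 2


# Hypothetical project: 500,000 up front, 200,000 returned per year for four years.
project = [-500_000, 200_000, 200_000, 200_000, 200_000]
print(f"TCO: {tco(500_000, 50_000, 4):,.0f}")
print(f"ROI: {roi(800_000, 500_000):.0%}")
print(f"IRR: {irr(project):.1%}")
```

Comparing the IRR of the new platform against the IRR of simply extending the existing one is often more persuasive than either number on its own.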

Third, you will need to investigate whether your company has the skill sets to develop and operate this new environment, or whether they are readily available from outside vendors.

Fourth, consider your risk tolerance or appetite for failure. If this new IT project fails, will it be considered a “drop in the bucket” or could it take down the entire company?

Finally, consider whether you’re succumbing to technology zealotry pitched by your favorite vendor or internal technologist. Oftentimes in technology decision making, the better choice is “and”, not “either/or”.

For example, more companies are adopting a heterogeneous technology environment for unified information where multiple technologies and approaches work together in unison to meet various needs for reporting, dashboards, visualization, ad-hoc queries, operational applications, predictive analytics, and more. In essence, think more about synergies and inter-operability, not isolated technologies and processes.

In counterpoint, some will argue that technology capabilities increasingly overlap, and with a heterogeneous approach companies might be paying for some features twice. It is true that lines are blurring regarding technology capabilities as some of today’s relational databases can accept and process JSON (previously the purview of NoSQL databases), queries and BI reports can run on Hadoop, and “discovery work” can complete on multiple platforms. However, considering the maturity and design of various competing big data solutions, it does not appear—for the immediate future—that one size will fit all.

When it comes to selecting big data technologies, objectivity and flexibility are paramount. You’ll have to settle on technologies based on your unique business and use cases, risk tolerance, financial situation, analytic readiness and more.

If your big data vendor or favorite company technologist is missing a toolbox or multi-faceted perspective and instead seems to employ a “to a hammer, everything looks like a nail” approach, you might want to look elsewhere for a competing point of view.

Is the Purpose of Analytics Just to Turn a Buck?

Ask just about any company why they are jumpstarting an analytics program and you’ll undoubtedly hear phrases like “We need to reduce costs” or “We must find new customers” or even “We need to shorten our product time-to-market.” And while these are all definitely sound reasons to initiate and nurture an analytics program, there are other rationales beyond “business value” for architecting and implementing an analytical infrastructure and applications.

By TheFixer. Courtesy of Flickr.

A recent Financial Times article mentions how top global business schools are trying to get away from the primacy of “Increasing Shareholder Value.” Indeed, MBA students around the world are generally taught that increasing shareholder value is job number one, and they should do so by cutting costs wherever possible, expanding revenue streams, improving employee productivity and more.

For MBAs, the focus on short term shareholder value is mostly because it’s uncomplicated. “If we can skip the discussions of corporate purpose by stipulating that corporations exist to create shareholder value, then it makes it easier to get down to the more technical details of how we get there,” says Gerald Davis, management professor at University of Michigan.

Bill George, former CEO of Medtronic, has long counseled companies to look past shareholder value as the sole criterion of business success. Instead he says business leaders should consider additional stakeholders of customers, suppliers, employees, and communities when making decisions.

As a business analytics professional, it’s often too easy for me to think about analytics only in the business context (i.e. how they can reduce costs, increase profits, speed time-to-market, improve employee productivity, etc.). In fact, the mission for analytics can easily cross over from the land of shareholder value to safeguarding and improving the well-being and long term sustainability of other stakeholders.

Examples include:

  • National weather services use Hadoop and NoSQL databases to collect data points from global weather stations and satellites, feed data into predictive climate models, and then recommend courses of action to citizens and governments
  • Police departments use analytics to predict “hotspots” of criminal activity based on past incidents to help prevent crime and if not, nab lawbreakers in the act.
  • Governments use real time data collection and analytics to produce readings on local and global air pollution so that citizens can make informed choices about their daily activities.
  • Governments collect and share data on crime and terrorism (and as we’ve seen lately, sometimes a little too well!)
  • Analytics speeds aid relief efforts when natural disasters occur
  • Predictive analytics tracks disease outbreaks in real time
  • Access to open data sets and analytics may help farmers in Africa and elsewhere lift millions out of poverty by producing better crop yields
  • Data scientists are encouraged to share their analytic skills with charities
  • Companies can track food products with supply chain analytics as they move from “field to fork” to promote food safety

These are just some examples of the value of analytics beyond shareholder value creation, and there are hundreds more.

Business schools across the globe are revamping their MBA curriculum to focus on shareholder value to a lesser extent and more on sustainability and value for all stakeholders. Perhaps it’s time to look at the worth analytics can bring through a broader and more significant lens of improving societal value, and not just shareholder profits.

Technologies and Analyses in CBS’ Person of Interest

Person of Interest is a broadcast television show on CBS where a “machine” predicts a person most likely to die within 24-48 hours. Then, it’s up to a mercenary and a data scientist to find that person and help them escape their fate. A straightforward plot really, but not so simple in terms of the technologies and analyses behind the scenes that could make a modern day prediction machine a reality. I have taken the liberty of framing some components that could be part of such a project. Can you help discover more?

In Person of Interest, “the machine” delivers either a single name or group of names predicted to meet an untimely death. However, in order to predict such an event, the machine must collect and analyze reams of big data and then produce a result set, which is then delivered to “Harold” (the computer scientist).

In real life, such an effort would be a massive undertaking on a national basis, much less by state or city. However, let’s set aside the enormity, and the plausibility, of such a scenario and instead see if we can identify various technologies and analyses that could make a modern day “Person of Interest” a reality.

It is useful to think of this analytics challenge in terms of a framework: data sources, data acquisition, data repository, data access and analysis and finally, delivery channels.

First, let’s start with data sources. In Person of Interest, the “machine” collects data from various sources such as interactions from: cameras (images, audio and video), call detail records, voice (landline and mobile), GPS for location data, sensor networks, and text sources (social media, web logs, newspapers, internet etc.). Data sets stored in relational databases that are publicly and not publicly available might also be used for predictive purposes.

Next, data must be assimilated or acquired into a data management repository (most likely a multi-petabyte bank of computer servers). If data are acquired in near real time, they may go into a data warehouse and/or Hadoop cluster (maybe cloud based) for analysis and mining purposes. If data are analyzed in real time, it’s possible that complex event processing technologies (i.e. streams in memory) are used to analyze data “on the fly” and make instant decisions.
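As a rough illustration of analyzing data “on the fly”, the sketch below keeps a sliding window of recent events in memory and raises a flag when too many arrive in a short span. The class name, window length and threshold are assumptions made up for this example; production complex event processing engines are far more elaborate.

```python
from collections import deque


class SlidingWindowAlert:
    """Flags when at least `threshold` events fall inside the last `window_s` seconds."""

    def __init__(self, window_s: float, threshold: int) -> None:
        self.window_s = window_s
        self.threshold = threshold
        self.events: deque[float] = deque()  # event timestamps, oldest first

    def observe(self, timestamp: float) -> bool:
        """Record one event and report whether the window now meets the threshold."""
        self.events.append(timestamp)
        # Evict events that have aged out of the window.
        while self.events and self.events[0] <= timestamp - self.window_s:
            self.events.popleft()
        return len(self.events) >= self.threshold


# Hypothetical stream: three events within ten seconds trigger an alert.
alert = SlidingWindowAlert(window_s=10, threshold=3)
for t in (0, 4, 8, 30):
    print(t, alert.observe(t))
```

Because only the events inside the window are held in memory, the decision is made as each event arrives rather than after a batch load.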

Analysis can be done at various points—during data streaming (CEP), in the data warehouse after data ingest (which could be in just a few minutes), or in Hadoop (batch processed).  Along the way, various algorithms may be running which perform functions such as:

  • Pattern analysis – recognizing and matching voice, video, graphics, or other multi-structured data types. Could be mining both structured and multi-structured data sets.
  • Social network (graph) analysis – analyzing nodes and links between persons. Possibly using call detail records, web data (Facebook, Twitter, LinkedIn and more).
  • Sentiment analysis – scanning text to reveal meaning as in when someone says; “I’d kill for that job” – do they really mean they would murder someone, or is this just a figure of speech?
  • Path analysis – what are the most frequent steps, paths and/or destinations by those predicted to be in danger?
  • Affinity analysis – if person X is in a dangerous situation, how many others just like him/her are also in a similar predicament?
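To give a flavor of the social network (graph) analysis in the list above, here is a minimal sketch that turns call records into nodes and links and ranks people by how many distinct contacts they have. The names and call pairs are invented, and counting contacts is only one of many possible graph measures.

```python
from collections import defaultdict

# Hypothetical call records as (caller, callee) pairs.
calls = [
    ("ann", "bob"),
    ("ann", "cara"),
    ("bob", "cara"),
    ("dev", "ann"),
    ("ann", "bob"),  # repeat calls do not add a new link
]


def build_graph(call_pairs):
    """Build an undirected graph: each person maps to the set of people they spoke with."""
    graph = defaultdict(set)
    for caller, callee in call_pairs:
        graph[caller].add(callee)
        graph[callee].add(caller)
    return graph


def rank_by_contacts(graph):
    """Order people by number of distinct contacts, most connected first (ties by name)."""
    return sorted(graph, key=lambda person: (-len(graph[person]), person))


graph = build_graph(calls)
for person in rank_by_contacts(graph):
    print(person, len(graph[person]))
```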

It’s also possible that an access layer is needed for BI types of reporting, dashboard, or visualization techniques.

Finally, the result set (in this case, the name of the person “the machine” predicts is most likely to be killed in the next twenty-four hours) could be delivered to a device in the field, such as a mobile phone, tablet or computer terminal.

These are just some of the technologies that would be necessary to make a “real life” prediction machine possible, just like in CBS’ Person of Interest. And I haven’t even discussed networking technologies (internet, intranet, compute fabric etc.), or middleware that would also fit in the equation.

What technologies are missing? What types of analysis are also plausible to bring Person of Interest to life? What’s on the list that should not be? Let’s see if we can solve the puzzle together!

How Mobile Operators are Mining Big Data

Mobile phone operators have long mined details on voice and data transactions to measure service quality, place cellular towers in optimal locations and even respond to tariff and rate disputes among various carriers. But that’s just scratching the surface for getting value from mobile data.

Image courtesy of Flickr. Milica Sekulic.

Call detail records (CDR) for mobile transactions are particularly interesting for analysis purposes. According to a Wikipedia entry, CDRs are chock full of useful data for carriers including phone numbers for originator and call receiver, start time, duration, route, call type (voice, SMS, data) among other nuggets. It’s not unusual for mobile operators to mine databases of 100 terabytes (TB) and up to optimize networks, strategically position service personnel, perform customer service requests and more.
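The fields listed above map naturally onto a simple record type. The sketch below is an assumption about how such a record might be modeled, not any carrier's actual schema; the field names, phone numbers and route labels are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CallDetailRecord:
    originator: str       # phone number that started the transaction
    receiver: str         # phone number on the other end
    start_time: datetime
    duration_s: int       # length in seconds (0 for an SMS)
    route: str            # illustrative label for the network path taken
    call_type: str        # "voice", "sms" or "data"


def seconds_by_call_type(records):
    """Total duration per call type, a typical first aggregation over CDRs."""
    totals = Counter()
    for record in records:
        totals[record.call_type] += record.duration_s
    return totals


# Made-up sample records.
records = [
    CallDetailRecord("555-0100", "555-0101", datetime(2013, 6, 1, 9, 0), 120, "tower-A", "voice"),
    CallDetailRecord("555-0100", "555-0102", datetime(2013, 6, 1, 9, 5), 0, "tower-A", "sms"),
    CallDetailRecord("555-0103", "555-0100", datetime(2013, 6, 1, 9, 7), 300, "tower-B", "voice"),
]
print(seconds_by_call_type(records))
```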

And carriers are also starting to discover value in performing social network analysis (SNA) in relational databases and MapReduce/Hadoop platforms to analyze social/relationship connections, find influencers, and, if directed by government authorities, even perform crime syndicate tracking or terrorist network monitoring.

While the types of analysis listed above are becoming commonplace, mobile phone operators are learning a lot more from “Big Data” analysis of everything they’re capturing.

Financial Times writer Gillian Tett explores some of these innovative approaches in a recent article (registration required). Tett notes that with mobile phone subscribers topping out at 2.5 billion in emerging markets alone, mobile carriers, behavioral scientists and governments are learning more about “people’s movements, habits, and ideas.”

For example, Tett cites the 2010 Haitian earthquake where aid workers alongside researchers were able to “track Sim cards inside Haitians’ mobile phones.”  This in turn helped relief agencies analyze where populations dispersed and helped route food and medicine to where it was needed most.

Analyst firm IDC notes that smartphone sales are flying out the door to the tune of 400 million a quarter. With the rise of smartphones, there are also more mapping and location based applications online too. In fact, once billing, usage, location, social network and content-access data all come into view, there will be little left to the imagination to complete a picture of who you are, where you’ve been, what you’re doing, and where you’re predicted to go next.

These types of rich information will be accessed for customer, corporate and societal benefit. However, there’s also ripe potential for misuse. The key question is: is this much ado about nothing, or a data collection spree with an unhappy ending?

Data, Feces and the Future of Healthcare

University of California computer scientist Dr. Larry Smarr is a man on a mission: to measure everything his body consumes, performs, and yes, discharges. For Dr. Smarr, this data collection has a goal: to fine tune his ecosystem in order to beat a potentially incurable disease. Is this kind of rigorous information collection and analysis the future of healthcare?

Talk to a few friends and you’ll probably find those who count calories, steps, or even chart exercise and/or eating regimens. But it’s not very likely that your friends are quantifying their personal lives like Larry Smarr.

Atlantic Magazine’s June/July 2012 issue describes efforts of Dr. Larry Smarr in capturing his personal data – but not necessarily those of financial or internet viewing habits. Dr. Smarr is capturing health data, and lots of it. He uses armbands to record skin temperature, headbands to monitor sleep patterns, has blood drawn eight times a year, MRIs and ultrasounds when needed, and regular colonoscopies. And of course, he writes down every bite of food and also collects his own stool samples and then ships them to a laboratory.

Monitoring calories makes sense, but stools are also “information rich” says Smarr. “There are about 100 billion bacteria per gram. Each bacterium has DNA whose length is typically one to ten megabases—call it one million bytes of information,” Smarr exclaims. “This means human stool has a data capacity of 100,000 terabytes of information (~97 petabytes) stored per gram.” And all kinds of interesting information on the digestive tract, liver and pancreas can be culled from feces including infection, nutrient absorption and even cancer.
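The arithmetic behind that figure is easy to check. The sketch below uses only the two numbers in the quote; the variable names are illustrative, and the one-million-bytes-per-bacterium value is Dr. Smarr's own rounding. It also shows where the roughly 97 petabyte figure appears to come from: dividing terabytes by 1,024 rather than 1,000.

```python
# Figures taken from the quote; names are illustrative.
BACTERIA_PER_GRAM = 100 * 10**9   # "about 100 billion bacteria per gram"
BYTES_PER_BACTERIUM = 10**6       # "call it one million bytes of information"

total_bytes = BACTERIA_PER_GRAM * BYTES_PER_BACTERIUM
terabytes = total_bytes / 10**12          # decimal terabytes
petabytes_decimal = terabytes / 1000      # 1 PB = 1,000 TB
petabytes_by_1024 = terabytes / 1024      # 1 PB treated as 1,024 TB

print(f"{total_bytes:.0e} bytes per gram")
print(f"{terabytes:,.0f} TB per gram")
print(f"{petabytes_decimal:,.0f} PB (decimal) or {petabytes_by_1024:.1f} PB (dividing by 1,024)")
```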

Armed with all this health data, Dr. Smarr is attempting to “model” his ecosystem. This means producing a working model that when fed inputs, can help report, analyze and eventually predict potential health issues. Just as sensor and diagnostic data are useful for auto manufacturers to perform warranty and quality analysis, Dr. Smarr is collecting and analyzing data to fine tune how his human body performs its functions.

But there’s more to the story. In his charting process, Dr. Smarr noticed that his level of C-reactive protein (CRP), which rises in response to inflammation, was high. “Troubled, I showed my graphs to my doctors and suggested that something bad was about to happen,” he says. Dr. Smarr believed his elevated CRP was acting as an early warning system, but his doctors dismissed him as too caught up in finding a problem where there was none.

Two weeks later Dr. Smarr felt a severe pain in the side of his abdomen. This time, the doctors diagnosed him with an acute bout of diverticulitis (bowel inflammation) and told him to take antibiotics. But Dr. Smarr wasn’t convinced. He tested his stools and came up with additional alarming numbers that suggested his diverticulitis was perhaps something more: early Crohn’s disease, an incurable and uncomfortable GI tract condition. The diagnosis of Crohn’s was subsequently confirmed by doctors.

Critics of “measuring everything” in terms of healthcare suggest that by focusing on massive personal data collection and analysis we’ll all turn into hypochondriacs, looking for ghosts in the machine when there are none. Or, as Nassim Taleb argues, the more variables we test, the disproportionately higher the number of spurious results that appear (to be) “statistically significant”. And there is also the argument that predictive analytics may do more harm than good in suggesting potential for illness where a patient may never end up developing a given disease. In other words, correlation is not causation.
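A back-of-the-envelope calculation shows why the number of spurious results grows disproportionately. The sketch below is a simplified textbook illustration rather than Taleb's own derivation: it assumes every variable pair is tested once at a 5% significance level, that the tests are independent, and that no real relationship exists. The function names are made up for this example.

```python
from math import comb


def pairwise_tests(num_variables: int) -> int:
    """Number of distinct variable pairs; grows roughly with the square of the count."""
    return comb(num_variables, 2)


def expected_false_positives(alpha: float, num_tests: int) -> float:
    """Expected count of 'significant' results when nothing real is going on."""
    return alpha * num_tests


def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


for n in (10, 100):
    tests = pairwise_tests(n)
    print(n, "variables:", tests, "pairs,",
          f"{expected_false_positives(0.05, tests):.1f} expected spurious hits")
```

Going from 10 to 100 tracked variables multiplies the number of pairs, and with it the expected spurious hits, by about 110 rather than 10.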

That said, you’d have a hard time convincing Dr. Smarr that patients, healthcare providers and even society at large couldn’t benefit more by quantifying and analyzing inputs and outputs, thus gaining a better understanding of our own “system health”. And fortunately, due to Moore’s Law and today’s software applications, applying brute force computation to our data-rich problems is not only possible, it’s available now.

However, what makes sense conceptually is often much more difficult to implement in the real world. A sluggish healthcare system, data privacy issues, and a lack of data scientists to perform big data analysis are potential roadblocks to the “quantified life” becoming a reality for everyone any time soon.


  • Do data collection and analysis methods as described in this article portend a revolution in healthcare?
  • If everyone rigorously collects and analyzes their personal health data, could this end up raising or reducing overall healthcare costs?

Can Big Data Analytics Solve “Too Big to Fail” Banking Complexity?

Despite investing millions upon millions of dollars in information technology systems, analytical modeling and PhD talent sourced from the best universities, global banks still have difficulty understanding their own business operations and investment risks, much less complex financial markets. Can “Big Data” technologies such as MapReduce/Hadoop, or even more mature technologies like BI/Data Warehousing help banks make better sense of their own complex internal systems and processes, much less tangled and interdependent global financial markets?

Courtesy of Flickr

British physicist and cosmologist Stephen Hawking said in 2000: “I think the next century will be the century of complexity.” He wasn’t kidding.

While Hawking was surely speaking of science and technology, there’s little doubt he’d also look at global financial markets and financial players (hedge funds, banks, institutional and individual investors and more) as a very complex system.

With hundreds of millions of hidden connections and interdependencies, hundreds of thousands of various hard-to-understand financial products, and millions if not billions of “actors” each with their own agenda, global financial markets are the perfect example of extreme complexity.  In fact, the global financial system is so complex that even attempts to analytically model and predict markets may have worked for a point in time, but ultimately failed to help companies manage their investment risks.

Some argue that complexity in markets might be deciphered through better reporting and transparency.  If every financial firm were required to provide deeper transparency into their positions, transactions, and contracts, then might it be possible for regulators to more thoroughly police markets?

Financial Times writer Gillian Tett has been reading the published work of Professor Henry Hu at University of Texas. In Tett’s article, “How ‘too big to fail’ banks have become ‘too complex to exist’” (registration required), she says that Professor Hu argues technological advances and financial innovation (i.e. derivatives) have made financial instruments and flows too difficult to map. Moreover, Hu believes financial intermediaries themselves are so complex that they’ll continually have difficulty making sense of shifting markets.

Is a “too big to fail” situation exacerbated by a “too complex to exist” problem? And can technological advances such as further adoption of MapReduce or Hadoop platforms be considered a potential savior?  Hu seems to believe that supercomputers and more raw economic data might be one way to better understand complex financial markets.

However, even if massive data sets can be better searched, counted, aggregated and reported with MapReduce/Hadoop platforms, superior cognitive skills are necessary to make sense of outputs and then make recommendations and/or take actions based on findings. This kind of talent is in short supply.

It’s even highly likely the scope of complexity in financial markets is beyond today’s technology to compute, sort and analyze. And if that supposition is true, should next steps be to take measures to moderate if not minimize additional complexity?


  • Are “Big Data” analytics the savior to mapping complex and global financial flows?
  • Is the global financial system—with its billions of relationships and interdependencies—past the point of understanding and prediction with mathematics and today’s compute power?

Are Data Scientists the Next Masters of the Universe?

Back in the late 1970s, traders buying and selling mortgages were pushed aside for new masters of the universe: “quants”, or individuals who used mathematics to slice and dice mortgages into debt tranches. And in the same way, today’s traditional Business Intelligence (BI) professionals must be looking over their collective shoulders as business and IT publications tout the emerging role of “data scientist”.

Before Lew Ranieri came on the scene, mortgages were a very staid business. Banks would loan money and keep assets on the books for up to thirty years (depending on how quickly the loan was paid back). Except for underwriting skills, there wasn’t much complexity to the mortgage business.

As a trader for Salomon Brothers, Lew Ranieri changed all that. Ranieri’s insight was that mortgages could be bundled together and then sliced into different tranches of varied risk. This slicing exercise was quite complex because of a buyer’s ability to prepay their loans early or refinance. Michael Lewis, of Liar’s Poker fame, writes: “Mortgages were acknowledged to be the most mathematically complex securities in the marketplace. The complexity arose entirely out of the option the homeowner has to prepay his loan…mortgages were about math.”

Suddenly the very boring business of home loans became a very complex business challenge in how to slice the pie based on risk profiles and cash flows from interest and principal. Lewis writes: “Different investors place different prices on risk. Risk could be canned and sold like tomatoes.” And this mathematical complexity demanded a new skill set, quantitative analysis, to perform the necessary mathematical modeling to ensure investment banks remained profitable in this new business.
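To see what “slicing the pie” means in the simplest possible terms, consider a toy sequential-pay waterfall: cash collected from the mortgage pool pays the most senior tranche first, and whatever is left flows down to the riskier tranches. This is a deliberately stripped-down sketch with invented tranche names and amounts; it ignores prepayment, interest and timing, which is exactly where the real mathematical complexity lived.

```python
def waterfall(cash_collected: float, tranches: list[tuple[str, float]]) -> dict[str, float]:
    """Distribute pool cash to tranches in order of seniority.

    `tranches` lists (name, amount_owed) pairs, most senior first.
    """
    payments = {}
    remaining = cash_collected
    for name, owed in tranches:
        paid = min(remaining, owed)  # a tranche never receives more than it is owed
        payments[name] = paid
        remaining -= paid
    return payments


# Hypothetical pool: 100 owed in total, but only 80 collected this period.
tranches = [("senior", 60.0), ("mezzanine", 30.0), ("junior", 10.0)]
print(waterfall(80.0, tranches))
```

In this example the senior tranche is paid in full while the junior tranche absorbs the shortfall first, which is why investors price each slice differently.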

Pushed out by a new breed of mathematical whizz-kids, many former investment bankers and traders either retired or left for smaller financial firms. And the rise of the quants—or the new masters of the universe—was complete by the mid-1980s.

Is a similar shift happening in the field of Business Intelligence with the emerging “data scientist” role? The skill set of today’s data scientist is much more robust than one who solely performs BI or ETL application development.  With new sources and types of data (i.e. multi-structured), the data scientist must be able to develop new data driven products such as churn models, create recommendation algorithms, assist marketers with behavioral segmentation and targeting and more.

But that’s not all. Fellow SmartDataCollective contributor Daniel Tunkelang says the data scientist “also needs to possess creativity and strong communication skills. Creativity drives the process of hypothesis generation, i.e., picking the right problems to solve that will create value for users and drive business decisions.” It’s a tall order to find all these skill sets in one person, much less build an internal competency center with such talent.

Perhaps for the foreseeable future, there’s room for both traditional BI professionals and the new breed of data scientists, as today both are valuable contributors in the field of analytics. However, with data growth on a fast paced exponential curve, much less the complexity and velocity of multi-structured data, it’s easy to see how the mix of skill sets to succeed in the future will tilt more in favor of the data scientist role.

The mortgage bankers never saw Lew Ranieri coming. Regarding the rise of data scientists—should traditional BI professionals be worried?