Is Your IT Architecture Ready for Big Data?

Built in the 1950s, California’s aqueduct is an engineering marvel that transports water from Northern California mountain ranges to thirsty coastal communities. But faced with a potentially lasting drought, California’s aqueduct is running below capacity because there’s not enough water coming from its sources. With big data, just the opposite is likely happening in your organization—too much of it, overflowing the riverbanks and wreaking havoc. And it’s only going from bad to worse.

Courtesy of Flickr. Creative Commons. By Herr Hans Gruber

The California aqueduct is a thing of beauty. As described in an Atlantic magazine article:

“A network of rivers, tributaries, and canals deliver runoff from the Sierra Mountain Range’s snowpack to massive pumps at the southern end of the San Joaquin Delta.” From there, these hydraulic pumps push water to California cities via a forty-four-mile aqueduct that traverses the state and empties into various local reservoirs.

You likely have something analogous to a big data aqueduct in your organization. For example, source systems throw off data in various formats, which probably go through some refining process and end up in relational format. Excess digital exhaust is conceivably kept in compressed storage onsite or at a remote location. It’s a continuous process whereby data are ingested, stored, moved, processed, monitored and analyzed throughout your organization.
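
To make the aqueduct analogy concrete, here is a minimal sketch of that ingest-refine-store flow in Python. The field names, source format and SQLite landing table are all hypothetical stand-ins for whatever your own systems emit:

```python
import json
import sqlite3

def ingest(raw_lines):
    """Parse semi-structured source records (here, JSON lines) into dicts."""
    for line in raw_lines:
        yield json.loads(line)

def refine(records):
    """Normalize each record into a fixed relational shape."""
    for rec in records:
        yield (rec.get("event_id"), rec.get("source"), float(rec.get("value", 0.0)))

def load(rows, db_path="warehouse.db"):
    """Land the refined rows in a relational table."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS events (event_id TEXT, source TEXT, value REAL)")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()

# Two invented source records, flowing through the pipeline end to end.
raw = ['{"event_id": "e1", "source": "web", "value": "3.2"}',
       '{"event_id": "e2", "source": "sensor", "value": "7.9"}']
load(refine(ingest(raw)))
```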

But with big data, there’s simply too much of it coming your way. Author James Gleick describes it this way: “The information produced and consumed by humankind used to vanish—that was the norm, the default. The sights, the sounds, the songs, the spoken word just melted away. Now expectations have inverted. Everything may be recorded and preserved, at least potentially: every musical performance; every crime in a shop, elevator, or city street; every volcano or tsunami on the remotest shore.” In short, everything that can be recorded is fair game, and likely sits on a server somewhere in the world.

In short, the IT architecture that got us here isn’t going to handle the immense data flood coming our way without a serious upgrade in capability and alignment.

IT architecture can essentially be thought of as a view from above, or a blueprint of various structures and components and how they function together. In this context, we’re concerned with what an overall blueprint of business, information, applications and systems looks like today and what it needs to look like to meet future business needs.

We need a rethink of our architectural approaches for big data. To be sure, some companies—maybe 10%—will never need to harness multi-structured data types. They may never need to dabble with or implement open source technologies. To recommend some sort of “big data” architecture for these companies is counter-productive.

However, the other 90% of companies are waking up and realizing that today’s IT architecture and infrastructure won’t meet their future needs. These companies desperately need to assess their current situation and future business needs, and then design an architecture that will deliver insights from all data types, not just those that fit neatly into relational rows and columns.

The big data onslaught will continue for the foreseeable future, growing only more intense as data volumes increase exponentially. But here’s the challenge: the human mind tends to think linearly—we simply don’t know how to plan for, much less capitalize on, exponential data growth. As such, the business, information, application and systems infrastructures at most companies aren’t equipped to cope with, much less harness, the coming big data flood.
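
A toy calculation illustrates the trap. Suppose data volumes double every two years while capacity is added on a flat, linear plan; the starting points below are illustrative only:

```python
# Linear capacity planning vs. exponential data growth.
# Assumed numbers: 1 PB of data today, data doubling every two years,
# capacity added at a flat 0.5 PB per year.
data_pb, capacity_pb = 1.0, 1.0
for year in range(1, 11):
    data_pb *= 2 ** 0.5   # doubling every two years
    capacity_pb += 0.5    # linear planning
    flag = "  <-- capacity exhausted" if data_pb > capacity_pb else ""
    print(f"year {year:2d}: data {data_pb:6.1f} PB vs capacity {capacity_pb:4.1f} PB{flag}")
```

Linear planning keeps up for a few years, then falls behind for good; that is exactly the gap that catches linear thinkers off guard.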

Want to be prepared? It’s important to take a fresh look at your existing IT architecture—and make sure that your data management, data processing, development tools, integration and analytic systems are up to snuff. And whatever your future plans are, consider doubling down on them.

Until convincing proof shows otherwise, it’s simply too risky not to have a well-thought-out plan for the stormy days of too much big data ahead.

Changing Your Mind About Big Data Isn’t Dumb

After all the hype about big data and its technological cousin Hadoop, some CIOs are getting skittish about investing additional money in a big data program without a clear business case. Indeed, in terms of big data it’s OK to step back and think critically about what you’re doing, pause your programs for a time if necessary, and—yes—even change your mind about big data.

Courtesy of Flickr. Creative Commons. By Steven Depolo

Economist and former Federal Reserve chairman Alan Greenspan has changed his mind many times. In a Financial Times article, columnist Gillian Tett chronicles Greenspan’s multiple positions on the value of gold. Tett says that in his formative years, Greenspan was fascinated with the idea of the gold standard (i.e. pegging the value of a currency to a given amount of gold), but later was a staunch defender of fiat currencies. And now, in his sunset years, Greenspan has shifted again, saying: “Gold is a currency. It is still, by all evidence, a premier currency. No fiat currency, including the dollar, can match it.”

To me at least, Greenspan’s fluctuating positions on gold reflect a mind that continually adapts to new information. Some would view Greenspan as a “waffler,” someone who cannot make up his mind. I don’t see it that way. Changing your mind isn’t a sign of weakness; rather, it shows pragmatic, adaptive thinking that shifts as market or business conditions change.

So what does any of this have to do with big data? While big data and its associated technologies have enjoyed plenty of hype, a new reality is setting in about the work required to get more value from big data investments.

Take, for example, a Barclays survey in which a large percentage of CIOs were “uncertain”—thus far—about the value of Hadoop, citing the ongoing costs of support and training, the difficulty of hiring hard-to-find operations and development staff, and the work necessary to integrate Hadoop with existing enterprise systems.

In another survey, this one of 111 U.S. data scientists sponsored by Paradigm4, twenty-two percent of those surveyed said Hadoop and Spark were not well suited to their analytics. In the same survey, thirty-five percent of data scientists who had tried Hadoop or Spark had stopped using it.

And earlier in the year, Gartner analyst Svetlana Sicular noted that big data has fallen into Gartner’s trough of disillusionment, commenting: “My most advanced with Hadoop clients are also getting disillusioned…these organizations have fascinating ideas, but they are disappointed with a difficulty of figuring out reliable solutions.”

With all this in mind, I think it makes sense to take a step back and assess your big data progress. If you are one of those early Hadoop adopters, it’s a good time to examine your current program, report on results, and test them against any return on investment projections (hard dollar or soft benefits) you’ve made. Or maybe you have never formalized a business case for big data? Here’s your chance to work up that business case, because future capital investments will likely depend on it.

In fact, now’s the perfect opportunity for deeper thinking on your big data investments. It’s time to go beyond the big data pilot and put effort into strategies for integrating these pilots with the rest of your enterprise systems. And it’s also time to think long and hard about how to make your analytics “consumable by the masses”, or in other words, accessible to many more business users than those currently using your systems.

And maybe you are in the camp of charting a different course for big data investments. Perhaps business conditions aren’t quite right at the moment, or an executive shift warrants a six-month reprieve to focus on other core items. If this is your situation, it might not be a bad idea to let an ever-changing big data technology and vendor landscape shake out a bit before jumping back in.

To be clear, there’s no suggestion—whatsoever—to abandon your plans to harness big data. Now that would be dumb. But much like Alan Greenspan’s shifting opinions on gold, it’s also perfectly OK to re-assess your current position, and chart a more pragmatic and flexible course towards big data results.

Beware Big Data Technology Zealotry

Undoubtedly you’ve heard it all before: “Hadoop is the next big thing, why waste your time with a relational database?” or “Hadoop is really only good for the following things” or “Our NoSQL database scales, other solutions don’t.” Invariably, there are hundreds of additional arguments proffered by big data vendors and technology zealots inhabiting organizations just like yours. However, there are few crisp binary choices in technology decision making, especially in today’s heterogeneous big data environments.

Courtesy of Flickr. Creative Commons. By Eden, Janine, and Jim.

Teradata CTO Stephen Brobst has a great story about a Stanford technology conference he attended. Apparently in one session there were “shouting matches” between relational database and Hadoop fanatics as to which technology better served customers going forward. Mr. Brobst wasn’t amused, concluding: “As an engineer, my view is that when you see this kind of religious zealotry on either side, both sides are wrong. A good engineer is happy to use good ideas wherever they come from.”

Considering various technology choices for your particular organization is a multi-faceted decision making process. For example, suppose you are investigating a new application and/or database for a mission critical job. Let’s also suppose your existing solution is working “good enough”. However, the industry pundits, bloggers and analysts are hyping and luring you towards the next big thing in technology. At this point, alarm bells should be ringing. Let’s explore why.

First, for companies that are not start-ups, the idea of ripping and replacing an existing and working solution should give every CIO and CTO pause. The use cases enabled by this new technology must significantly stand out.

Second, unless your existing solution is fully depreciated (for on-premises, hardware-based solutions), you’re going to have a tough time getting past your CFO. Regardless of your situation, you’ll need compelling calculations for total cost of ownership (TCO), internal rate of return (IRR) and return on investment (ROI).
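
As a back-of-the-envelope illustration, here is one way to frame those numbers in Python. The cash flows are hypothetical placeholders, not vendor quotes; the point is simply that the CFO conversation turns on figures like these:

```python
# Hypothetical project economics: a 500K outlay in year 0, then annual
# net benefits. Replace with your own TCO-adjusted estimates.
cash_flows = [-500_000, 150_000, 200_000, 250_000, 250_000]

def npv(rate, flows):
    """Net present value of a series of annual cash flows."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(flows))

def irr(flows, lo=-0.99, hi=10.0, tol=1e-6):
    """Internal rate of return via bisection on NPV's sign change
    (valid for a conventional outlay-then-benefits profile)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(mid, flows) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

total_benefit = sum(cash_flows[1:])
roi = (total_benefit + cash_flows[0]) / -cash_flows[0]  # net gain over investment
print(f"IRR: {irr(cash_flows):.1%}, simple ROI: {roi:.1%}")
```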

Third, you will need to investigate whether your company has the skill sets to develop and operate this new environment, or whether they are readily available from outside vendors.

Fourth, consider your risk tolerance or appetite for failure—as in, if this new IT project fails—will it be considered a “drop in the bucket” or could it take down the entire company?

Finally, consider whether you’re succumbing to technology zealotry pitched by your favorite vendor or internal technologist. Oftentimes in technology decision making, the better choice is “and,” not “either/or.”

For example, more companies are adopting a heterogeneous technology environment for unified information where multiple technologies and approaches work together in unison to meet various needs for reporting, dashboards, visualization, ad-hoc queries, operational applications, predictive analytics, and more. In essence, think more about synergies and inter-operability, not isolated technologies and processes.

As a counterpoint, some will argue that technology capabilities increasingly overlap, and that with a heterogeneous approach companies might be paying for some features twice. It is true that the lines are blurring: some of today’s relational databases can accept and process JSON (previously the purview of NoSQL databases), queries and BI reports can run on Hadoop, and “discovery work” can complete on multiple platforms. However, considering the maturity and design of the various competing big data solutions, it does not appear—for the immediate future—that one size will fit all.
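
The JSON point is easy to demonstrate. The sketch below uses SQLite’s JSON functions (available in most modern builds via the JSON1 extension) to store and query a JSON document inside an ordinary relational table; the table and payload are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("INSERT INTO events (payload) VALUES (?)",
             ('{"user": "alice", "action": "login", "device": "mobile"}',))

# json_extract pulls a field out of the stored document, relationally.
row = conn.execute(
    "SELECT json_extract(payload, '$.user') FROM events "
    "WHERE json_extract(payload, '$.action') = 'login'"
).fetchone()
print(row[0])  # -> alice
```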

When it comes to selecting big data technologies, objectivity and flexibility are paramount. You’ll have to settle on technologies based on your unique business and use cases, risk tolerance, financial situation, analytic readiness and more.

If your big data vendor or favorite company technologist is missing a toolbox or multi-faceted perspective and instead seems to employ a “to a hammer, everything looks like a nail” approach, you might want to look elsewhere for a competing point of view.

Is the Purpose of Analytics Just to Turn a Buck?

Ask just about any company why they are jumpstarting an analytics program and you’ll undoubtedly hear phrases like “We need to reduce costs” or “We must find new customers” or even “We need to shorten our product time-to-market.” And while these are all definitely sound reasons to initiate and nurture an analytics program, there are other rationales beyond “business value” for architecting and implementing an analytical infrastructure and applications.

By TheFixer. Courtesy of Flickr.

A recent Financial Times article mentions how top global business schools are trying to move away from the primacy of “increasing shareholder value.” Indeed, MBA students around the world are generally taught that increasing shareholder value is job number one, and that they should do so by cutting costs wherever possible, expanding revenue streams, improving employee productivity and more.

For MBAs, the focus on short term shareholder value is mostly because it’s uncomplicated. “If we can skip the discussions of corporate purpose by stipulating that corporations exist to create shareholder value, then it makes it easier to get down to the more technical details of how we get there,” says Gerald Davis, management professor at University of Michigan.

Bill George, former CEO of Medtronic, has long counseled companies to look past shareholder value as the sole criterion of business success. Instead, he says, business leaders should consider additional stakeholders (customers, suppliers, employees, and communities) when making decisions.

As a business analytics professional, it’s often too easy for me to think about analytics strictly in the business context (i.e. how they can reduce costs, increase profits, speed time-to-market, improve employee productivity, etc.). But the mission for analytics can easily cross over from the land of shareholder value to safeguarding and improving the well-being and long-term sustainability of other stakeholders.

Examples include:

  • National weather services use Hadoop and NoSQL databases to collect data points from global weather stations and satellites, feed the data into predictive climate models, and then recommend courses of action to citizens and governments.
  • Police departments use analytics to predict “hotspots” of criminal activity based on past incidents, helping prevent crime or, failing that, nab lawbreakers in the act.
  • Governments use real-time data collection and analytics to produce readings on local and global air pollution so that citizens can make informed choices about their daily activities.
  • Governments collect and share data on crime and terrorism (and as we’ve seen lately, sometimes a little too well!).
  • Analytics speeds aid relief efforts when natural disasters occur.
  • Predictive analytics tracks disease outbreaks in real time.
  • Access to open data sets and analytics may help farmers in Africa and elsewhere lift millions out of poverty by producing better crop yields.
  • Data scientists are encouraged to share their analytic skills with charities.
  • Companies can track food products with supply chain analytics as they move from “field to fork” to promote food safety.

These are just some examples of the value of analytics beyond shareholder value creation, and there are hundreds more.

Business schools across the globe are revamping their MBA curricula to focus less on shareholder value and more on sustainability and value for all stakeholders. Perhaps it’s time to view the worth of analytics through a broader, more significant lens of improving societal value, and not just shareholder profits.

Technologies and Analyses in CBS’ Person of Interest

Person of Interest is a broadcast television show on CBS in which a “machine” predicts the person most likely to die within 24-48 hours. Then it’s up to a mercenary and a data scientist to find that person and help them escape their fate. A straightforward plot, really, but not so simple in terms of the technologies and analyses behind the scenes that could make a modern-day prediction machine a reality. I have taken the liberty of framing some components that could be part of such a project. Can you help discover more?

Courtesy of CBS.

In Person of Interest, “the machine” delivers either a single name or a group of names predicted to meet an untimely death. However, in order to predict such an event, the machine must collect and analyze reams of big data and then produce a result set, which is then delivered to “Harold” (the computer scientist).

In real life, such an effort would be a massive undertaking on a national basis, much less by state or city. However, let’s dispense with the enormity—or plausibility—of such a scenario and instead see if we can identify the various technologies and analyses that could make a modern-day “Person of Interest” a reality.

It is useful to think of this analytics challenge in terms of a framework: data sources, data acquisition, data repository, data access and analysis, and finally delivery channels.

First, let’s start with data sources. In Person of Interest, the “machine” collects data from various sources: cameras (images, audio and video), call detail records, voice (landline and mobile), GPS location data, sensor networks, and text (social media, web logs, newspapers, the internet, etc.). Data sets stored in relational databases, both publicly available and restricted, might also be used for predictive purposes.

Next, data must be assimilated or acquired into a data management repository (most likely a multi-petabyte bank of computer servers). If data are acquired in near real time, they may go into a data warehouse and/or Hadoop cluster (perhaps cloud based) for analysis and mining purposes. If data are analyzed in real time, it’s possible that complex event processing (CEP) technologies, which analyze streams in memory, are used to examine data “on the fly” and make instant decisions.
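
To give a flavor of the streaming piece, here is a minimal in-memory sketch of a CEP-style check in Python: hold a sliding window of recent events and fire a decision the moment a window-level condition is met, without landing the data in storage first. The events, severity field and thresholds are all invented:

```python
from collections import deque

WINDOW, THRESHOLD = 5, 3  # illustrative window size and alarm threshold

def detect(stream):
    """Yield an alert whenever too many high-severity events land
    inside the sliding window."""
    window = deque(maxlen=WINDOW)
    for event in stream:
        window.append(event)
        alarms = sum(1 for e in window if e["severity"] == "high")
        if alarms >= THRESHOLD:
            yield f"alert at event {event['id']}: {alarms} high-severity events in last {len(window)}"

# Invented event stream: every other event is high severity.
events = [{"id": i, "severity": "high" if i % 2 else "low"} for i in range(12)]
for alert in detect(events):
    print(alert)
```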

Analysis can be done at various points—during data streaming (CEP), in the data warehouse after data ingest (which could take just a few minutes), or in Hadoop (batch processed). Along the way, various algorithms may be running to perform functions such as:

  • Pattern analysis – recognizing and matching voice, video, graphics, or other multi-structured data types, potentially mining both structured and multi-structured data sets.
  • Social network (graph) analysis – analyzing nodes and links between persons, possibly using call detail records and web data (Facebook, Twitter, LinkedIn and more).
  • Sentiment analysis – scanning text to reveal meaning, as when someone says “I’d kill for that job” – do they really mean they would murder someone, or is this just a figure of speech?
  • Path analysis – what are the most frequent steps, paths and/or destinations of those predicted to be in danger? (See the sketch after this list.)
  • Affinity analysis – if person X is in a dangerous situation, how many others just like him/her are in a similar predicament?
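
As one concrete example, the path analysis above can be approximated with nothing more than a counter over consecutive location pairs. The trace below is invented:

```python
from collections import Counter

# A toy location trace for one person of interest.
trace = ["home", "subway", "office", "deli", "office", "subway", "home",
         "subway", "office", "deli", "office"]

# Count consecutive movement pairs to surface the most frequent paths.
pairs = Counter(zip(trace, trace[1:]))
for (src, dst), n in pairs.most_common(3):
    print(f"{src} -> {dst}: {n} times")
```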

It’s also possible that an access layer is needed for BI-style reporting, dashboards, or visualization.

Finally, the result set—in this case, the name of the person “the machine” predicts is most likely to be killed in the next twenty-four hours—could be delivered to a device in the field: a mobile phone, tablet, computer terminal, etc.

These are just some of the technologies that would be necessary to make a “real life” prediction machine possible, just like in CBS’ Person of Interest. And I haven’t even discussed the networking technologies (internet, intranet, compute fabric, etc.) or middleware that would also fit into the equation.

What technologies are missing? What types of analysis are also plausible to bring Person of Interest to life? What’s on the list that should not be? Let’s see if we can solve the puzzle together!

How Mobile Operators are Mining Big Data

Mobile phone operators have long mined details of voice and data transactions to measure service quality, place cellular towers in optimal locations and even respond to tariff and rate disputes among carriers. But that’s just scratching the surface of the value in mobile data.

Image courtesy of Flickr. Milica Sekulic.

Call detail records (CDRs) for mobile transactions are particularly interesting for analysis purposes. According to a Wikipedia entry, CDRs are chock-full of useful data for carriers, including the phone numbers of the originator and receiver, start time, duration, route, and call type (voice, SMS, data), among other nuggets. It’s not unusual for mobile operators to mine databases of 100 terabytes (TB) and up to optimize networks, strategically position service personnel, handle customer service requests and more.
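
A toy cut of that kind of mining might look like the pandas sketch below, which totals call volume by cell to show where capacity and service staff are needed. The field names are illustrative; real CDR schemas vary by vendor:

```python
import pandas as pd

# A handful of invented CDRs.
cdrs = pd.DataFrame({
    "caller":    ["555-0101", "555-0102", "555-0101", "555-0103", "555-0102"],
    "cell_id":   ["A12", "A12", "B07", "A12", "B07"],
    "duration":  [120, 300, 45, 610, 90],  # seconds
    "call_type": ["voice", "voice", "sms", "voice", "data"],
})

# Busiest cells by total connection time: a crude proxy for network load.
by_cell = (cdrs.groupby("cell_id")["duration"]
               .agg(["count", "sum"])
               .sort_values("sum", ascending=False))
print(by_cell)
```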

And carriers are also starting to discover value in performing social network analysis (SNA) on relational databases and MapReduce/Hadoop platforms to analyze social and relationship connections, find influencers, and—if directed by government authorities—even track crime syndicates or monitor terrorist networks.
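
In miniature, that kind of SNA can be sketched with the networkx library: build a graph from who-called-whom pairs and rank people by degree centrality, one crude proxy for influence. The call pairs are invented:

```python
import networkx as nx

# Invented who-called-whom pairs distilled from CDRs.
calls = [("ann", "bob"), ("ann", "carol"), ("ann", "dave"),
         ("bob", "carol"), ("eve", "dave")]

g = nx.Graph()
g.add_edges_from(calls)

# Degree centrality: the fraction of the network each person connects to.
centrality = nx.degree_centrality(g)
for person, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{person}: {score:.2f}")
```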

While the types of analysis listed above are becoming commonplace, mobile phone operators are learning a lot more from “Big Data” analysis of everything they’re capturing.

Financial Times writer Gillian Tett explores some of these innovative approaches in a recent article (registration required). Tett notes that with mobile phone subscriptions topping 2.5 billion in emerging markets alone, mobile carriers, behavioral scientists and governments are learning more about “people’s movements, habits, and ideas.”

For example, Tett cites the 2010 Haitian earthquake, where aid workers and researchers were able to “track Sim cards inside Haitians’ mobile phones.” This helped relief agencies analyze where populations dispersed and route food and medicine to where they were needed most.

Analyst firm IDC notes that smartphones are flying out the door to the tune of 400 million a quarter. With the rise of smartphones, there are also more mapping and location-based applications online. In fact, when billing, usage, location, social networks, and even the content accessed all come into view, there will be little left to the imagination in completing a picture of who you are, where you’ve been, what you’re doing, and where you’re predicted to go next.

This rich information will be accessed for customer, corporate and societal benefit. However, there’s also ripe potential for misuse. The key question is: is this much ado about nothing, or a data collection spree with an unhappy ending?

Data, Feces and the Future of Healthcare

University of California computer scientist Dr. Larry Smarr is a man on a mission—to measure everything his body consumes, performs, and yes, discharges. For Dr. Smarr, this data collection has a goal: to fine-tune his ecosystem in order to beat a potentially incurable disease. Is this kind of rigorous information collection and analysis the future of healthcare?

Talk to a few friends and you’ll probably find those who count calories or steps, or even chart exercise and/or eating regimens. But it’s not very likely that your friends are quantifying their personal lives like Larry Smarr.

The Atlantic’s June/July 2012 issue describes Dr. Smarr’s efforts to capture his personal data: not his financial or internet viewing habits, but health data, and lots of it. He uses armbands to record skin temperature, headbands to monitor sleep patterns, has blood drawn eight times a year, gets MRIs and ultrasounds when needed, and undergoes regular colonoscopies. And of course, he writes down every bite of food and even collects his own stool samples, shipping them to a laboratory.

Monitoring calories makes sense, but stools are also “information rich,” says Smarr. “There are about 100 billion bacteria per gram. Each bacterium has DNA whose length is typically one to ten megabases—call it one million bytes of information,” Smarr explains. “This means human stool has a data capacity of 100,000 terabytes (~97 petabytes) of information stored per gram.” And all kinds of interesting information on the digestive tract, liver and pancreas can be culled from feces, including signs of infection, nutrient absorption and even cancer.
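
The back-of-the-envelope math behind that quote is easy to verify; the gap between decimal and binary units appears to account for the petabyte figure hovering just under 100:

```python
# Checking Smarr's arithmetic: ~100 billion bacteria per gram, each carrying
# roughly one million bytes of DNA sequence information.
bacteria_per_gram = 100e9
bytes_per_bacterium = 1e6

total_bytes = bacteria_per_gram * bytes_per_bacterium
print(f"{total_bytes / 1e12:,.0f} terabytes per gram")  # 100,000 TB, as quoted
# The "(~97 petabytes)" aside matches a binary-unit conversion:
# 100,000 TB / 1,024 TB-per-PB ~= 97.7 PB.
print(f"{total_bytes / 1e12 / 1024:,.1f} petabytes per gram (binary conversion)")
```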

Armed with all this health data, Dr. Smarr is attempting to “model” his ecosystem. This means producing a working model that, when fed inputs, can help report, analyze and eventually predict potential health issues. Just as sensor and diagnostic data help auto manufacturers perform warranty and quality analysis, Dr. Smarr is collecting and analyzing data to fine-tune how his body performs its functions.
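
A heavily simplified sketch of that early-warning idea: flag any biomarker reading that jumps well above its recent rolling baseline. The CRP values and thresholds below are invented for illustration, not medical guidance:

```python
# Invented CRP readings (mg/L) over successive tests; the last two spike.
readings = [1.1, 0.9, 1.2, 1.0, 1.3, 1.1, 4.8, 5.5]

def alerts(series, window=5, multiplier=2.5):
    """Yield readings that exceed a multiple of their rolling baseline."""
    for i in range(window, len(series)):
        baseline = sum(series[i - window:i]) / window
        if series[i] > multiplier * baseline:
            yield i, series[i], baseline

for i, value, baseline in alerts(readings):
    print(f"test {i}: CRP {value} vs baseline {baseline:.2f} -> worth a closer look")
```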

But there’s more to the story. In his charting process, Dr. Smarr noticed his C-reactive protein (CRP) count was high—a count that rises in response to inflammation. “Troubled, I showed my graphs to my doctors and suggested that something bad was about to happen,” he says. But while Smarr believed his elevated CRP count was acting as an early warning system, his doctors dismissed him as too caught up in finding a problem where there was none.

Two weeks later Dr. Smarr felt a severe pain in the side of his abdomen. This time, the doctors diagnosed him with an acute bout of diverticulitis (bowel inflammation) and told him to take antibiotics. But Dr. Smarr wasn’t convinced. He tested his stools and came up with additional alarming numbers suggesting his diverticulitis was perhaps something more: early Crohn’s disease, an incurable and uncomfortable GI tract condition. The diagnosis of Crohn’s was subsequently confirmed by doctors.

Critics of “measuring everything” in healthcare suggest that a focus on massive personal data collection and analysis will turn us all into hypochondriacs, looking for ghosts in the machine when there are none. Or, as Nassim Taleb argues, the more variables we test, the disproportionately higher the number of spurious results that appear to be “statistically significant.” There is also the argument that predictive analytics may do more harm than good by suggesting the potential for illness in a patient who may never develop the disease. Correlation is not causation, in other words.

That said, you’d have a hard time convincing Dr. Smarr that patients, healthcare providers and even society at large couldn’t benefit by quantifying and analyzing inputs and outputs, thus gaining a better understanding of our own “system health.” And fortunately, thanks to Moore’s Law and today’s software applications, the brute-force computation our data-rich problems demand is not just possible—it’s available now.

However, what makes sense conceptually is often much harder to implement in the real world. A sluggish healthcare system, data privacy issues, and a lack of data scientists to perform big data analysis are all potential roadblocks to seeing the “quantified life”—for everyone—become a reality any time soon.

Questions:

  • Do the data collection and analysis methods described in this article portend a revolution in healthcare?
  • If everyone rigorously collects and analyzes their personal health data, could this end up raising or reducing overall healthcare costs?