Technologies and Analyses in CBS’ Person of Interest

Person of Interest is a broadcast television show on CBS in which a “machine” predicts the person most likely to die within 24 to 48 hours. It is then up to a mercenary and a data scientist to find that person and help them escape their fate. A straightforward plot, really, but the technologies and analyses that could make a modern-day prediction machine a reality are anything but simple. I have taken the liberty of framing some components that could be part of such a project. Can you help discover more?

In Person of Interest, “the machine” delivers either a single name or a group of names predicted to meet an untimely death. However, in order to predict such an event, the machine must collect and analyze reams of big data and then produce a result set, which is then delivered to “Harold” (the computer scientist).

In real life, such an effort would be a massive undertaking for a single city or state, let alone an entire nation. However, let’s set aside the enormity and plausibility of such a scenario and instead see if we can identify various technologies and analyses that could make a modern-day “Person of Interest” a reality.

It is useful to think of this analytics challenge in terms of a framework: data sources, data acquisition, data repository, data access and analysis, and finally, delivery channels.

First, let’s start with data sources. In Person of Interest, the “machine” collects data from various sources: cameras (images, audio and video), call detail records, voice (landline and mobile), GPS for location data, sensor networks, and text sources (social media, web logs, newspapers, the internet, etc.). Data sets stored in relational databases, both publicly available and not, might also be used for predictive purposes.

Next, data must be assimilated or acquired into a data management repository (most likely a multi-petabyte bank of computer servers). If data are acquired in near real time, they may go into a data warehouse and/or Hadoop cluster (maybe cloud based) for analysis and mining purposes. If data are analyzed in real time, it’s possible that complex event processing (CEP) technologies, which work on streams in memory, are used to analyze data “on the fly” and make instant decisions.
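
To make the “on the fly” idea a little more concrete, here is a minimal sketch of one thing a stream processor might do: keep a sliding time window per key in memory and flag a key once it crosses a threshold. The class name, the 60-second window and the threshold of three events are all made up for illustration, and the sketch assumes events arrive in timestamp order. A real CEP engine does far more than this.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events per key within the last `window_seconds` of event time.

    Hypothetical helper for illustration; assumes timestamps arrive in order.
    """

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.events = {}  # key -> deque of timestamps still inside the window

    def observe(self, key, timestamp):
        """Record one event; return True if the key is at or over the threshold."""
        window = self.events.setdefault(key, deque())
        window.append(timestamp)
        # Evict timestamps that have fallen out of the window.
        while window and window[0] <= timestamp - self.window_seconds:
            window.popleft()
        return len(window) >= self.threshold


# Made-up stream of (phone, seconds) events.
stream = [("phone_a", 0), ("phone_a", 10), ("phone_b", 12),
          ("phone_a", 20), ("phone_a", 100)]
counter = SlidingWindowCounter(window_seconds=60, threshold=3)
alerts = [(key, ts) for key, ts in stream if counter.observe(key, ts)]
print(alerts)
```

The decision is made as each event arrives, without first landing the data in a warehouse, which is the distinction the paragraph above draws between real-time and near-real-time analysis.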

Analysis can be done at various points—during data streaming (CEP), in the data warehouse after data ingest (which could be in just a few minutes), or in Hadoop (batch processed).  Along the way, various algorithms may be running which perform functions such as:

  • Pattern analysis – recognizing and matching voice, video, graphics, or other multi-structured data types. Could be mining both structured and multi-structured data sets.
  • Social network (graph) analysis – analyzing nodes and links between persons, possibly using call detail records and web data (Facebook, Twitter, LinkedIn and more).
  • Sentiment analysis – scanning text to reveal meaning, as when someone says, “I’d kill for that job.” Do they really mean they would murder someone, or is this just a figure of speech?
  • Path analysis – what are the most frequent steps, paths and/or destinations by those predicted to be in danger?
  • Affinity analysis – if person X is in a dangerous situation, how many others just like him/her are also in a similar predicament?
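
As one small illustration of the list above, path analysis can start as simply as counting which consecutive steps show up most often across people. The sketch below is only that: the function name, the location labels and the data are invented, and a real system would work from GPS or camera data rather than tidy lists.

```python
from collections import Counter


def most_frequent_paths(visits, length=2, top=3):
    """Count consecutive location sequences of a given length across people."""
    counts = Counter()
    for locations in visits.values():
        for i in range(len(locations) - length + 1):
            counts[tuple(locations[i:i + length])] += 1
    return counts.most_common(top)


# Made-up location histories, one ordered list per person.
visits = {
    "person_1": ["home", "cafe", "office", "bar"],
    "person_2": ["home", "cafe", "office"],
    "person_3": ["gym", "cafe", "office", "bar"],
}
top_paths = most_frequent_paths(visits, length=2, top=3)
print(top_paths)
```

Here the cafe-to-office step appears in all three histories, so it ranks first. Longer paths can be explored by raising `length`.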

It’s also possible that an access layer is needed for BI types of reporting, dashboard, or visualization techniques.

Finally, the result set (in this case, the name of the person “the machine” predicts is most likely to be killed in the next twenty-four hours) could be delivered to a device in the field: a mobile phone, tablet, computer terminal, etc.

These are just some of the technologies that would be necessary to make a “real life” prediction machine possible, just like in CBS’ Person of Interest. And I haven’t even discussed networking technologies (internet, intranet, compute fabric etc.), or middleware that would also fit in the equation.

What technologies are missing? What types of analysis are also plausible to bring Person of Interest to life? What’s on the list that should not be? Let’s see if we can solve the puzzle together!

How Mobile Operators are Mining Big Data

Mobile phone operators have long mined details on voice and data transactions to measure service quality, place cellular towers in optimal locations and even respond to tariff and rate disputes among various carriers. But that’s just scratching the surface of the value in mobile data.

Call detail records (CDRs) for mobile transactions are particularly interesting for analysis purposes. According to a Wikipedia entry, CDRs are chock full of useful data for carriers, including phone numbers for the originator and the call receiver, start time, duration, route and call type (voice, SMS, data), among other nuggets. It’s not unusual for mobile operators to mine databases of 100 terabytes (TB) and up to optimize networks, strategically position service personnel, handle customer service requests and more.

And carriers are also starting to discover value in performing social network analysis (SNA) in relational databases and MapReduce/Hadoop platforms to analyze social/relationship connections, find influencers and, if directed by government authorities, even track crime syndicates or monitor terrorist networks.
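
A rough sketch of what “finding influencers” could look like at its very simplest: treat each call as a link between two numbers and rank numbers by how many distinct contacts they have (a measure usually called degree). The function names and the call pairs below are hypothetical, and carriers would likely use richer measures than a raw contact count.

```python
from collections import defaultdict


def contact_sets(cdrs):
    """Map each number to the set of distinct numbers it exchanged calls with."""
    contacts = defaultdict(set)
    for caller, receiver in cdrs:
        contacts[caller].add(receiver)
        contacts[receiver].add(caller)
    return contacts


def top_influencers(cdrs, top=1):
    """Rank numbers by distinct contacts (degree); ties broken alphabetically."""
    contacts = contact_sets(cdrs)
    ranked = sorted(contacts, key=lambda n: (-len(contacts[n]), n))
    return [(n, len(contacts[n])) for n in ranked[:top]]


# Made-up (originator, receiver) pairs pulled from call detail records.
cdrs = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "A")]
ranking = top_influencers(cdrs, top=2)
print(ranking)
```

Repeat calls between the same pair count once here, since the sketch measures breadth of connections rather than call volume.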

While the types of analysis listed above are becoming commonplace, mobile phone operators are learning a lot more from “Big Data” analysis of everything they’re capturing.

Financial Times writer Gillian Tett explores some of these innovative approaches in a recent article (registration required). Tett notes that with mobile phone subscribers numbering 2.5 billion in emerging markets alone, mobile carriers, behavioral scientists and governments are learning more about “people’s movements, habits, and ideas.”

For example, Tett cites the 2010 Haitian earthquake where aid workers alongside researchers were able to “track Sim cards inside Haitians’ mobile phones.”  This in turn helped relief agencies analyze where populations dispersed and helped route food and medicine to where it was needed most.

Analyst firm IDC notes that smartphones are flying out the door to the tune of 400 million a quarter. With the rise of smartphones, there are also more mapping and location-based applications online. In fact, when billing, usage, location, social networks, content accessed and more come into view, there will be little left to the imagination to complete a picture of who you are, where you’ve been, what you’re doing, and where you’re predicted to go next.

These types of rich information will be accessed for customer, corporate and societal benefit. However, the potential for misuse is also ripe. The key question: is this much ado about nothing, or a data collection spree with an unhappy ending?

Data, Feces and the Future of Healthcare

University of California computer scientist Dr. Larry Smarr is a man on a mission: to measure everything his body consumes, performs, and yes, discharges. For Dr. Smarr, this data collection has a goal: to fine-tune his ecosystem in order to beat a potentially incurable disease. Is this kind of rigorous information collection and analysis the future of healthcare?

Talk to a few friends and you’ll probably find some who count calories or steps, or even chart their exercise and/or eating regimens. But it’s not very likely that your friends are quantifying their personal lives like Larry Smarr.

Atlantic Magazine’s June/July 2012 issue describes Dr. Larry Smarr’s efforts to capture his personal data. This is not data on his finances or internet viewing habits; Dr. Smarr is capturing health data, and lots of it. He uses armbands to record skin temperature, headbands to monitor sleep patterns, has blood drawn eight times a year, MRIs and ultrasounds when needed, and regular colonoscopies. And of course, he writes down every bite of food and also collects his own stool samples and then ships them to a laboratory.

Monitoring calories makes sense, but stools are also “information rich” says Smarr. “There are about 100 billion bacteria per gram. Each bacterium has DNA whose length is typically one to ten megabases; call it one million bytes of information,” Smarr exclaims. “This means human stool has a data capacity of 100,000 terabytes of information (~97 petabytes) stored per gram.” And all kinds of interesting information on the digestive tract, liver and pancreas can be culled from feces including infection, nutrient absorption and even cancer.
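
The arithmetic behind that quote is easy to check. The short sketch below simply multiplies the two figures Smarr gives and converts the result. The variable names are mine, and the note on where “~97” comes from is an inference from the numbers, not something the article states.

```python
# Figures taken from the quote above.
bacteria_per_gram = 100 * 10**9      # "about 100 billion bacteria per gram"
bytes_per_bacterium = 10**6          # "call it one million bytes"

total_bytes = bacteria_per_gram * bytes_per_bacterium   # 10**17 bytes

terabytes = total_bytes / 10**12             # decimal terabytes
petabytes_decimal = total_bytes / 10**15     # decimal petabytes
pebibytes = total_bytes / 2**50              # binary (1024-based) petabytes

# Dividing the decimal terabyte figure by 1024 gives roughly 97.7, which
# appears to be where the "~97 petabytes" in the text comes from.
mixed_units = terabytes / 1024

print(terabytes, petabytes_decimal, round(pebibytes, 1), round(mixed_units, 1))
```

So the 100,000 terabyte figure follows directly from the two inputs. Depending on whether you count in powers of 1000 or 1024, that is 100 petabytes or a little under 89.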

Armed with all this health data, Dr. Smarr is attempting to “model” his ecosystem. This means producing a working model that when fed inputs, can help report, analyze and eventually predict potential health issues. Just as sensor and diagnostic data are useful for auto manufacturers to perform warranty and quality analysis, Dr. Smarr is collecting and analyzing data to fine tune how his human body performs its functions.

But there’s more to the story. In his charting process, Dr. Smarr noticed that his level of C-reactive protein (CRP), which rises in response to inflammation, was high. “Troubled, I showed my graphs to my doctors and suggested that something bad was about to happen,” he says. Dr. Smarr believed his higher CRP count was acting as an early warning system, but his doctors dismissed him as too caught up in finding a problem where there was none.

Two weeks later Dr. Smarr felt a severe pain in the side of his abdomen. This time, the doctors diagnosed him with an acute bout of diverticulitis (bowel inflammation) and told him to take antibiotics. But Dr. Smarr wasn’t convinced. He tested his stools and came up with additional alarming numbers that suggested his diverticulitis was perhaps something more: early Crohn’s disease, an incurable and uncomfortable GI tract condition. The diagnosis of Crohn’s was subsequently confirmed by doctors.

Critics of “measuring everything” in healthcare suggest that by focusing on massive personal data collection and analysis we’ll all turn into hypochondriacs, looking for ghosts in the machine when there are none. Or, as Nassim Taleb argues, the more variables we test, the disproportionately higher the number of spurious results that appear (to be) “statistically significant”. There is also the argument that predictive analytics may do more harm than good by suggesting potential for illness where a patient may never end up developing a given disease. Correlation is not causation, in other words.
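
The “disproportionately higher” point can be illustrated with textbook multiple-comparisons arithmetic. This is not Taleb’s own derivation, just a minimal sketch under loud assumptions: every pair of variables is tested once, the tests are independent, none of the relationships is real, and the significance threshold is the conventional 0.05.

```python
def pairwise_tests(num_variables):
    """Number of distinct variable pairs that could be tested against each other."""
    return num_variables * (num_variables - 1) // 2


def expected_false_positives(num_tests, alpha=0.05):
    """Expected count of spurious 'significant' results when no effect is real."""
    return num_tests * alpha


def chance_of_any_false_positive(num_tests, alpha=0.05):
    """Probability of at least one spurious result, assuming independent tests."""
    return 1 - (1 - alpha) ** num_tests


for n in (5, 20, 100):
    tests = pairwise_tests(n)
    print(n, tests, expected_false_positives(tests),
          round(chance_of_any_false_positive(tests), 3))
```

Going from 5 to 100 variables multiplies the variable count by 20 but the number of pairs by nearly 500, and the expected number of spurious hits grows with the pairs.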

That said, you’d have a hard time convincing Dr. Smarr that patients, healthcare providers and even society at large couldn’t benefit from quantifying and analyzing inputs and outputs, thus gaining a better understanding of our own “system health”. And fortunately, thanks to Moore’s Law and today’s software applications, applying brute force computation to our data-rich problems is not only possible, it’s available now.

However, what makes sense conceptually is often much more difficult to implement in the real world. A sluggish healthcare system, data privacy issues, and a lack of data scientists to perform big data analysis are potential roadblocks to the “quantified life” becoming a reality for everyone any time soon.

Questions:

  • Do data collection and analysis methods as described in this article portend a revolution in healthcare?
  • If everyone rigorously collects and analyzes their personal health data, could this end up raising or reducing overall healthcare costs?

Can Big Data Analytics Solve “Too Big to Fail” Banking Complexity?

Despite investing millions upon millions of dollars in information technology systems, analytical modeling and PhD talent sourced from the best universities, global banks still have difficulty understanding their own business operations and investment risks, much less complex financial markets. Can “Big Data” technologies such as MapReduce/Hadoop, or even more mature technologies like BI/Data Warehousing help banks make better sense of their own complex internal systems and processes, much less tangled and interdependent global financial markets?

British physicist and cosmologist Stephen Hawking said in 2000, “I think the next century will be the century of complexity.” He wasn’t kidding.

While Hawking was surely speaking of science and technology, there’s little doubt he’d also regard global financial markets and financial players (hedge funds, banks, institutional and individual investors and more) as a very complex system.

With hundreds of millions of hidden connections and interdependencies, hundreds of thousands of various hard-to-understand financial products, and millions if not billions of “actors” each with their own agenda, global financial markets are the perfect example of extreme complexity. In fact, the global financial system is so complex that attempts to analytically model and predict markets may have worked for a time, but they ultimately failed to help companies manage their investment risks.

Some argue that complexity in markets might be deciphered through better reporting and transparency.  If every financial firm were required to provide deeper transparency into their positions, transactions, and contracts, then might it be possible for regulators to more thoroughly police markets?

Financial Times writer Gillian Tett has been reading the published work of Professor Henry Hu at University of Texas. In her article “How ‘too big to fail’ banks have become ‘too complex to exist’” (registration required), Tett says that Professor Hu argues technological advances and financial innovation (i.e. derivatives) have made financial instruments and flows too difficult to map. Moreover, Hu believes financial intermediaries themselves are so complex that they’ll continually have difficulty making sense of shifting markets.

Is a “too big to fail” situation exacerbated by a “too complex to exist” problem? And can technological advances such as further adoption of MapReduce or Hadoop platforms be considered a potential savior?  Hu seems to believe that supercomputers and more raw economic data might be one way to better understand complex financial markets.

However, even if massive data sets can be better searched, counted, aggregated and reported with MapReduce/Hadoop platforms, superior cognitive skills are necessary to make sense of outputs and then make recommendations and/or take actions based on findings. This kind of talent is in short supply.

It’s even highly likely the scope of complexity in financial markets is beyond today’s technology to compute, sort and analyze. And if that supposition is true, should next steps be to take measures to moderate if not minimize additional complexity?

Questions:

  • Are “Big Data” analytics the savior to mapping complex and global financial flows?
  • Is the global financial system—with its billions of relationships and interdependencies—past the point of understanding and prediction with mathematics and today’s compute power?

Are Data Scientists the Next Masters of the Universe?

Back in the late 1970s, traders buying and selling mortgages were pushed aside for new masters of the universe: “quants”, individuals who used mathematics to slice and dice mortgages into debt tranches. In the same way, today’s traditional Business Intelligence (BI) professionals must be looking over their collective shoulders as business and IT publications tout the emerging role of “data scientist”.

Before Lew Ranieri came on the scene, mortgages were a very staid business. Banks would loan money and keep assets on the books for up to thirty years (depending on how quickly the loan was paid back). Except for underwriting skills, there wasn’t much complexity to the mortgage business.

As a trader for Salomon Brothers, Lew Ranieri changed all that. Ranieri’s insight was that mortgages could be bundled together and then sliced into different tranches of varied risk. This slicing exercise was quite complex because of a home buyer’s ability to prepay the loan early or refinance. Michael Lewis, of Liar’s Poker fame, writes: “Mortgages were acknowledged to be the most mathematically complex securities in the marketplace. The complexity arose entirely out of the option the homeowner has to prepay his loan…mortgages were about math.”

Suddenly the very boring business of home loans became a very complex business challenge in how to slice the pie based on risk profiles and cash flows from interest and principal. Lewis writes: “Different investors place different prices on risk. Risk could be canned and sold like tomatoes.” And this mathematical complexity demanded a new skill set, quantitative analysis, to perform the necessary mathematical modeling to ensure investment banks remained profitable in this new business.
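
To give a feel for what “slicing the pie” means, here is a toy sketch of one simple structure, often called sequential pay: principal coming in from the mortgage pool (including prepayments) retires the first tranche before the next one receives anything. This is a deliberately simplified illustration with made-up balances and a hypothetical function name, not a description of how any actual Salomon Brothers deal worked, and it ignores interest entirely.

```python
def sequential_pay(tranche_balances, principal_received):
    """Allocate principal to tranches in order.

    Returns (payments, remaining_balances), one entry per tranche.
    """
    payments = []
    remaining = []
    left = principal_received
    for balance in tranche_balances:
        paid = min(balance, left)   # a tranche never receives more than it is owed
        payments.append(paid)
        remaining.append(balance - paid)
        left -= paid
    return payments, remaining


# Made-up pool: three tranches with balances of 50, 30 and 20.
payments, remaining = sequential_pay([50, 30, 20], principal_received=60)
print(payments, remaining)
```

A wave of prepayments retires the first tranche sooner and leaves the last one outstanding longer, so each slice carries a different exposure to homeowner behavior. Modeling that behavior is where the math came in.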

Pushed out by a new breed of mathematical whizz-kids, many former investment bankers and traders either retired or left for smaller financial firms. And the rise of the quants—or the new masters of the universe—was complete by the mid-1980s.

Is a similar shift happening in the field of Business Intelligence with the emerging “data scientist” role? The skill set of today’s data scientist is much more robust than that of someone who solely performs BI or ETL application development. With new sources and types of data (i.e. multi-structured), the data scientist must be able to develop new data-driven products such as churn models, create recommendation algorithms, assist marketers with behavioral segmentation and targeting and more.

But that’s not all. Fellow SmartDataCollective contributor Daniel Tunkelang says the data scientist “also needs to possess creativity and strong communication skills. Creativity drives the process of hypothesis generation, i.e., picking the right problems to solve that will create value for users and drive business decisions.” It’s a tall order to find all these skill sets in one person, much less build an internal competency center with such talent.

Perhaps for the foreseeable future, there’s room for both traditional BI professionals and the new breed of data scientists, as today both are valuable contributors in the field of analytics. However, with data growth on a fast-paced exponential curve, not to mention the complexity and velocity of multi-structured data, it’s easy to see how the mix of skill sets needed to succeed in the future will tilt more in favor of the data scientist role.

The mortgage bankers never saw Lew Ranieri coming. Regarding the rise of data scientists—should traditional BI professionals be worried?