Preserving Big Data to Live Forever

If anyone knows how to preserve data and information for long-term value, it’s the programmers at Internet Archive, based in San Francisco, CA. In fact, Internet Archive is attempting to capture every webpage, video, television show, MP3 file, or DVD published anywhere in the world. If Internet Archive is seeking to keep and preserve data for centuries, what can we learn from this non-profit about architecting a solution to keep our own data safeguarded and accessible for the long term?

Long term horizon by Irargerich. Courtesy of Flickr.

There’s a fascinating 13-minute documentary on the work of data curators at the Internet Archive. The mission of the Internet Archive is “universal access to all knowledge”. In their efforts to crawl every webpage, scan every book, and make information available to any citizen of the world, the Internet Archive team has designed a system that is resilient, redundant, and highly available.

Preserving knowledge for generations is no easy task. Key components of this massive undertaking include decisions in technology, architecture, data storage, and data accessibility.

First, just about every technology used by Internet Archive is either open source software or commodity hardware. For web crawling and adding content to its digital archives, Internet Archive developed Heritrix. To enable full text search on its website, Internet Archive uses Nutch running on Hadoop’s file system to “allow Google-style full-text search of web content, including the same content as it changes over time.” Some websites also suggest that HBase may be in the mix as a database technology.

Second, the concepts of redundancy and disaster planning are baked into the overall Internet Archive architecture. The non-profit has servers located in San Francisco, but in keeping with a vision that spans centuries and beyond, Internet Archive mirrors data in Amsterdam and Egypt to weather the volatility of historical events.

Third, many companies struggle to decide what data they should use, archive, or throw away. However, with the plummeting cost of hard disk storage and open source Hadoop, capturing and storing all data in perpetuity is more feasible than ever. For Internet Archive, all data are captured and nothing is thrown away.

Finally, it’s one thing to capture and store data, and another to make it accessible. Internet Archive aims to make the world’s knowledge base available to everyone. On the Internet Archive site, users can search and browse through ancient documents, view recorded video from years past and listen to music from artists who no longer walk planet Earth. Brewster Kahle, founder of the Internet Archive, says that with a simple internet connection, “A poor kid in Kenya or Kansas can have access to…great works no matter where they are, or when they were (composed).”

Capturing a mountain of multi-structured data (currently 10 petabytes and growing) is an admirable feat. However, the real magic lies in Internet Archive’s multi-century vision of making sure the world’s best and most useful knowledge is preserved. Political systems come and go, but with Internet Archive’s Big Data preservation approach, the treasures of the world’s digital content will hopefully exist for centuries to come.

Excel Model Errors – Don’t Throw the Baby out with the Bathwater

Two noted economists, Kenneth Rogoff and Carmen Reinhart, recently had their findings on country debt-to-GDP ratios questioned, after it was discovered that an Excel spreadsheet error led to some grave miscalculations. And while plenty of financial bloggers and economists took the opportunity to gloat over Rogoff and Reinhart’s misfortune, there is a larger point here: just because mathematical calculations are wrong, it doesn’t mean a particular idea isn’t directionally sound.

Courtesy of Flickr. By BlackLineSystems

In 2010, Rogoff and Reinhart published a paper on the link between high public debt and slower economic growth. Their findings showed that when a country reached a debt level of greater than 90% of GDP, that country’s growth would slow to a crawl. This paper was subsequently used as the empirical basis for fiscal austerity (or belt tightening) in many European countries.

However, since the publication of Rogoff and Reinhart’s 2010 paper, their findings have been under intense scrutiny. Facing pressure to release their methodology and data, Rogoff and Reinhart finally let other statisticians examine the study’s underlying calculations.

When Rogoff and Reinhart’s Excel spreadsheets were released, a pair of graduate students discovered some coding errors. One key error omitted five countries from the calculations, which changed the mean from negative 0.1% economic growth to a positive 2.2%, a pretty significant switch! In other words, the conclusion that the “magic number of 90% debt to GDP equates to slow growth” wasn’t so magical after all.
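To see how leaving rows out of an average can flip its sign, here is a minimal sketch in Python. The country labels and growth figures are invented purely for illustration and are not the Rogoff and Reinhart data; `mean_growth` is a hypothetical helper name.

```python
from statistics import mean

# Hypothetical growth rates (percent) for high-debt countries.
# These figures are made up for illustration only.
growth = {
    "A": 2.5,
    "B": 3.0,
    "C": 2.0,
    "D": 3.5,
    "E": 4.0,
    "F": -1.5,
    "G": -2.0,
    "H": 0.5,
}


def mean_growth(data, include):
    """Average growth over only the countries named in `include`."""
    return mean(data[country] for country in include)


all_countries = list(growth)
# Mimic a formula range that accidentally leaves out five rows.
truncated = all_countries[5:]

full_mean = mean_growth(growth, all_countries)
short_mean = mean_growth(growth, truncated)

print(f"All rows:       {full_mean:+.1f}%")
print(f"Five rows lost: {short_mean:+.1f}%")
```

With these made-up numbers the full average is positive while the truncated one is negative, which is the kind of swing a mis-dragged cell range can produce.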

Predictably, mainstream economists like Paul Krugman were quick to pounce. In his column, “Holy Coding Error, Batman”, Krugman called the error “embarrassing”, a “failure” and concluded it was reason enough to discount the underlying message that countries with higher debt could see slower growth in the future.

Krugman’s gloating aside, we should note that just because calculations supporting a particular idea are wrong, it doesn’t necessarily mean the proverbial “baby” should be tossed out with the “bathwater”.

Here’s why: an article on WSJ’s MarketWatch cites a few studies showing 88% of spreadsheets contain errors of some kind. Ray Panko, a professor of IT management at the University of Hawaii, says that spreadsheet “errors are pandemic”.

Now whether you believe the 88% number is correct, or even if you discount it by half—as a consultant friend of mine suggests—it’s still a whopper of a number!

Going forward, knowing that a fair percentage of Excel calculations are likely flawed in some manner, we should still expect the numbers supporting an idea to be accurate, while understanding that errors are possible. And because calculation errors are possible, we need to decide whether the idea, apart from any erroneous calculations, is sound.

Of course, there are instances where it’s critical to get mathematical calculations correct, such as launching rockets, landing planes, or engineering a building or bridge. But let’s also be careful not to immediately throw away an idea as “false” simply because it’s discovered someone made a correctable Excel spreadsheet error.

Getting back to the Rogoff and Reinhart commotion, this is exactly what Financial Times columnist Anders Aslund has in mind when he writes, “(While) the critique of Reinhart and Rogoff correctly identifies some technical errors in their work, one cannot read it and conclude the case for austerity is much weakened. High public debt is still a serious problem.” I would add this is especially true for countries whose debt is not denominated in their own currency.

With the realization that most spreadsheets have errors, we should check, double check, and triple check Excel calculations to ensure accuracy. Peer review of Excel calculations is also a recommended approach.

But let’s also not be so quick to throw out perfectly good ideas when it’s discovered that some Excel miscalculations or omissions have skewed the results. After all, a key idea may not be precisely supported by the maths, but still may be directionally correct. Or as New York Fund Manager Daniel Shuchman says, we don’t need to touch the stove to prove it’s hot.

Questions:

What are the key lessons in the Rogoff and Reinhart debacle? Mistaking correlation for causation? Sloppy coding? Applying too much historical data where conditions may have changed? Applying too little data (cherry-picking)? What say you?

Analytics and Hedgehogs: Lessons from the Tampa Bay Rays

The Tampa Bay Rays spend significantly less on payroll than some of the wealthier teams in Major League Baseball, but get results that are sometimes better than those of teams that wildly overspend. The Rays’ success boils down to two things: understanding how to be a hedgehog, and continually applying statistics and analytics in daily processes.

Greek poet Archilochus once said: “The fox knows many things, but the hedgehog knows one big thing.” Many interpretations of this phrase exist, but one characterization is the singular focus on a particular discipline, practice or vision.

According to a Sports Illustrated article “The Rays Way”, while Major League Baseball teams such as the Los Angeles Angels load up on heavy hitters such as Albert Pujols and Josh Hamilton, the Tampa Bay Rays instead have a hedgehog-like and almost maniacal spotlight on pitching.

For example, SI writer Tom Verducci says “The Rays are to pitching what Google is to algorithms.” In essence, the Rays have codified methods (for developing young pitchers and preventing injuries) and daily processes (including exclusive stretching and strengthening routines) into a holistic philosophy of “pitching first”.

But enabling that hedgehog-like approach to pitching is a culture of measurement and analysis. To illustrate, the SI article mentions that pitchers are encouraged to have a faster delivery (no more than 1.3 seconds should elapse between the start of the delivery and the ball hitting the catcher’s glove). Pitchers are also instructed to throw the changeup on 15% of deliveries. And while other pitchers try to focus on getting ahead of batters, the Rays have discovered it’s the first three pitches that matter, with the third being the most important.

In terms of applying analytics, the Rays rely on a small staff of “Moneyball” statistical mavens who provide pitchers with a daily dossier of the hitters they’ll likely face, including the pitches they like and those they hate. Analytics also plays a part in how the Rays position their outfielders and infielders to field balls that might otherwise go into the books as hits.

The Rays are guarded about sharing their proprietary knowledge on processes and measurement, and for good reason, as last year they had the lowest earned run average (ERA) in the American League and held batters to the lowest batting average (.228) in forty years. Even better, they’ve done this while spending ~70% less than other big market teams and winning 90+ games three years in a row. That’s nailing the hedgehog concept perfectly!

Seeing a case study like this, where a team or organization spends significantly less than competitors and gets better results, can be pretty exciting. However, an element of caution is necessary. It’s not enough to simply follow the hedgehog principle.

The strategy of a hedgehog-like “focus” can be highly beneficial, but in the case of the Tampa Bay Rays, it’s the singular focus on a critical aspect of baseball (i.e. pitching), joined with analytical processes, skilled people and the right technologies that really produce the winning combination.

Technologies and Analyses in CBS’ Person of Interest

Person of Interest is a broadcast television show on CBS where a “machine” predicts a person most likely to die within 24-48 hours. Then, it’s up to a mercenary and a data scientist to find that person and help them escape their fate. A straightforward plot really, but not so simple in terms of the technologies and analyses behind the scenes that could make a modern day prediction machine a reality. I have taken the liberty of framing some components that could be part of such a project. Can you help discover more?

In Person of Interest, “the machine” delivers either a single name or group of names predicted to meet an untimely death. However, in order to predict such an event, the machine must collect and analyze reams of big data and then produce a result set, which is then delivered to “Harold” (the computer scientist).

In real life, such an effort would be a massive undertaking for a single state or city, much less on a national basis. However, let’s set aside the enormity, and the plausibility, of such a scenario and instead see if we can identify various technologies and analyses that could make a modern day “Person of Interest” a reality.

It is useful to think of this analytics challenge in terms of a framework: data sources, data acquisition, data repository, data access and analysis and finally, delivery channels.

First, let’s start with data sources. In Person of Interest, the “machine” collects data from sources such as cameras (images, audio and video), call detail records, voice (landline and mobile), GPS for location data, sensor networks, and text sources (social media, web logs, newspapers, the internet, etc.). Data sets stored in relational databases, whether publicly available or not, might also be used for predictive purposes.

Next, data must be assimilated or acquired into a data management repository (most likely a multi-petabyte bank of computer servers). If data are acquired in near real time, they may go into a data warehouse and/or Hadoop cluster (maybe cloud based) for analysis and mining purposes. If data are analyzed in real time, it’s possible that complex event processing technologies (i.e. streams in memory) are used to analyze data “on the fly” and make instant decisions.
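As a rough illustration of analyzing data “on the fly”, the sketch below keeps a sliding time window of events in memory and raises a flag when one source becomes unusually active. It is a toy stand-in for a complex event processing engine, not how any real product works; the class name `SlidingWindowCounter`, the 60-second window, the threshold of three, and the sample stream are all assumptions made up for this example.

```python
from collections import deque


class SlidingWindowCounter:
    """Tracks events per key within the last `window_seconds` (hypothetical helper)."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.events = {}  # key -> deque of timestamps still inside the window

    def observe(self, key, timestamp):
        """Record one event; return True if the key has reached the threshold."""
        timestamps = self.events.setdefault(key, deque())
        timestamps.append(timestamp)
        # Evict timestamps that have fallen out of the window.
        while timestamps and timestamp - timestamps[0] > self.window_seconds:
            timestamps.popleft()
        return len(timestamps) >= self.threshold


# Invented stream of (source, seconds-since-start) events.
stream = [
    ("phone-1", 0),
    ("phone-2", 5),
    ("phone-1", 20),
    ("phone-1", 45),
    ("phone-1", 200),
]

detector = SlidingWindowCounter(window_seconds=60, threshold=3)
alerts = [(key, ts) for key, ts in stream if detector.observe(key, ts)]
print(alerts)
```

Each event is evaluated the moment it arrives and only the recent window is held in memory, which is the essential difference from loading everything into a warehouse or Hadoop first and querying later.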

Analysis can be done at various points—during data streaming (CEP), in the data warehouse after data ingest (which could be in just a few minutes), or in Hadoop (batch processed).  Along the way, various algorithms may be running which perform functions such as:

  • Pattern analysis – recognizing and matching voice, video, graphics, or other multi-structured data types. Could be mining both structured and multi-structured data sets.
  • Social network (graph) analysis – analyzing nodes and links between persons. Possibly using call detail records, web data (Facebook, Twitter, LinkedIn and more).
  • Sentiment analysis – scanning text to reveal meaning, as when someone says, “I’d kill for that job”: do they really mean they would murder someone, or is this just a figure of speech?
  • Path analysis – what are the most frequent steps, paths and/or destinations by those predicted to be in danger?
  • Affinity analysis – if person X is in a dangerous situation, how many others just like him/her are also in a similar predicament?
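To make one item from the list above concrete, here is a minimal sketch of social network (graph) analysis over call detail records, using only the Python standard library. The records, the names, and the helpers `build_graph` and `hops_between` are hypothetical; a real system would work at a vastly larger scale and likely with a dedicated graph engine.

```python
from collections import defaultdict, deque

# Hypothetical call detail records: (caller, callee).
calls = [
    ("ana", "ben"),
    ("ben", "cy"),
    ("cy", "dee"),
    ("ana", "cy"),
    ("eve", "fay"),
]


def build_graph(records):
    """Build undirected adjacency sets (nodes and links) from call pairs."""
    graph = defaultdict(set)
    for a, b in records:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def hops_between(graph, start, goal):
    """Breadth-first search: links on the shortest path, or None if unconnected."""
    if start not in graph or goal not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None


graph = build_graph(calls)
# The person with the most distinct contacts.
most_connected = max(graph, key=lambda person: len(graph[person]))

print(most_connected)
print(hops_between(graph, "ana", "dee"))
print(hops_between(graph, "ana", "eve"))
```

Counting contacts and measuring how many links separate two people are among the simplest graph measures; the same adjacency structure could feed path or affinity analysis.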

It’s also possible that an access layer is needed for BI types of reporting, dashboard, or visualization techniques.

Finally, the result set (in this case, the name of the person “the machine” predicts is most likely to be killed in the next twenty-four hours) could be delivered to a device in the field, such as a mobile phone, tablet, or computer terminal.

These are just some of the technologies that would be necessary to make a “real life” prediction machine possible, just like in CBS’ Person of Interest. And I haven’t even discussed networking technologies (internet, intranet, compute fabric etc.), or middleware that would also fit in the equation.

What technologies are missing? What types of analysis are also plausible to bring Person of Interest to life? What’s on the list that should not be? Let’s see if we can solve the puzzle together!

Real-Time Pricing Algorithms – For or Against Us?

In 2012, Cyber Monday sales climbed 30% over the previous year’s results. Indeed, Cyber Monday benefits both online retailers, who capture massive Christmas spending in one day, and consumers, who can shop at work or home and skip the holiday crowds.

And yet, underneath the bustle of ringing “cyber cash registers”, a battle brews as retailers now can easily change prices, even by the second, using sophisticated algorithms to out-sell competitors. Consumers aren’t standing still though. They also have algorithmic tools available to help them determine the best prices.

Let’s say you are thinking about buying a big screen television from a major online retailer. The price at 12 noon is $546.40, but you decide to go get some lunch to think about it. An hour later, you check back on that same item and now it’s priced at $547.50. What gives? Depending on your perspective, you’ll either end up being the beneficiary of algorithmic pricing models or the victim.

A Financial Times article notes the price of an Apple TV device sold by three major online retailers changed anywhere from 5-10% daily (both up and down) in late November. Some HDTVs changed prices by the hour.

These up-to-the-minute changes are made possible by real-time pricing algorithms that collect data from competitor websites and from customer interactions on retailers’ own sites, and then make pricing adjustments based on inventory, margins, and competitive strategies.

An algorithm is really just a recipe, if you will, codified into steps and executed at blinding speed by computers. Thus, a pricing algorithm may take inputs from competitor websites and other data sources and then, based on pre-defined logic, churn out a “price” that is posted on a website. Typically this process is executed in seconds.
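The “recipe” idea can be sketched in a few lines of Python. This is a deliberately simple, rule-based illustration and not any retailer’s actual method; the function `reprice`, its parameters, and every threshold (a 10% minimum margin, a one-cent undercut, a 5% markup when stock runs low) are assumptions invented for the example.

```python
def reprice(cost, competitor_prices, units_in_stock,
            min_margin=0.10, undercut=0.01,
            low_stock=5, scarcity_markup=0.05):
    """Return a new posted price from pre-defined rules (hypothetical logic)."""
    # Rule 1: never sell below cost plus a minimum margin.
    floor = round(cost * (1 + min_margin), 2)
    if not competitor_prices:
        return floor

    cheapest_rival = min(competitor_prices)
    if units_in_stock <= low_stock:
        # Rule 2: when inventory is scarce, price above the cheapest rival.
        target = cheapest_rival * (1 + scarcity_markup)
    else:
        # Rule 3: otherwise undercut the cheapest rival by a small amount.
        target = cheapest_rival - undercut

    return max(floor, round(target, 2))


# Plenty of stock: undercut the cheapest rival by a cent.
normal_price = reprice(cost=450.00, competitor_prices=[547.50, 560.00], units_in_stock=40)
# Almost sold out: the price moves up instead.
scarce_price = reprice(cost=450.00, competitor_prices=[547.50, 560.00], units_in_stock=3)
# High cost: the margin floor wins over the competitive target.
floor_price = reprice(cost=540.00, competitor_prices=[547.50], units_in_stock=40)

print(normal_price, scarce_price, floor_price)
```

Run on a schedule against freshly scraped competitor prices, even rules this simple would make a posted price drift up and down throughout the day.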

Thus, it is increasingly common, depending on the specific item, day, hour, or even minute, for prices of online items to change at a moment’s notice. If keeping up with rapidly rising and falling prices seems like a shopper’s nightmare, you’re right. However, consumers also have tools to fight back.

The same FT article points out that some consumers are using websites such as Decide.com to determine the best, if not the most “fair”, price points. Using either Decide.com or Decide’s smartphone app, for an annual fee of $30, a consumer can access pricing predictions based on Decide’s predictive pricing algorithms. Simply look up an item, and Decide.com gives its best prediction of when and where to buy it.

Today, we take for granted that grocery store prices generally don’t change within the hour, and that prices at the gas pump (while sometimes changing intra-day) generally don’t change by the minute. As data collection processes move from overnight batch to near real time, expect more aggressive algorithmic pricing, coming to a grocer, gas pump—or theater near you!