Preserving Big Data to Live Forever

If anyone knows how to preserve data and information for long-term value, it’s the programmers at the Internet Archive, based in San Francisco, CA. In fact, the Internet Archive is attempting to capture every webpage, video, television show, MP3 file, and DVD published anywhere in the world. If the Internet Archive is seeking to keep and preserve data for centuries, what can we learn from this non-profit about architecting a solution to keep our own data safeguarded and accessible for the long term?

Long term horizon by Irargerich. Courtesy of Flickr.

There’s a fascinating 13-minute documentary on the work of data curators at the Internet Archive. The mission of the Internet Archive is “universal access to all data”. In their efforts to crawl every webpage, scan every book, and make information available to any citizen of the world, the Internet Archive team has designed a system that is resilient, redundant, and highly available.

Preserving knowledge for generations is no easy task. Key components of this massive undertaking include decisions in technology, architecture, data storage, and data accessibility.

First, just about every technology used by the Internet Archive is either open source software or commodity hardware. For web crawling and adding content to its digital archives, the Internet Archive developed Heritrix. To enable full-text search on the Internet Archive’s website, Nutch runs on Hadoop’s file system to “allow Google-style full-text search of web content, including the same content as it changes over time.” Some websites also mention that HBase could be in the mix as a database technology.
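
Heritrix itself is a Java crawler and the real pipeline is far more elaborate, but the basic crawl-and-archive idea is easy to illustrate. The sketch below is a minimal, assumption-laden stand-in (a hypothetical archive_page helper that writes timestamped JSON records), not the Internet Archive’s actual tooling:

    # Illustrative sketch only -- not Heritrix or the Internet Archive's pipeline.
    # Fetch a page and append a timestamped snapshot record to a local gzip archive.
    import gzip
    import json
    from datetime import datetime, timezone
    from urllib.request import urlopen

    def archive_page(url, archive_path="snapshots.jsonl.gz"):
        """Fetch `url` and append its content plus capture metadata to the archive."""
        with urlopen(url, timeout=30) as response:
            body = response.read()
            status = response.status
        record = {
            "url": url,
            "captured_at": datetime.now(timezone.utc).isoformat(),
            "status": status,
            "content": body.decode("utf-8", errors="replace"),
        }
        with gzip.open(archive_path, "at", encoding="utf-8") as archive:
            archive.write(json.dumps(record) + "\n")

    archive_page("https://example.com/")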

Second, the concepts of redundancy and disaster planning are baked into the overall Internet Archive architecture. The non-profit has servers located in San Francisco, but in keeping with its multi-century vision, the Internet Archive mirrors data in Amsterdam and Egypt to weather the volatility of historical events.

Third, many companies struggle to decide what data they should use, archive, or throw away. However, with the plummeting cost of hard disk storage and open source Hadoop, capturing and storing all data in perpetuity is more feasible than ever. For the Internet Archive, all data are captured and nothing is thrown away.
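
To see why “keep everything” has become plausible, consider a back-of-the-envelope sketch. The disk price is an assumption for illustration (though three-way replication is Hadoop’s default), not the Internet Archive’s actual figure:

    # Back-of-the-envelope storage cost -- figures are illustrative assumptions.
    petabytes = 10              # roughly the archive size cited in this chapter
    tb_per_pb = 1024
    cost_per_tb_usd = 50        # assumed commodity disk price per terabyte
    replication_factor = 3      # Hadoop's default block replication

    raw_tb = petabytes * tb_per_pb
    disk_cost = raw_tb * replication_factor * cost_per_tb_usd
    print(f"{raw_tb:,} TB raw, roughly ${disk_cost:,} in disks at 3x replication")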

Finally, it’s one thing to capture and store data, and another to make it accessible. The Internet Archive aims to make the world’s knowledge base available to everyone. On the Internet Archive site, users can search and browse through ancient documents, view recorded video from years past, and listen to music from artists who no longer walk planet earth. Brewster Kahle, founder of the Internet Archive, says that with a simple internet connection, “A poor kid in Kenya or Kansas can have access to…great works no matter where they are, or when they were (composed).”
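
That accessibility extends to programmatic use as well. The Wayback Machine exposes a public availability endpoint at archive.org/wayback/available; the short sketch below (standard library only) asks whether a URL has an archived snapshot and, if so, where to find it:

    # Query the Wayback Machine's public availability API for an archived snapshot.
    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    def closest_snapshot(url):
        """Return the closest archived snapshot record for `url`, or None."""
        query = urlencode({"url": url})
        with urlopen("https://archive.org/wayback/available?" + query, timeout=30) as resp:
            data = json.load(resp)
        return data.get("archived_snapshots", {}).get("closest")

    snapshot = closest_snapshot("example.com")
    if snapshot:
        print(snapshot["url"], snapshot["timestamp"])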

Capturing a mountain of multi-structured data (currently 10 petabytes and growing) is an admirable feat; however, the real magic lies in the Internet Archive’s multi-century vision of making sure the world’s best and most useful knowledge is preserved. Political systems come and go, but with the Internet Archive’s Big Data preservation approach, the treasures of the world’s digital content will hopefully exist for centuries to come.

NSA and the Future of Big Data

The National Security Agency of the United States (NSA) has seen the future of Big Data, and it doesn’t look pretty. Data volumes are growing faster than the NSA can store them, much less analyze them. If the NSA, with hundreds of millions of dollars to spend on analytics, is challenged, it raises the question: is there any hope for your particular company?

Courtesy of Flickr. By One Lost Penguin

By now, most IT industry analysts accept that the term “Big Data” means much more than data volumes increasing at an exponential clip. There’s also velocity, the speed at which data are created, ingested, and analyzed. And of course, there’s variety in terms of multi-structured data types, including web logs, text, social media, machine data, and more.

But let’s get back to data volumes. A commonly referenced IDC report notes that data volumes are more than doubling every two years. Now that’s exponential growth that Professor Albert Bartlett can appreciate!
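
What “more than doubling every two years” means in practice is easy to underestimate. A quick sketch (the starting volume is an arbitrary assumption) shows a 32-fold increase over a single decade:

    # Compounding sketch: doubling every two years for a decade.
    volume_pb = 1.0                  # assumed starting volume, in petabytes
    for year in range(0, 11, 2):
        print(f"year {year:2d}: {volume_pb:5.0f} PB")
        volume_pb *= 2               # "more than doubling every two years"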

What are the consequences of unwieldy data volumes? For starters, it’s nearly impossible to effectively deal with the flood.

In “The Shadow Factory,” James Bamford describes how the NSA is vigorously constructing data centers in remote and not-so-remote locations to properly store the “flood of data” captured from foreign communications, including video, voice, text, and spreadsheets. One NSA director is quoted as saying: “Some intelligence data sources grow at a rate of four petabytes per month now…and the rate of growth is increasing!”

Building data centers and storing petabytes of data isn’t the end goal. What the NSA really needs is analysis. And in this area the NSA is falling woefully short, but not for lack of trying.

That’s because in addition to the fastest supercomputers from Cray and Fujitsu, the NSA needs programmers who can modify algorithms on the fly to account for new key words that terrorists or other foreign nationals may be using. The NSA also constantly seeks linguists to help translate, document, and analyze various foreign languages (something computers struggle with, especially discerning sentiment and context).
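
The gap between matching key words and understanding them is easy to demonstrate. The toy sketch below (a hypothetical watch list, in no way a description of NSA tooling) flags both messages, even though only one of them plausibly matters; telling them apart is exactly the context and sentiment problem that still requires human linguists:

    # Toy keyword matcher -- purely illustrative; not NSA tooling.
    import re

    WATCH_LIST = {"attack", "target"}

    def flag(message):
        """Return True if any watch-list word appears in the message."""
        words = set(re.findall(r"[a-z']+", message.lower()))
        return bool(words & WATCH_LIST)

    print(flag("The attack begins at dawn."))                   # True
    print(flag("That headline was an attack on my patience."))  # also True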

According to Bamford, the NSA sifts through petabytes of data on a daily basis and yet the flood of data continues unabated.

In summary, for the NSA it appears there are more data to be stored and analyzed than budget to procure supercomputers, programmers, and analytic talent. There’s just too much data and too little “intelligence” to let directors know which patterns, links, and relationships are most important. One NSA director says: “We’ve been into the future and we’ve seen the problems of a ‘tidal wave’ of data.”

So if one of the most powerful government agencies in the world is struggling with an exponential flood of big data, is there hope for your company?  For advice, we turn to Bill Franks, Chief Analytics Officer for Teradata.

In a Smart Data Collective article, Mr. Franks says that even though the challenge of Big Data may be initially overwhelming, it pays to eat an elephant one bite at a time. “People need to step back, push the hype from their minds, and think things through,” he says. In other words, don’t stress about going big from day one.

Instead, Franks counsels companies to “start small with big data.” Capture a bit at a time, gain value from your analysis, and then collect more, he says. There’s an overwhelming temptation to splurge on hundreds of machines and lots of software to capture and analyze everything. Avoid this route and instead take the road less traveled: the incremental approach.

The NSA may be drowning in information, but there’s no need to inflict sleepless nights on your IT staff. Think big data, but start small. Goodness knows, in terms of data, there will always be plenty more to capture and analyze. The data flood will continue. And from an IT job security perspective, that’s a comforting thought.