Preserving Big Data to Live Forever

If anyone knows how to preserve data and information for long-term value, it's the programmers at Internet Archive, based in San Francisco, CA. In fact, Internet Archive is attempting to capture every webpage, video, television show, MP3 file, and DVD published anywhere in the world. If Internet Archive is seeking to preserve data for centuries, what can we learn from this non-profit about architecting a solution that keeps our own data safeguarded and accessible over the long term?

Long term horizon by Irargerich. Courtesy of Flickr.

There’s a fascinating 13-minute documentary on the work of data curators at the Internet Archive. The mission of the Internet Archive is “universal access to all data”. In their efforts to crawl every webpage, scan every book, and make information available to any citizen of the world, the Internet Archive team has designed a system that is resilient, redundant, and highly available.

Preserving knowledge for generations is no easy task. Key components of this massive undertaking include decisions in technology, architecture, data storage, and data accessibility.

First, just about every technology used by Internet Archive is either open source software or commodity hardware. For web crawling and adding content to its digital archives, Internet Archive developed Heritrix. For full-text search on the Internet Archive website, Nutch runs on Hadoop's file system to "allow Google-style full-text search of web content, including the same content as it changes over time." Some sources also suggest HBase may be in the mix as a database technology.
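To make the "same content as it changes over time" idea concrete: a time-versioned web archive boils down to storing every capture of a URL as a separate, timestamped record, never overwriting earlier versions. The following is a minimal sketch in that spirit, not Internet Archive's actual Heritrix code; the archive layout and function names are hypothetical, and only the Python standard library is assumed.

# Hypothetical sketch of time-versioned page capture (not Internet Archive code).
import hashlib
import pathlib
import urllib.request
from datetime import datetime, timezone

ARCHIVE_ROOT = pathlib.Path("archive")  # hypothetical local store

def capture(url: str) -> pathlib.Path:
    """Fetch a page and file it under <url-hash>/<UTC-timestamp>.html,
    so every capture of the same URL is kept as a separate version."""
    raw = urllib.request.urlopen(url, timeout=30).read()
    url_key = hashlib.sha256(url.encode()).hexdigest()[:16]
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    path = ARCHIVE_ROOT / url_key / f"{stamp}.html"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(raw)
    return path

def snapshots(url: str) -> list[pathlib.Path]:
    """List every stored capture of a URL, oldest first."""
    url_key = hashlib.sha256(url.encode()).hexdigest()[:16]
    return sorted((ARCHIVE_ROOT / url_key).glob("*.html"))

if __name__ == "__main__":
    print(capture("https://example.com"))
    print(snapshots("https://example.com"))

Because captures are append-only and keyed by URL plus timestamp, a search layer (Nutch, in Internet Archive's case) can index all versions and return a page as it looked at any point in its history.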

Second, the concepts of redundancy and disaster planning are baked into the overall Internet Archive architecture. The non-profit has servers located in San Francisco, but in keeping with its multi-century vision, Internet Archive mirrors data in Amsterdam and Egypt to weather the volatility of historical events.
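The unglamorous core of any geo-mirroring scheme like this is continually verifying that each replica still matches the primary copy. Here is a hedged sketch of that verification step, with placeholder file paths standing in for the actual mirror topology, which is not documented here.

# Hypothetical sketch of mirror verification: confirm each replica of a
# file still matches the primary copy by comparing SHA-256 digests.
import hashlib
import pathlib

def sha256_of(path: pathlib.Path) -> str:
    """Stream the file in 1 MB chunks so even very large archives
    never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_mirrors(primary: pathlib.Path, mirrors: list[pathlib.Path]) -> dict[str, bool]:
    """Return, per mirror, whether its copy matches the primary's digest."""
    expected = sha256_of(primary)
    return {str(m): m.exists() and sha256_of(m) == expected for m in mirrors}

if __name__ == "__main__":
    # Placeholder paths standing in for San Francisco / Amsterdam / Egypt copies.
    report = verify_mirrors(
        pathlib.Path("sf/item.warc.gz"),
        [pathlib.Path("amsterdam/item.warc.gz"), pathlib.Path("egypt/item.warc.gz")],
    )
    for mirror, ok in report.items():
        print(mirror, "OK" if ok else "MISMATCH/MISSING")

Run on a schedule, a check like this turns silent bit rot or a lost mirror into an actionable alert rather than a discovery made centuries too late.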

Third, many companies struggle to decide what data they should use, archive, or throw away. However, with the plummeting cost of hard disk storage and open source Hadoop, capturing and storing all data in perpetuity is more feasible than ever. For Internet Archive, all data are captured and nothing is thrown away.

Finally, it’s one thing to capture and store data, and another to make it accessible. Internet Archive aims to make the world’s knowledge base available to everyone. On the Internet Archive site, users can search and browse ancient documents, view recorded video from years past, and listen to music from artists who no longer walk the earth. Brewster Kahle, founder of the Internet Archive, says that with a simple internet connection, “A poor kid in Kenya or Kansas can have access to…great works no matter where they are, or when they were (composed).”

Capturing a mountain of multi-structured data (currently 10 petabytes and growing) is an admirable feat; however, the real magic lies in Internet Archive’s multi-century vision of making sure the world’s best and most useful knowledge is preserved. Political systems come and go, but with Internet Archive’s Big Data preservation approach, the treasures of the world’s digital content will hopefully exist for centuries to come.

In the Future, Will Software Be More Important than Hardware?

From the talent wars for software engineers in Silicon Valley to the hundreds of thousands of new smartphone applications coming online, it’s not far-fetched to believe that software rules the world today and will continue to rule in the future. However, some hardware makers strongly disagree, arguing that it’s the physical design, construction, and production of the device, machine, or infrastructure that will take precedence. Who holds the future: hardware makers, software makers, or both?

Flickr for Android, courtesy of Flickr.

A Financial Times article by Andrew Keen highlights a brewing battle between hardware and software makers for investor dollars. Both sides believe that they are the smarter investment for the long run. And both have a point.

First, it’s tempting to see hardware manufacturing as nothing more than something that should be outsourced. After all, companies such as Amazon outsource production of the Kindle to offshore manufacturers, and it’s commonly understood that most large computer companies leave production of machines to Chinese and Taiwanese contract manufacturers such as Flextronics, Foxconn, and others.

However, companies such as CPU manufacturers and tablet makers are increasingly taking some of these manufacturing capabilities in-house, especially as product complexity increases and tight integration between software and hardware becomes more commonplace.

In addition, taking manufacturing in-house means less bureaucracy in working with an outsourced vendor, arguably higher accountability (there’s no one else to blame for failures), and more control over manufacturing processes. Net-net, in many cases the higher a product moves up the value chain in terms of complexity and integration, the more it makes sense for companies to assert authority, control, and accountability over manufacturing operations, sometimes all the way to assuming full responsibility for hardware production.

The counterargument, however, is that hardware will always be a commodity. Designs and specs can be written so that just about any respectable contract manufacturer can produce a product. The real value, say software makers, lies in everything from the design of user interfaces to the behind-the-scenes algorithms responsible for executing complex processes.

Proof points for the “software will rule” camp include software companies gaining a bigger slice of VC funding and the sheer number of applications developed for iPhone (650k) and Android (400k). For further reading on this perspective, review VC and market maker Marc Andreessen’s comments.

Ultimately, the most likely answer to who will win the future (hardware vs. software) is that there’s a place for both camps. For example, it’s the integration of commodity hardware with advanced software that seems to be the best fit for many companies looking to acquire analytics capabilities.

This is evidenced by the data warehouse appliance trend of an engineered and integrated solution stack of hardware and software coupled with services for implementation, maintenance and operations. These solution stacks are architected, performance tested, certified and supported. And they usually come from a single vendor responsible for the entire end-to-end package.

In the meantime, we have a strong debate. VCs like Marc Andreessen say software companies are primed to “take over large swathes of the economy”. Hardware makers claim the user experience in terms of design, touch, and feel is more relevant than ever. What say you?