Preserving Big Data to Live Forever

If anyone knows how to preserve data and information for long-term value, it’s the programmers at Internet Archive, based in San Francisco, CA. In fact, Internet Archive is attempting to capture every webpage, video, television show, MP3 file, or DVD published anywhere in the world. If Internet Archive is seeking to keep and preserve data for centuries, what can we learn from this non-profit about architecting a solution to keep our own data safeguarded and accessible long-term?

Long term horizon by Irargerich. Courtesy of Flickr.

There’s a fascinating 13-minute documentary on the work of data curators at the Internet Archive. The mission of the Internet Archive is “universal access to all knowledge”. In their efforts to crawl every webpage, scan every book, and make information available to any citizen of the world, the Internet Archive team has designed a system that is resilient, redundant, and highly available.

Preserving knowledge for generations is no easy task. Key components of this massive undertaking include decisions in technology, architecture, data storage, and data accessibility.

First, just about every technology used by Internet Archive is either open source software or commodity hardware. Internet Archive developed Heritrix for web crawling and for adding content to its digital archives. To enable full-text search on its website, Internet Archive runs Nutch on Hadoop’s file system to “allow Google-style full-text search of web content, including the same content as it changes over time.” Some web sites also mention that HBase could be in the mix as a database technology.

Second, the concepts of redundancy and disaster planning are baked into the overall Internet Archive architecture. The non-profit has servers located in San Francisco, but in keeping with its multi-century vision, Internet Archive mirrors data in Amsterdam and Egypt to weather the volatility of historical events.

Third, many companies struggle to decide what data they should use, archive, or throw away. However, with the plummeting cost of hard disk storage and open source Hadoop, capturing and storing all data in perpetuity is more feasible than ever. For Internet Archive, all data are captured and nothing is thrown away.

Finally, it’s one thing to capture and store data, and another to make it accessible. Internet Archive aims to make the world’s knowledge base available to everyone. On the Internet Archive site, users can search and browse through ancient documents, view recorded video from years past and listen to music from artists who no longer walk planet Earth. Brewster Kahle, founder of the Internet Archive, says that with a simple internet connection, “A poor kid in Kenya or Kansas can have access to…great works no matter where they are, or when they were (composed).”

Capturing a mountain of multi-structured data (currently 10 petabytes and growing) is an admirable feat, but the real magic lies in Internet Archive’s multi-century vision of making sure the world’s best and most useful knowledge is preserved. Political systems come and go, but with Internet Archive’s Big Data preservation approach, the treasures of the world’s digital content will hopefully exist for centuries to come.

No Gold Medals for “Black Swan” Criers?

It’s extremely unfashionable to be the “Black Swan” crier in your organization, the person who warns line of business managers about the heavy impact of extreme but unlikely events. In fact, just the opposite is the norm: plenty of company executives get rewarded in career growth and compensation for ignoring risks, or for sweeping them under the rug for others to tackle down the road. It’s time to listen, really listen, to what the Black Swan criers in your own company are saying.

Courtesy of Flickr. By Al S

In 18th-century England, the town crier would be dressed in fine clothing, given a bell, and told to “cry” or proclaim significant news to merchants and citizens alike. Sometimes the town crier brought bad news, such as tax increases. Fortunately, such a person was protected by laws stating that anyone causing harm to the town crier could be convicted of treason. Wikipedia notes that the phrase “don’t shoot the messenger” was a real command!

Fast forwarding to our current time, there are few rewards for those who “cry” or warn about the dangers of “Black Swans”, extreme but rare events that carry a high impact. See here for a list of “Black Swan” events since 2001.

Case in point: leading up to the September 2008 financial crisis, only a few prognosticators could see that quasi-government agencies such as Fannie Mae and Freddie Mac were buying too many no-documentation, no-income (NINJA) loans that could go bust if the US economy went into recession. Nassim Taleb, author of The Black Swan, was a key figure who needed no more than a glance at these agencies’ financials in 2007 to declare, “(They seem) to be sitting on a barrel of dynamite, vulnerable to the slightest hiccup”.

And of course, that dynamite was lit as the global economy teetered on the edge of major depression, and the agencies ultimately lost a combined $15B. Even so, Mr. Taleb was ridiculed as a “clown” and “rabble rouser” for many of his prognostications.

Today’s potential corporate whistleblowers don’t fare much better when warning about everyday risks, whether those risks reside in supply chains, nuclear power plants, cloud computing infrastructures or other complex systems prone to fragility. It’s much easier to carry on with business as usual than to plan and prepare for events that, however unlikely, could end up disabling or dismantling your organization in one fell swoop.

Indeed, Taleb argues it’s much easier for managers to tout what they “did”, rather than what they avoided by taking proper risk management precautions.  “The corporate manager who avoids a loss will often not be rewarded,” he says.

Business executives should not turn their eyes and ears from their own “town criers” preaching Black Swans. While their warnings are painful to listen to, and sometimes run counter to today’s “business wisdom”, those closest to your business operations often see what can blow up long before your mid-level and corporate executives gain visibility.

These “Black Swan” criers may never be personally rewarded with a gold medal for highlighting key risks, but it’s the smart business that ultimately finds a way to seek their opinions and at least scenario plan for their noted “worst case event” outcomes.

Better Capacity Management ABCs: “Always Be (Thinking) Cloud”

Sales personnel have a mantra, “ABC” or “Always Be Closing,” as a reminder to continually drive conversations to selling conclusions or move on. In a world where business conditions remain helter-skelter, traditional IT capacity management techniques are proving insufficient. It’s time to think different – or “ABC”: Always Be (Thinking) Cloud.

Getting more for your IT dollar is a smart strategy, but running your IT assets at the upper limits of utilization—without a plan to get extra and immediate capacity at a moment’s notice—isn’t so brainy. Let me explain why.

Courtesy of Flickr. By M Hooper

Author Nassim Taleb writes in his latest tome, “Antifragile,” about how humans are often unprepared for randomness and thus fooled into believing that tomorrow will be much like today. He says we often expect linear outcomes in a complex and chaotic world, where responses and events are frequently not dished out in a straight line.

What exactly does this mean? Dr. Taleb often bemoans our pre-occupation with efficiency and optimization at the expense of reserving some “slack” in systems.

For example, he cites London’s Heathrow as one of the world’s most “over-optimized” airports. At Heathrow, when everything runs according to plan, planes depart on time and passengers are satisfied with airline travel. However, Dr. Taleb says that because of over-optimization, “the smallest disruption in Heathrow causes 10-15 hour delays.”

Bringing this back to the topic at hand, when a business runs its IT assets at continually high utilization rates it’s perceived as a beneficial and positive outcome. However, running systems at near 100% utilization offers little spare capacity or “slack” to respond to changing market conditions without affecting expectations (i.e. service levels) of existing users.
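
One way to make the cost of missing slack concrete is a textbook queueing model. The sketch below assumes an M/M/1 queue (a single server with random arrivals and random service times). That model is an illustrative assumption, not something Dr. Taleb or this post prescribes, and the function name and sample figures are hypothetical.

```python
def mean_response_time(service_time: float, utilization: float) -> float:
    """Mean time a request spends in an M/M/1 system (waiting plus service).

    service_time: average time to serve one request on an idle server.
    utilization:  fraction of time the server is busy, in [0, 1).
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    # Standard M/M/1 result: W = S / (1 - rho)
    return service_time / (1.0 - utilization)


if __name__ == "__main__":
    # Hypothetical workload: a query that takes 1 second on an idle server.
    for u in (0.50, 0.80, 0.90, 0.95, 0.99):
        slowdown = mean_response_time(1.0, u)
        print(f"{u:.0%} busy -> about {slowdown:.0f}x the idle response time")
```

Under this model, the same query averages about 2 seconds at 50% utilization, about 10 seconds at 90%, and about 100 seconds at 99%. Response time does not degrade in a straight line as utilization climbs, which is why a small surge on a nearly full system hurts existing users so much.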

For example, in the analytics space, running data warehouse and BI servers at high utilization rates makes great business sense, until you realize that business needs constantly change: new users and new applications come online (often as mid-year requests), and data volumes continue to explode at an exponential pace. And we haven’t even mentioned corporate M&A activities, special projects from the C-suite, or unexpected bursts of product and sales activity. In a complex and evolving world, relying solely on statistical forecasts (e.g. linear or multiple linear regression analysis) isn’t going to cut it for capacity planning purposes.
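
To see why a straight-line forecast can fall short when growth compounds, here is a minimal sketch. The history (a warehouse that doubles every year) and every name in it are made up for illustration; real capacity data will look different.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical history: warehouse size in TB, doubling every year.
years = [0, 1, 2, 3, 4]
volume_tb = [2 ** y for y in years]  # 1, 2, 4, 8, 16

slope, intercept = fit_line(years, volume_tb)

# Extrapolate the fitted line three years past the last observation.
target_year = 7
linear_forecast = slope * target_year + intercept
actual = 2 ** target_year  # what continued doubling would really require

print(f"linear forecast for year {target_year}: {linear_forecast:.1f} TB")
print(f"continued doubling for year {target_year}: {actual} TB")
```

With these made-up numbers, the fitted line predicts about 24 TB for year 7, while continued doubling would require 128 TB. A plan sized to the regression line would be short by a factor of five.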

On premises “capacity on demand” pricing models and/or cloud computing are possible panaceas for better reacting to business needs by bursting into extra compute, storage and analytic processing when needed. Access to cloud computing can definitely help “reduce the need to forecast” for traffic.

However, many businesses won’t have a plan in place, much less the capability or designed processes—at the ready—to access extra computing power or storage at a moment’s notice. In other words, many IT shops know “the cloud” is out there, but they have no idea how they’d access what they need without a whole lot of research and planning first. By then, the market opportunity may have passed.

Businesses must be ready to scale (where possible) to more capacity in minutes or hours—not days, weeks or months. This likely means having a cloud strategy in place, completion of vendor negotiation (if necessary), adaptable and agile business processes, identifying and architecting workloads for the cloud, and a tested “battle plan” so that when demands for extra resources filter in, you’re ready to respond to whatever the volatile marketplace requires.

Of Black Swans and Taking Showers

Those pesky “Black Swans” (extremely low probability events with a large impact) have been burned into the lexicon of just about every MBA. With “Black Swan awareness” in mind, executives are counseled to be on the prowl for unlikely events that could potentially wipe out their headquarters, disrupt supply chains, bankrupt the company, or even lead to employee deaths. And while Black Swans are definitely reasons to seek safer ground, the biggest risk is often right under your nose!

Courtesy of Flickr, by kingslandscaping.ca

Author Jared Diamond writes for the Weekend Financial Times about the importance of understanding everyday risks. In studies of New Guineans, he cites how jungle dwellers will refuse to camp out underneath a dead tree. That’s because even though the risk of such a tree falling on you is 1 in 1000, if one does enough camping under dead trees, it will eventually lead to an untimely death. Diamond writes: “New Guineans have learnt from experience which are the real dangers in their lifestyle and they remain constantly alert to those dangers.”
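
The arithmetic behind that caution can be sketched in a few lines. The example assumes the 1-in-1000 figure applies to each night spent under a dead tree and that nights are independent of one another. Both are simplifying assumptions made here, and the function name is hypothetical.

```python
def cumulative_risk(per_exposure_risk: float, exposures: int) -> float:
    """Probability of at least one bad outcome across repeated exposures.

    Assumes every exposure carries the same risk and that exposures
    are independent: P = 1 - (1 - p) ** n.
    """
    return 1.0 - (1.0 - per_exposure_risk) ** exposures


if __name__ == "__main__":
    for nights in (1, 100, 1000, 3000):
        risk = cumulative_risk(1 / 1000, nights)
        print(f"{nights:>5} nights under a dead tree -> {risk:.1%} chance of being hit")
```

Under these assumptions, a risk that looks negligible on any single night grows to just under 10% after 100 nights and to roughly 63% after 1,000 nights.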

Speaking of daily dangers, Diamond mentions that older adults are more likely to die slipping and falling in the shower than in any horrific event the mind can conjure.

I blame the media for this over-estimation of extreme events. Simply open a newspaper and you’ll think the world is coming to an end. Fires, earthquakes, wars, pestilence and central bankers (yes, the inclusion of central bankers in “disasters” is intentional) are prominent.  With evidence of just your local paper or watching one hour of CNN, it’s easy to believe the world is filled with one extreme disaster after another.

And while Black Swans are definitely something to prepare for in terms of creating more robust or anti-fragile structures, it’s often daily events that are more likely to hurt us. Case in point, while dying in an airplane crash is a horrific way to check out (1:250,000 probability of death in a given year), it’s much more statistically likely that you’ll meet your maker traveling in a car to your local grocery store (1:5000).

So, yes, prepare as much as you can for extreme events.  Identify the “known knowns” for which you have probability theory to assist, the “known unknowns” where Bayes might help, and try to build robustness for the very infrequent Black Swan type “unknown unknowns”.

But in all this, please pay attention to the risks closer to home, those dangers you might face every day. And watch out when taking a shower. That first step in could be quite a doozy.

Private Clouds Are Here to Stay—Especially for Data Warehousing

Some cloud experts are proclaiming private clouds “are false clouds”, or that the term was conveniently conjured to support vendor solutions. There are other analysts willing to hedge their bets by proclaiming that private clouds are a good solution for the next 3-5 years until public clouds mature.  I don’t believe it. Private clouds are here to stay (especially for data warehousing)—let me tell you why.

For starters, let’s define public vs. private cloud computing. NIST and others do a pretty good job of defining public clouds and their attributes. They are remote computing services that are typically elastic, scalable, built on internet technologies, self-service, metered by use and more. Private clouds, on the other hand, are proprietary and typically behind the corporate firewall. And they frequently share most of the characteristics of public clouds.

However, there is one significant difference between the two cloud delivery models: public clouds are usually multi-tenant (i.e. shared with other entities/corporations/enterprises), while private clouds are typically dedicated to a single enterprise (i.e. not shared with other firms). I realize the above definitions are not accepted by all cloud experts, but they’re common enough to set a foundation for the rest of the discussion.

With the definition that private clouds equate to a dedicated environment for a single or common enterprise, it’s easy to see why they’ll stick around—especially for data warehousing workloads.

First, there’s the issue of security. No matter how “locked down” or secure a public cloud environment is said to be, there’s always going to be an issue of trust that will need to be overcome by contracts and/or SLAs (and possibly penalties for breaches).  Enterprises will have to trust that their data is safe and secure—especially if they plan on putting their most sensitive data (e.g. HR, financial, portfolio positions, healthcare and more) in the public cloud.

Second, there’s an issue of performance for analytics.  Data warehousing requirements such as high availability, mixed workload management, near real-time data loads and complex query execution are not easily managed or deployed using public cloud computing models. By contrast, private clouds for data warehousing offer higher performance and predictable service levels expected by today’s business users. There are myriad other reasons why public clouds aren’t ideal for data warehousing workloads and analyst Mark Madsen does a great job of explaining them in this whitepaper.

Third, in the multi-tenant environment of public cloud computing, there is increasing complexity which will lead to more cloud breakdowns. In a public cloud environment there are lots of moving pieces and parts interacting with each other (not necessarily in a linear fashion) within any given timeframe. These environments can be complex and tightly coupled where failures in one area easily cascade to others. For data warehousing customers with high availability requirements public clouds have a long way to go.  And the almost monthly “cloud breakdown” stories blasted throughout the internet aren’t helping their cause.

Finally, there’s the issue of control. Corporate IT shops are mostly accustomed to having control over their own IT environments. In terms of flexibly outsourcing some IT capabilities (which is what public cloud computing really is), IT is effectively giving up some/all control over their hardware and possibly software.  When there are issues and/or failures, IT is relegated to opening up a trouble ticket and waiting for a third party provider to remedy the situation (usually within a predefined SLA).  In times of harmony and moderation, this approach is all well and good. But when the inevitable hiccup or breakdown happens, it’s a helpless feeling to be at the mercy of another provider.

When embarking on a public cloud computing endeavor, a company or enterprise is effectively tying its fate to another provider for specific IT functions and/or processes. Key questions to consider are:

  • How much performance do I need?
  • What data do I trust in the cloud?
  • How much control am I willing to give up?
  • How much risk am I willing to accept?
  • Do I trust this provider?

There are many reasons why moving workloads to the public cloud makes sense, and in fact your end-state will likely be a combination of public and private clouds.  But you’ll only want to consider public cloud after you carefully think about the above questions.

And inevitably, once answers to these questions are known, you’ll also conclude private clouds are here to stay.