Driving Data: A Slippery Ethical Slope?

Image courtesy of Flickr. By Michael Loke

When thinking about telematics, it’s easy to conjure up images of fleet tracking via GPS, satellite navigation systems for driving directions, or even the ubiquitous on-board security and diagnostic systems. However, what’s less understood is that data on your driving habits, locations and more are being collected, sometimes without your explicit knowledge.

Most people don’t realize that driving data are being collected in 80% of the cars sold in the United States.  According to an Economist article, event data recorders (EDRs) are installed in most cars to analyze how airbags are deployed.  Some EDRs can also record events such as “forward and sideway acceleration and deceleration, vehicle speed, engine speed and steering inputs.”

The Economist article also says EDR data can show if a driver stepped on the gas just before an accident, or how quickly brakes were applied. And EDRs can also record whether seat belts were locked. These data can be used to augment a police crash report, corroborate accident events as remembered by a driver, or even be used against a driver when negligence is suspected.

This brings to mind a key question – who owns this data? The Economist article says that if you are the car owner, it’s probably you. However, if your car is totaled from a crash, and you sell it to the insurance company as part of a claim resolution process, then it’s likely your insurance company now owns the data.

Data can be used for purposes advantageous and disadvantageous to a driver.

An MIT Technology Review article cites how a new $70 device is now available to hook into your car’s EDR. This device wirelessly transmits data via Bluetooth to your mobile phone on your driving efficiency, cost of your daily commute, and information on possible engine issues.  And the company providing the device can deliver a “score” for your driving habits, gas savings and safety in relation to other drivers.

Driving data can also be collected for things you did not intend. For example, a team of scientists used mobile phone location data gleaned from wireless networks to detect commute patterns from more than 1 million users over three weeks in the San Francisco Bay Area.

These scientists discovered “cancelling some car trips from strategically located neighborhoods could drastically reduce gridlock and traffic jams.”  In other words, some neighborhoods are responsible for a fair portion of Bay Area freeway congestion.  The scientists claimed by cancelling just 1% of trips from these neighborhoods, congestion for everyone else could be reduced by 14%.

Of course, drivers in urban areas could be incentivized to use public transportation, carpool or telecommute, but it’s also possible that a more heavy-handed government approach could restrict commutes from these neighborhoods—on certain days—“for the good of all.”

Data are, of course, benign. However, driving data from GPS and other devices are collected daily—and sometimes without your consent.

Altruistically, these data may ultimately be used to design better cars, better freeways and improve the overall quality of life for everyone concerned. Yet, it’s also important to realize that mobile data from daily road travels can also be utilized for tracking purposes, to pin down exactly where you are located at any given moment in time, and how you arrived.

And that thought should give everyone pause.

Will Pay-Per-Use Pricing Become the Norm?

CIOs across the globe have embraced cloud computing for myriad reasons, but a key argument is cost savings. If a typical corporate server is utilized only 5-10% over the life of the asset, it's fair to argue the CIO paid roughly 10x too much for that asset relative to full utilization. To get better value, a CIO has two choices: embark on a server consolidation project, or use cloud computing models to access processing power and/or storage, when needed, on a metered basis.
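The arithmetic behind that claim is simple enough to sketch. A minimal Python check, using the 5-10% utilization figures cited above (the function name is my own, purely for illustration):

```python
# Back-of-the-envelope check of the utilization argument: a server
# utilized 10% of the time effectively costs 10x per useful
# compute-hour compared to a fully utilized one.

def effective_cost_multiplier(avg_utilization: float) -> float:
    """Cost multiplier per *useful* compute-hour vs. full utilization."""
    if not 0 < avg_utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return 1 / avg_utilization

print(effective_cost_multiplier(0.10))  # 10% utilization -> paid ~10x
print(effective_cost_multiplier(0.05))  # 5% utilization -> paid ~20x
```

This is exactly the gap that metered, pay-per-use cloud pricing is meant to close.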

Cloud computing isn't the only place where utility-based pricing is taking off. An article in the Financial Times shows how the use of "Big Data," in terms of volume, variety and velocity, is stoking a revolution in real-time, pay-per-use pricing models.

The FT article cites Progressive Insurance as an example. With the simple installation of a device that measures speed, braking, location and other data points, Progressive can gather multiple data streams and compute a usage-based price for drivers who want to reduce their premiums. For example, rates may vary depending on how hard a customer brakes, how "heavy they are on the accelerator," or how many miles they drive.

The installed device streams automobile data wirelessly back to Progressive's corporate headquarters, where billing computations take place in near real time. Of course, the driver must be willing to embark upon such a pricing endeavor, and possibly give up some privacy; however, this is often a small price to pay for a pricing model that correlates safer driving habits with a lower insurance premium.

And this is just the tip of the iceberg. Going a step further to true utility based pricing, captured automobile data points also make it possible to create innovative pricing models based on other risk factors.

For example, if an insurance company decides it is riskier to drive to certain locales, or between 2am and 5am, it can attach a "premium price" to those decisions, effectively letting a driver choose their own insurance rate. Even more futuristic, it might one day be possible to be charged more or less based on how many passengers are riding with you!
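To make the idea concrete, here is a purely hypothetical sketch of such a pricing function in Python. The base rate, per-mile charge, braking penalty and night-driving surcharge are all invented for illustration; they are not Progressive's actual model:

```python
# Hypothetical usage-based premium: rates, weights and surcharges are
# invented for illustration, not any insurer's real formula.

def monthly_premium(miles: float, hard_brakes: int, night_trips: int,
                    base_rate: float = 100.0) -> float:
    premium = base_rate
    premium += 0.02 * miles        # per-mile usage charge
    premium += 1.50 * hard_brakes  # penalty per hard-braking event
    premium += 5.00 * night_trips  # surcharge for 2am-5am trips
    return round(premium, 2)

# A gentle commuter vs. an aggressive night driver:
print(monthly_premium(miles=400, hard_brakes=2, night_trips=0))    # 111.0
print(monthly_premium(miles=1200, hard_brakes=15, night_trips=6))  # 176.5
```

The point is that once the data streams exist, any risk factor the insurer can measure becomes a line item in the formula.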

Whether it is time-of-day pricing for electricity, cloud computing, or pay-as-you-go insurance, the explosion of "big data" and related technologies makes it possible to stream and collect data, calculate a price and bill a customer in a matter of minutes. The key considerations will be consumer acceptance of such pricing models (given the privacy tradeoffs) and adoption rates.

If the million "data collection" devices Progressive has already installed are any indication (not to mention the general acceptance of utility-priced cloud computing models), it appears we've embarked upon a journey from which it's far too late to turn back.

Private Clouds Are Here to Stay—Especially for Data Warehousing

Some cloud experts are proclaiming private clouds “are false clouds”, or that the term was conveniently conjured to support vendor solutions. There are other analysts willing to hedge their bets by proclaiming that private clouds are a good solution for the next 3-5 years until public clouds mature.  I don’t believe it. Private clouds are here to stay (especially for data warehousing)—let me tell you why.

For starters, let's define public vs. private cloud computing. NIST and others do a pretty good job of defining public clouds and their attributes: they are remote computing services that are typically elastic, scalable, self-service, metered by use, and built on internet technologies. Private clouds, on the other hand, are proprietary and typically sit behind the corporate firewall, yet they frequently share most of the characteristics of public clouds.

However, there is one significant difference between the two delivery models: public clouds are usually multi-tenant (i.e. shared with other entities/corporations/enterprises), while private clouds are typically dedicated to a single enterprise, i.e. not shared with other firms. I realize these definitions are not accepted by all cloud experts, but they're common enough to set a foundation for the rest of the discussion.

With the definition that private clouds equate to a dedicated environment for a single or common enterprise, it’s easy to see why they’ll stick around—especially for data warehousing workloads.

First, there’s the issue of security. No matter how “locked down” or secure a public cloud environment is said to be, there’s always going to be an issue of trust that will need to be overcome by contracts and/or SLAs (and possibly penalties for breaches).  Enterprises will have to trust that their data is safe and secure—especially if they plan on putting their most sensitive data (e.g. HR, financial, portfolio positions, healthcare and more) in the public cloud.

Second, there’s an issue of performance for analytics.  Data warehousing requirements such as high availability, mixed workload management, near real-time data loads and complex query execution are not easily managed or deployed using public cloud computing models. By contrast, private clouds for data warehousing offer higher performance and predictable service levels expected by today’s business users. There are myriad other reasons why public clouds aren’t ideal for data warehousing workloads and analyst Mark Madsen does a great job of explaining them in this whitepaper.

Third, the multi-tenant environment of public cloud computing brings increasing complexity, which in turn leads to more cloud breakdowns. In a public cloud environment there are lots of moving pieces and parts interacting with each other (not necessarily in a linear fashion) at any given time. These environments can be complex and tightly coupled, where failures in one area easily cascade to others. For data warehousing customers with high availability requirements, public clouds have a long way to go. And the almost monthly "cloud breakdown" stories blasted across the internet aren't helping their cause.

Finally, there’s the issue of control. Corporate IT shops are mostly accustomed to having control over their own IT environments. In terms of flexibly outsourcing some IT capabilities (which is what public cloud computing really is), IT is effectively giving up some/all control over their hardware and possibly software.  When there are issues and/or failures, IT is relegated to opening up a trouble ticket and waiting for a third party provider to remedy the situation (usually within a predefined SLA).  In times of harmony and moderation, this approach is all well and good. But when the inevitable hiccup or breakdown happens, it’s a helpless feeling to be at the mercy of another provider.

When embarking on a public cloud computing endeavor, a company or enterprise is effectively tying their fate to another provider for specific IT functions and/or processes.   Key questions to consider are:

  • How much performance do I need?
  • What data do I trust in the cloud?
  • How much control am I willing to give up?
  • How much risk am I willing to accept?
  • Do I trust this provider?

There are many reasons why moving workloads to the public cloud makes sense, and in fact your end-state will likely be a combination of public and private clouds.  But you’ll only want to consider public cloud after you carefully think about the above questions.

And inevitably, once answers to these questions are known, you’ll also conclude private clouds are here to stay.


Data, Feces and the Future of Healthcare

University of California computer scientist Dr. Larry Smarr is a man on a mission—to measure everything his body consumes, performs, and yes, discharges. For Dr. Smarr, this data collection has a goal –to fine tune his ecosystem in order to beat a potentially incurable disease. Is this kind of rigorous information collection and analysis the future of healthcare?

Talk to a few friends and you'll probably find some who count calories, count steps, or even chart their exercise and eating regimens. But it's not very likely that your friends are quantifying their personal lives like Larry Smarr.

Atlantic Magazine's June/July 2012 issue describes Dr. Larry Smarr's efforts to capture his personal data, though not the financial or internet-browsing kind. Dr. Smarr is capturing health data, and lots of it. He uses armbands to record skin temperature, headbands to monitor sleep patterns, has blood drawn eight times a year, gets MRIs and ultrasounds when needed, and undergoes regular colonoscopies. And of course, he writes down every bite of food and even collects his own stool samples, shipping them to a laboratory for analysis.

Monitoring calories makes sense, but stools are also "information rich," says Smarr. "There are about 100 billion bacteria per gram. Each bacterium has DNA whose length is typically one to ten megabases—call it one million bytes of information," Smarr exclaims. "This means human stool has a data capacity of 100,000 terabytes of information (about 100 petabytes) stored per gram." And all kinds of interesting information about the digestive tract, liver and pancreas can be culled from feces, including signs of infection, nutrient absorption problems and even cancer.
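Smarr's arithmetic holds up; a quick Python sanity check of the figures quoted above:

```python
# Checking the quoted figures: 100 billion bacteria per gram, each
# genome roughly one megabase, which Smarr calls a million bytes.

bacteria_per_gram = 100e9     # 100 billion bacteria
bytes_per_bacterium = 1e6     # ~1 megabase ~= one million bytes

bytes_per_gram = bacteria_per_gram * bytes_per_bacterium
terabytes = bytes_per_gram / 1e12
petabytes = bytes_per_gram / 1e15

print(terabytes)  # 100000.0 terabytes per gram
print(petabytes)  # 100.0 petabytes per gram
```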

Armed with all this health data, Dr. Smarr is attempting to “model” his ecosystem. This means producing a working model that when fed inputs, can help report, analyze and eventually predict potential health issues. Just as sensor and diagnostic data are useful for auto manufacturers to perform warranty and quality analysis, Dr. Smarr is collecting and analyzing data to fine tune how his human body performs its functions.

But there's more to the story. In his charting process, Dr. Smarr noticed his C-reactive protein (CRP) count, a marker that rises in response to inflammation, was high. "Troubled, I showed my graphs to my doctors and suggested that something bad was about to happen," he says. Believing his elevated CRP count was acting as an early warning system, Smarr was dismissed by doctors as too caught up in finding a problem where there was none.

Two weeks later Dr. Smarr felt a severe pain in the side of his abdomen.  This time, the doctors diagnosed him with an acute bout of diverticulitis (bowel inflammation) and told him to take antibiotics. But Dr. Smarr wasn’t convinced. He tested his stools and came up with additional alarming numbers that suggested his diverticulitis was perhaps something more—early Crohn’s disease which is an incurable and uncomfortable GI tract condition.  The diagnosis of Crohn’s was subsequently confirmed by doctors.

Critics of "measuring everything" in healthcare suggest that by focusing on massive personal data collection and analysis we'll all turn into hypochondriacs, looking for ghosts in the machine where there are none. Or, as Nassim Taleb argues, the more variables we test, the disproportionately higher the number of spurious results that appear "statistically significant." There is also the argument that predictive analytics may do more harm than good by suggesting the potential for illness in a patient who may never develop the disease. Correlation is not causation, in other words.

That said, you'd have a hard time convincing Dr. Smarr that patients, healthcare providers and even society at large couldn't benefit from quantifying and analyzing inputs and outputs to gain a better understanding of our own "system health." And fortunately, thanks to Moore's Law and today's software applications, applying brute-force computation to such data-rich problems is not only possible, it's available now.

However, what makes sense conceptually is often much harder to implement in the real world. A sluggish healthcare system, data privacy issues, and a shortage of data scientists to perform big data analysis are all potential roadblocks to the "quantified life" becoming a reality—for everyone—any time soon.


  • Do the data collection and analysis methods described in this article portend a revolution in healthcare?
  • If everyone rigorously collects and analyzes their personal health data, could this end up raising or reducing overall healthcare costs?

What’s Next – Predictive “Scores” for Health?

In the United States, health information privacy is protected by the Health Insurance Portability and Accountability Act (HIPAA). However, new gene sequencing technologies now make it feasible to read an individual's DNA for as little as $1,000. If there is predictive value in reading a person's gene sequence, what are the implications of this advancement? And will healthcare data privacy laws be enough to protect employees from discrimination?

The Financial Times reports a breakthrough in technology for gene sequencing, where a person’s chemical building blocks can be catalogued—according to one website—for scientific purposes such as exploration of human biology and other complex phenomena. And whereas DNA sequencing was formerly a costly endeavor, the price has dropped from $100 million to just under $1,000 per genome.

These advances are built on the back of Moore's Law (computing power doubling roughly every 18 months), paired with plummeting data storage costs and increasingly sophisticated software for data analysis. And from a predictive analytics perspective, there is quite a bit of power in discovering which medications might work best for a given patient's condition based on their genetic profile.
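The scale of the cost decline mentioned above is worth putting in those exponential terms: a drop from $100 million to $1,000 per genome is a 100,000-fold reduction, or roughly 17 successive halvings of the price. A quick check in Python:

```python
import math

# Sequencing cost decline: from ~$100 million to ~$1,000 per genome,
# expressed as a reduction factor and as a number of price halvings.

old_cost = 100_000_000  # dollars per genome, roughly a decade earlier
new_cost = 1_000        # dollars per genome today

reduction = old_cost / new_cost
halvings = math.log2(reduction)

print(reduction)           # 100000.0
print(round(halvings, 1))  # 16.6 successive halvings
```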

However, as Stan Lee's Spider-Man reminds us, with great power comes great responsibility.

The Financial Times article notes: "Some fear scientific enthusiasm for mass coding of personal genomes could lead to an ethical minefield, raising problems such as access to DNA data by insurers." After all, if there is indeed predictive value in analyzing a patient's genome, it might be possible to offer or deny that patient health insurance—or employment—based on the potential risk of developing a debilitating disease.

In fact, it may become possible in the near future to assign a certain patient or group of patients something akin to a credit score based on their propensity to develop a particular disease.

And a predictive "score" for diseases isn't too outlandish a thought, especially when futurists such as Aaron Saenz forecast: "One day soon we should have an understanding of our genomes such that getting everyone sequenced will make medical sense."

Perhaps in the near future getting everyone sequenced will make medical sense (for both patient and societal benefit), but there will likely need to be newer and more stringent laws (and associated penalties for misuse) to ensure such information is protected and not used for unethical purposes.


  • With costs for DNA sequencing now around $1000 per patient, it’s conceivable universities, research firms and other companies will pursue genetic information and analysis. Are we opening Pandora’s Box in terms of harvesting this data?

Has Personalized Filtering Gone Too Far?

In a world of plenty, algorithms may be our saving grace as they map, sort, reduce, recommend, and decide how airplanes fly, packages ship, and even who shows up first in online dating profiles. But in a world where algorithms increasingly determine what we see and don’t see, there’s danger of filtering gone too far.

The global economy may be a wreck, but data volumes keep advancing. In fact, there is so much information competing for our limited attention that companies are increasingly turning to compute power and algorithms to make sense of the madness.

The human brain has its own methods for dealing with information overload. Think, for example, about the millions of inputs the human eye receives each day and how it transmits and coordinates that information with the brain. A task as simple as stepping down a shallow flight of stairs takes incredible information processing. Of course, not all of the data points received are relevant to the task of walking a stairwell, so the brain must decide which data to process and which to ignore. And with our visual systems bombarded with sensory input from the time we wake until we sleep, it's amazing the brain can do it all.

But the brain can’t do it all—especially not with the onslaught of data and information exploding at exponential rates. We need what author Rick Bookstaber calls “artificial filters,” computers and algorithms to help sort through mountains of data and present the best options. These algorithms are programmed with decision logic to find needles in haystacks, ultimately presenting us with more relevant choices in an ocean of data abundance.

Algorithms are at work all around us. Google's PageRank presents us with relevant results—in real time—captured from web server farms across the globe. Match.com sorts through millions of profiles, seeking compatible matches for subscribers. And Facebook shows us friends we should "like."

But algorithmic programming can go too far. As humans are inundated with ever more information, there's a danger in turning over too much "pre-cognitive" work to algorithms. When we have computers sort the friends we would "like," pick the most relevant advertisements or best travel deals, and choose ideal dating partners for us, there's a danger of missing the completely unexpected discovery or the most improbable correlation. And even as algorithms "watch" and process our online behavior and learn what makes us tick, there's still a high possibility that the results presented will be far removed from what we might consider "the best choice."
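A toy Python example makes the concern concrete: identical data can yield a different "best choice" depending on the weights the filter's designer picked. The items and weights below are invented for illustration:

```python
# Toy ranking filter: the "best" result depends entirely on weights
# chosen by the algorithm's designer, not by the user.

items = [
    {"name": "indie bookshop", "relevance": 0.9, "ad_revenue": 0.1},
    {"name": "big-box store",  "relevance": 0.6, "ad_revenue": 0.9},
]

def rank(items, w_relevance, w_revenue):
    """Sort items by a designer-chosen blend of relevance and revenue."""
    return sorted(items,
                  key=lambda i: (w_relevance * i["relevance"] +
                                 w_revenue * i["ad_revenue"]),
                  reverse=True)

# A user-centric weighting and a revenue-centric weighting pick
# different winners from the very same data:
print(rank(items, w_relevance=1.0, w_revenue=0.0)[0]["name"])  # indie bookshop
print(rank(items, w_relevance=0.2, w_revenue=1.0)[0]["name"])  # big-box store
```

The user never sees the weights, only the winner, which is precisely why it matters who designs the filter.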

With a data flood approaching, there's a temptation to let algorithms do more and more of our pre-processing cognitive work. And if we continue to let algorithms "sort and choose" for us, we should be extremely careful to understand who's designing these algorithms and how they decide. Perhaps it's cynical to suggest it, but when it comes to algorithms we should always ask ourselves: are we really getting the best choice, or the choice that someone or some company has designed for us?

  • Rick Bookstaber makes the case that personalized filters may ultimately reduce human freedom. He says, "If filtering is part of thinking, then taking over the filtering also takes over how we think." Are there dangers in too much personalized filtering?

Data Tracking for Asthma Sufferers?

Despite the recent privacy row over smartphones and other GPS-enabled devices, a Wisconsin doctor is proposing the use of an inhaler with a built-in global positioning system to track where and when asthma sufferers use their medication. By capturing data on inhaler usage, the doctor proposes, asthma sufferers can learn more about what triggers an attack, and the medical community can learn more about this chronic condition. However, the use of such a device has privacy implications that need serious consideration.

For millions of people worldwide, asthma is no joke. An April 9, 2011 Economist article mentions that asthma affects more than 300 million people, almost 5% of the world's population.

Scientists and the medical community have long pondered the question: what triggers an asthma attack? Is it pollen, dust in the air, mold spores or other environmental factors? The answer is relevant not only to asthma sufferers themselves but also to society (and healthcare costs), as there are more than 500,000 asthma-related hospital admissions every year.

In an effort to better understand the factors behind asthma attacks, Dr. David Van Sickle co-founded a company that makes an inhaler with GPS to track usage. Van Sickle once worked for the Centers for Disease Control (CDC), and he believes that with better data, society can understand asthma more deeply. By capturing data on inhaler usage and plotting the results with visualization tools, Van Sickle hopes this information can be sent back to primary care physicians to help patients understand their asthma triggers.
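As a rough illustration of what such usage data could enable, here is a minimal Python sketch that aggregates hypothetical inhaler events by location and hour of day; the event records are invented, not from any real device:

```python
from collections import Counter

# Hypothetical GPS-inhaler events: where and when each puff occurred.
# Aggregating them can hint at environmental or time-of-day triggers.

events = [
    {"patient": "p1", "location": "downtown", "hour": 8},
    {"patient": "p1", "location": "downtown", "hour": 9},
    {"patient": "p2", "location": "park",     "hour": 14},
    {"patient": "p3", "location": "downtown", "hour": 8},
]

by_location = Counter(e["location"] for e in events)
by_hour = Counter(e["hour"] for e in events)

print(by_location.most_common(1))  # [('downtown', 3)]
print(by_hour.most_common(1))      # [(8, 2)]
```

Even this trivial aggregation shows why the data is sensitive: the same records that reveal triggers also reveal exactly where each patient was, and when.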

A better understanding of asthma makes sense for patients, health insurers and society at large. The Economist article notes that pilot studies of device usage thus far have resulted in basic understandings of asthma coming into question. However, there are surely privacy implications in the capture, management and use of this data, despite reassurances from the medical community that data will be anonymized and secured.

Should societal and patient benefits outweigh privacy concerns when it comes to tracking asthma patients? What do you think?  I’d love to hear from you.