Big data. The cloud. These are two buzzwords that have been gaining momentum and just won’t go away.
You can stick your fingers in your ears if you want to. But as soon as you take them out, you’ll hear someone talking about big data in the cloud.
What’s so big about big data? It’s simply a pile of data that is too big to fit on your hard drive. Since it’s that big, it’s also a real pain to do anything with.
What’s so cloudy about the cloud? It’s simply a way to access all that big data from wherever it is you need to access it. Since all that big data doesn’t fit on your hard drive, it needs to be stored somewhere. That somewhere is a giant pile of servers called "the cloud" that exists "somewhere else."
So how does all this data come about? Why, after more than 25 years of personal computing, has this issue of giant piles of data just now started to become a big deal? Are people really doing more transactions and sales and deals than they were before? Has the volume of all these activities increased so much that we’ve outstripped the supply of hard drives in existence?
There are a handful of things driving big data:
- Things that weren’t being tracked and measured are now being tracked and measured. This generates a pile of data.
- New activities that didn’t exist before are now being tracked and measured. This generates a pile of data.
- Robots are generating piles of data.
Basically, the whole computer revolution plus Internet scenario made it so easy to track and store data that people did. And as our ability to track and store new types of data (think the social graph of sites like Facebook, or the Internet of Things, or GPS tracks) comes online, we continue to track it. In many cases, simply because we can.
"Oh, it’s one extra line of code at the bottom of the website and then we can track a zillion different things about our Web visitors? Sure, let’s do it."
But what’s really starting to pile up the data is the "robots" or scripts. Now we have tracking robots tracking the activity of tracking robots. It’s a giant digital Droste effect run wild. Robots checking in on each other and on activity and on who is connecting to what and where. Not a lot of "why" in all this tracking, but I’m sure we’ll get there eventually.
The data has become so voluminous and, in some cases, so hard to understand that we’ve thrown up our hands at the idea of organizing or structuring it. Now we mess around with unstructured data. Free of the prison gates of the spreadsheet, all that data can just get lumped and piled together for a new class of superhuman, the data scientist, to pore over (like reading tea leaves) and assemble a future based on predictive analytics.
Oh yes, and there’s this thing about all this tracking and measuring. People don’t like it. People don’t like being tracked and measured. They don’t like "security" cameras in dressing rooms. They don’t like X-ray undressers at airports. And they don’t like people snooping around the websites they look at.
So as this pile of data gets bigger and bigger, there’s the issue of Personally Identifiable Information, or PII in the lingo of the data-driven. And yes, by the way, dealing with PII just makes the collection of data that much bigger, because you end up swapping the identifying bit for an unidentifying bit plus a certification bit.
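To make the "swap the identifying bit for an unidentifying bit" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the field names, the `SECRET_KEY` value and the `pseudonymize` and `strip_pii` helpers are hypothetical, and a keyed hash is just one common way to do the swap. Note that this is pseudonymization, not true anonymization: whoever holds the key can still match tokens back to people.

```python
import hashlib
import hmac

# Hypothetical secret key, for illustration only. In real use it would be
# stored separately from the data it protects.
SECRET_KEY = b"example-key-not-for-production"


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable token that stands in for an identifying value."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    # Keep the token short for readability; 16 hex characters = 64 bits.
    return digest.hexdigest()[:16]


def strip_pii(record: dict, pii_fields: tuple = ("email",)) -> dict:
    """Copy a record, replacing each PII field with a '<field>_token' field."""
    cleaned = {}
    for field, value in record.items():
        if field in pii_fields:
            cleaned[field + "_token"] = pseudonymize(str(value))
        else:
            cleaned[field] = value
    return cleaned


# A made-up Web visit record.
visit = {"email": "pat@example.com", "page": "/pricing", "seconds_on_page": 42}
cleaned_visit = strip_pii(visit)
print(cleaned_visit)
```

The same email always maps to the same token, so you can still count repeat visits without keeping the email itself in the pile.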
But what’s the force that’s making the collection of all of this data even possible? These robots and so on?
Naturally occurring data.
That’s the phrase to follow and to understand as deeply as possible when examining data issues online. What is naturally occurring online and can be measured makes up the world of naturally occurring data.
The classic example of naturally occurring data has to do with good old-fashioned television measurement. Getting people to fill out those Nielsen surveys was hard. People don’t want to do it. And even when they do, you end up with some wacky, screwy sample bias. It’s the old, "According to people who fill out surveys, your favorite television show should be canceled."
But the cable box knows when the TV is on. It also knows what channel is being watched. And configuring that little box to beam back data to the mother ship means that TV watchers are no longer burdened with having to fill out forms to find out whether their favorite TV show should be canceled.
The cable box can simply measure what is naturally occurring: people turning on the TV and doing stuff.
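Here is a rough sketch, in Python, of what measuring "what is naturally occurring" might look like once the boxes report back. The event format, box IDs and channel names are all made up for illustration; no real cable system's data format is implied.

```python
from collections import Counter

# Hypothetical tune events reported by cable boxes:
# (box_id, channel, minutes_watched)
events = [
    ("box-1", "News 4", 30),
    ("box-2", "News 4", 12),
    ("box-1", "Cooking", 45),
    ("box-3", "Cooking", 60),
    ("box-3", "News 4", 5),
]


def minutes_by_channel(tune_events):
    """Add up the minutes watched per channel across all boxes."""
    totals = Counter()
    for _box_id, channel, minutes in tune_events:
        totals[channel] += minutes
    return totals


totals = minutes_by_channel(events)
# List channels from most watched to least watched.
for channel, minutes in totals.most_common():
    print(channel, minutes)
```

No surveys, no forms: the totals fall straight out of what people actually did with the remote.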
Naturally occurring data is incredibly powerful. It’s powerful because it’s simply a collection of actual behaviors by actual people. It’s powerful because it can help you understand what people are really doing. It’s up to you to figure out why they do it.
It’s also a threat. When you measure what people are actually doing instead of relying on a panel (like the old Nielsen stuff) or a focus group (aka friends of the creative director of your advertising agency), anyone can see what’s going on. It transforms, through data magic, people who were formerly shamans and high priests of knowledge/taste/design/concept into normal people with funkier haircuts and glasses.
Naturally occurring data is also an incredible threat to privacy, which is unfortunate because its real power isn’t related to the privacy threats it poses.
Somewhere in your business, there is a mother lode of naturally occurring data. It’s probably encased in government red tape and association policy glue, and held tightly by some traditional feudal data lords.
Finding that data, understanding it, learning how to access, store and protect it, discovering the alchemy that comes from successfully merging and matching it to other data sets: there’s something in this for everyone.