As regular readers of this column know, I love data. I love it when there’s lots of it. And I love it when there are several sources of it. I find it very useful.
But this week I had the opportunity to listen to a master of understanding data, Edward Tufte. He had some criticisms of big data that I think are well worth considering, especially as the buzz around big data reaches fever pitch.
I’ll do what I can to present some of his thoughts, though almost certainly with less elegance.
Big data, so what?
First up, the bigness of data isn’t that impressive really. According to Tufte, what he calls "real scientists" have been using large amounts of data for many years now. They need to in order to do things like map DNA or count the number of stars out there or work with advanced particle physics.
Business leaders and policymakers are only just now getting into the big data game by way of marketing and hardware advancements (and the people who need to sell hardware advancements).
Much of so-called "big data" is redundant
Many of the individual data points are redundant. It’s easy to see how this is so. If you dig through any relatively well-connected person’s digital address book you’ll likely find several entries that correspond to a single individual. With even larger data sets this sort of thing multiplies.
Once you go through the effort to clean up the data, maybe it isn’t so big after all. The cleanliness of data sources is a constant bugbear for a database of any size.
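To make the address book example concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the sample contacts, the `normalize` and `deduplicate` helpers, and the rule that two entries are the same person if they share an email address or, failing that, a phone number. Real data cleaning is messier than this.

```python
import re

def normalize(contact):
    """Build a comparison key: lowercased email, else phone digits, else name.

    This matching rule is an assumption chosen for the example.
    """
    email = contact.get("email", "").strip().lower()
    if email:
        return ("email", email)
    phone = re.sub(r"\D", "", contact.get("phone", ""))  # keep digits only
    if phone:
        return ("phone", phone)
    return ("name", contact.get("name", "").strip().lower())

def deduplicate(contacts):
    """Keep the first entry seen for each normalized key."""
    seen = {}
    for contact in contacts:
        seen.setdefault(normalize(contact), contact)
    return list(seen.values())

# Hypothetical address book: five entries, two actual people.
address_book = [
    {"name": "Pat Example", "email": "pat@example.com"},
    {"name": "Patricia Example", "email": "PAT@example.com "},
    {"name": "P. Example", "email": "pat@example.com", "phone": "555-0100"},
    {"name": "Sam Sample", "phone": "(555) 010-0199"},
    {"name": "Sam S.", "phone": "555 010 0199"},
]

unique = deduplicate(address_book)
print(len(address_book), "entries ->", len(unique), "people")
```

Five entries shrink to two people here. Scale the same ratio up to millions of rows and "big" starts to look a lot smaller.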
Much of so-called "big data" is irrelevant
Tufte used an example of a security camera. Imagine a security camera pointed at a corner. In the course of weeks or months it is gathering up image data. Almost all of that data is, in the first place, redundant. It’s simply the same picture over and over and over again.
In the second place, most of that data is irrelevant. Most of the time a crime is not being committed. Maybe if something happens there once a year, then 10 minutes of footage is relevant. The other 525,938 or so minutes of that year are irrelevant.
In this instance you certainly have a lot of data, big data even. But the part that’s useful isn’t big at all.
"Real science," according to Tufte, handles this by first coming up with a question and then examining the data source. Much of business data mining has this approach backwards: it digs through the data first and looks for a question afterward. As a result, it tends to draw questionable causal relationships between whatever variables happen to sit in a given set.
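The danger of mining first and asking questions later can be sketched in a few lines of Python. This is my own illustration, not something Tufte presented. It fills 40 columns with pure random noise, 20 observations each, and then goes hunting for pairs of columns that look strongly correlated. The column counts, the 0.5 threshold, and the helper names `pearson` and `strong_pairs` are all assumptions picked for the example.

```python
import random
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def strong_pairs(columns, threshold):
    """Return (i, j, r) for every pair of columns with |r| >= threshold."""
    found = []
    for (i, a), (j, b) in combinations(enumerate(columns), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            found.append((i, j, r))
    return found

# Pure noise: by construction, no column has anything to do with any other.
rng = random.Random(0)  # fixed seed so the run is repeatable
columns = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(40)]

found = strong_pairs(columns, threshold=0.5)
total_pairs = len(columns) * (len(columns) - 1) // 2
print(f"{len(found)} of {total_pairs} pairs look 'strongly' correlated")
```

With 780 pairs to test, some of them clear the bar by chance alone, even though there is nothing there to find. Start with a question and you test one pair. Start with the pile and you test all of them.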
Incentive hazards
The redundancy and irrelevancy of big data would be relatively harmless on their own. But problems arise when someone’s compensation depends on the data. And with big data, it most certainly does.
There are incentive hazards at two points in the technology discussion. One is in the sale of hardware. Maintaining and processing large amounts of data requires large amounts of hardware and often expensive software.
Making "big data" a required bullet point on any project creates a de facto capitalization requirement. This is one of the points that I think is especially important in the real estate industry.
Running the big iron that can handle big data isn’t cheap. Many of the non-venture-capital-backed operations in real estate would have trouble competing in a world where big data was a requirement. Certainly the "Balkanized" nature of the industry wouldn’t help.
This is a strategic situation. And it’s one that is actively in play.
The other layer of incentive hazard is that anyone involved in examining or "mining" big data is all but required to come up with some vast conclusion or insight. This is dangerous.
This is especially dangerous when those doing the examining and mining are the same people who will benefit from the findings. It is the classic "pay the auto repair shop to tell you what’s wrong with your car" problem.
In what Tufte calls "real science" there is no incentive to find something when nothing is there. In fact, with the system of peer review in journals there’s a pretty heavy disincentive to make stuff up.
Business culture is not like this. If a business pays a consultant a lot of money to examine the gigantic pile of data (which may be irrelevantly and redundantly giant), then the consultant had better come up with something, or else the money will be considered "wasted" and the person who commissioned the work will have failed. The temptation to take a CYA (cover your ass) approach scales with the size of the data and the size of the consultant’s fee.
Some thoughts
Tufte didn’t dissuade me from continuing to examine and use big data. But he certainly nailed some critical issues in the field. These issues are ones that are worth examining with any vendor or consultant who brings up the "need" for big data.