Data Science vs. Big Data
Two of the hottest topics at this year’s X Change Conference were, unsurprisingly, “Data Science” and “Big Data”. If you walk through airports, read Time Magazine or even watch Fox News (check out the old Semphonic World Headquarters) you’ll get plenty of hype around both. But the folks at X Change are on the front lines of this stuff – and that sort of hype doesn’t cut much ice.
I’ve argued that although there is plenty of misleading hype around the term big data, it nevertheless captures something different, real and important. And I’ve even worked to explain exactly what that is (I’ll be presenting a newer version of that argument at IBM’s IOD Conference in early November).
But what about data science? Just how real is data science and what exactly does it mean?
I didn’t know this, but one of the things I learned in the discussion at X Change is that the term was originally coined by a statistician who argued that statisticians were (and should be re-named) data scientists since they spent most of their time manipulating and experimenting with data. Given the relative demand (and pay scales) for those selling themselves as “data scientists” vs. “statisticians” that’s fairly amusing. It turns out that statisticians just needed a good marketing campaign to double or triple their salaries!
Origins aside, it’s not that easy to unpack what people mean when they talk about data science. But the emergent (and best) definition I heard at X Change is that a data scientist is someone who can work at every stage of an analysis and tackle problems that involve data manipulation, advanced statistical analysis (particularly analysis that requires custom computational or algorithmic techniques), and interpretive and expository skills. In the Huddles I was in, we ended up calling this a “Full Stack” analyst.
On this definition, I probably come reasonably close to being a data scientist. As someone with a software development background and real chops in C++ and C# (not to mention toy stuff like SQL), there’s pretty much no data manipulation I can’t do. I’m not the world’s deepest statistician though, and this would probably be my downfall in the ranks of pure data scientists. Still, I have a pretty strong history in computational and algorithmic analytics. I was coding and using Self-Organizing Maps (SOMs) back in the ’90s, I’ve created my share of true algorithmic analytic methods, I’ve done my time in SAS, and I’ve written software (and used it too) that incorporated a wide variety of advanced statistical techniques and visualization tools (in my days doing real-time technical trading analytics I programmed and used everything from Black-Scholes models to Simulated Annealing). I think I can handle the interpretive and expository stuff pretty well too.
Big whoopee, right? Being full stack is great, but how important is it really?
Back when I was programming Black-Scholes models, I had some pretty smart folks explaining the models (and corrections to those models) that they wanted me to program. They didn’t need to know C++. I didn’t need to be a stats genius. It still worked pretty well. If you’re doing data science via a team, you’re still doing data science.
I’ve no doubt that having the full stack package in a single person reduces the cycle time on projects that involve computational analytics and data manipulation. But insisting on the full stack in a single person can dramatically extend the time it takes to actually fill a position and can have a similar impact on cost. I’ve read plenty of data science job postings that could have changed the job title to “Superman” without materially impacting the odds of finding a plausible candidate.
What’s more, it isn’t clear to me that the value of a data scientist is equally distributed along all these dimensions. Frankly, I don’t think people pay my considerable rack rate because I’m full stack. Few of my current clients benefit from my skill as a C++ programmer (though I’m not saying that knowledge doesn’t sometimes come in handy on higher-level tasks).
This also makes me wonder about the true value of most people who can plausibly claim to be full stack. Going back to that original definition, who’s the group most likely to be full stack? Statisticians. Most professional statisticians may not have my programming chops, but there are many who are quite skilled in data manipulation and algorithmic analysis. Being the real McCoy, they are going to cream me in depth of statistical knowledge. But how useful are most of these people when it comes to actually performing interesting and useful analytics?
I’ll let your enterprise’s experience with statisticians answer that question.
I can’t resist adding that the one type of academic background I would never hire in our practice was…statistician. Programmers, Economists, Mathematicians, Psychologists, Bio-Med – I’ve found folks in all these disciplines who combined an ability to do analytics with a penchant for solving real-world business problems.
Why no statisticians? I think it’s because a good data scientist will think of statistics not as their discipline, but as a tool for their discipline.
So I’m deeply suspicious of data science. In a “hype-off” between data science and big data, I think data science wins by a lot. There’s a lot less there, there.
Having gone that far, I feel compelled to add a little nuance.
In my particular field – digital analytics – analysts have traditionally been far, far short of full stack. Because of the SaaS model and the lack of sophistication in digital analytics tools, it’s fair to say that most digital analysts have neither data manipulation skills, statistical analysis skills, nor computational analytics skills. The stack, far from being full, can look a bit threadbare.
That’s a legitimate problem in a world where digital analytics data is now widely available outside Web analytics tools. I don’t think it’s necessary for a great analyst to be full stack. I do think a great analyst ought to have at least one of those additional skills.
What’s more, I think that in digital analytics (and big data in general), computational analysis will be somewhat more important than it is in many other disciplines. My reasons for that are tightly bound to my arguments for why big data is different than traditional data and why statistical analysis methods often fail when it comes to digital analytic problems.
Plus, when it comes to computational analytic methods, it can be hard to build a team that works. It’s much harder for a programmer to build complex models in code than to do ETL for a statistician. You need the right combination of communication skills in both directions, and that might prove to be nearly as elusive as getting the skills in one person.
Back when I was programming trading systems or doing credit card analytics, if I wanted to use Neural Nets or SOMs, I had to program them. And to program them, I had to understand at some level how they worked. These days, those tools are available out of the box. But for much of what I think is going to work in digital analytics big data, there won’t be out-of-the-box tools. Even something as simple as the Topographic analysis I’ve written about requires custom coding.
So it’s possible that the whole data scientist thing might really do some good. With the economic rewards being heaped on those who can do computational analytics, there’s bound to be significant growth in people who are skilled at it.
I just hope they aren’t all statisticians.