How Data Scientists Turned Against Statistics
One of the most remarkable stories of the rise of “big data” is the way in which it has coincided with the decline of the denominator and our shift towards using algorithms and workflows into which we have no visibility. Our great leap into the world of data has come with a giant leap of faith that the core tenets of statistics no longer apply when one works with sufficiently large datasets. As Twitter demonstrates, this assumption could not be further from the truth.
In the era before “big data” became a household name, the small sizes of the datasets most researchers worked with necessitated great care in their analysis and made it possible to manually verify the results received. As datasets became ever larger and the underlying algorithms and workflows vastly more complex, data scientists became more and more reliant on the automated nature of their tools. In much the same way that a car driver today knows nothing about how their vehicle actually works under the hood, data scientists have become similarly detached from the tools and data that underlie their work.
More and more of the world of data analysis is based on proprietary commercial algorithms and toolkits into which analysts have no visibility. From sentiment mining and network construction to demographic estimation and geographic imputation, many fields of “big data” like social media analysis are almost entirely based on collections of opaque black boxes. Even the sampling algorithms that underlie these tools are increasingly opaque.
A media analyst a decade ago would likely have used research-grade information platforms that returned precise results with guaranteed correctness. Today that same analyst will likely turn to a web search engine or social media reporting tool that returns results as coarse estimations. Some tools even report different results each time a query is submitted, depending on how many servers in their distributed indexes respond within the allotted time. Others incorporate random seeds into their estimations.
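As a toy illustration of why such estimates drift between runs (this is a deliberately simplified sketch, not any vendor’s actual algorithm), consider a platform that answers count queries by scanning only a random 1% sample of its records and scaling the result back up:

```python
import random

# Toy sketch: estimate how many of 10 million records match a query by
# examining only a 1% random sample (numbers are invented for illustration).
population_size = 10_000_000
true_match_rate = 0.003      # 0.3% of records actually match; unknown to the analyst
sample_size = 100_000

# A fresh random sample is drawn on every run, as with per-query sampling.
matches_in_sample = sum(
    random.random() < true_match_rate for _ in range(sample_size)
)

# Scale the sample count up to the full population. Re-running the script
# produces a slightly different estimate each time, just as some commercial
# tools return different totals for the same query submitted twice.
estimate = matches_in_sample * (population_size / sample_size)
print(f"Estimated matches: {estimate:,.0f}")
```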
None of this is visible to the analysts using these platforms.
There is no “methodology appendix” attached to a keyword search in most commercial platforms that specifies precisely how much data was searched, whether and what kind of sampling was used, or how much data is missing from the platform’s index. Sentiment analyses don’t provide the code and models used to generate each score, and only a handful of tools provide histograms showing which words and constructs had the most influence on their scores. Enrichments like demographic and geographic estimates often cite the enrichment provider but offer no other insight into how those estimates were computed.
How is it that data science as a field has become OK with the idea of suspending its disbelief and simply trusting the results of the myriad algorithms, toolkits and workflows that modern large-scale analysis entails?
How did we lose the “trust but verify” mentality of past decades, in which an analyst would rigorously test, perform bakeoffs and even reverse engineer algorithms before even considering using them for production analyses?
Partially this reflects the influx of non-traditional disciplines into the data sciences.
Those without programming backgrounds aren’t as familiar with how much influence implementation details can have on the results of an algorithm. Even those with programming backgrounds rarely have the kind of extensive training in numerical methods and algorithmic implementation required to fully assess a particular toolkit’s implementation of a given algorithm. Indeed, more and more “big data” toolkits stumble over the most rudimentary numerical issues, like floating point precision and the consequences of multiplying together large numbers of very small values. Even those with deep programming experience often lack the statistics background to fully comprehend that common intuition does not always equate to mathematical correctness.
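To make the floating point issue concrete, here is a minimal Python sketch (the quantities are invented for illustration) of why naively multiplying many small probabilities silently collapses to zero, while the standard remedy of working in log space does not:

```python
import math

# Hypothetical case: combining 1,000 small probabilities of 1e-5 each,
# as one might when scoring text with a naive Bayes-style model.
probs = [1e-5] * 1000

# The naive running product underflows: the true value (1e-5000) is far
# below the smallest representable double (~5e-324), so Python returns 0.0.
naive = 1.0
for p in probs:
    naive *= p
print(naive)       # 0.0

# Summing log-probabilities preserves the information with no underflow.
log_score = sum(math.log(p) for p in probs)
print(log_score)   # roughly -11512.9
```

An analyst relying on a toolkit that makes the first choice rather than the second would never see an error, only silently wrong scores.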
As data analytics is increasingly accessed through turnkey workflows that require neither programming nor statistical understanding to use, a growing wave of data scientists hail from disciplinary fields in which they understand the questions they wish to ask of data but lack the skillsets to understand when the answers they receive are misleading.
In short, as “big data analysis” becomes a point and click affair, all of the complexity and nuance underlying its findings disappears in the simplicity and beauty of the resulting visualizations.
This reflects a broader shift: as data science becomes increasingly commercial, it is simultaneously becoming increasingly streamlined and turnkey.
Analytic pipelines that once connected open source implementations of published algorithms are increasingly turning to closed proprietary instantiations of unknown algorithms that lack even the most basic of performance and reliability statistics. Eager to project a proprietary edge, companies wrap known algorithms in unknown preprocessing steps to obfuscate their use but in doing so introduce unknown accuracy implications.
With a shift from open source to commercial software, we are losing our visibility into how our analysis works.
Rather than refuse to report the results of black box algorithms, data scientists have leapt onboard, oblivious to or uncaring of the myriad methodological concerns such opaque analytic processes pose.
Coinciding with this shift is the loss of the denominator and the trend away from normalization in data analysis.
The size of today’s datasets means that data scientists increasingly work with only small slices of very large datasets, without ever having any insight into what the parent dataset actually looks like.
Social media analytics offers a particularly egregious example of this trend.
Nearly the entire global output of social media analysis over the past decade and a half has involved reporting raw counts, rather than normalizing those results by the total output of the social platform being analyzed.
The result is that even statistically sound methodologies are led astray by their inability to separate meaningful trends in a sample from the background trends of the larger dataset from which that sample came.
For a field populated by statisticians, it is extraordinary that somehow we have accepted the idea of analyzing data we have no understanding of. It is dumbfounding that at some point we normalized the idea of reporting raw trends, like an increasing volume of retweets for a given keyword search, without being able to ask whether that finding was something distinct to our search or merely an overall trend of Twitter itself.
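To see how much the denominator matters, consider a small hypothetical sketch (every number here is invented): a keyword’s raw daily match counts appear to show strong growth, yet dividing by the platform’s total daily volume reveals that the keyword’s share of the conversation never moved at all.

```python
# Hypothetical daily match counts for a keyword search, alongside the
# platform's total daily message volume over the same four days.
keyword_matches = [1_000, 1_200, 1_500, 1_800]
total_messages  = [10_000_000, 12_000_000, 15_000_000, 18_000_000]

# The raw counts suggest interest in the keyword grew 80%...
print(keyword_matches)   # [1000, 1200, 1500, 1800]

# ...but normalizing by total platform output shows a flat 0.01% share:
# the apparent trend is simply the platform itself growing.
rates = [k / t for k, t in zip(keyword_matches, total_messages)]
print(rates)             # [0.0001, 0.0001, 0.0001, 0.0001]
```

Without access to the platform’s total output, the denominator in this calculation, the analyst has no way to tell these two stories apart.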
The datasets underlying the “big data” revolution are changing existentially in real time, yet the workflows and methodologies we use to analyze them proceed as if they were static.
Even when confronted with the degree to which their datasets are changing and the impact of those changes on the findings they publish, many data scientists surprisingly push back on the need for normalization or an increased understanding of the denominator of their data. The lack of a solid statistical foundation means many data scientists don’t understand why reporting raw counts from a rapidly changing dataset can lead to incorrect findings.
Putting this all together, how is it that in a field that is supposedly built upon statistics and has so many members who hail from statistical backgrounds, we have reached a point where we have seemingly thrown away the most basic tenets of statistics like understanding the algorithms we use and the denominators of the data we work with? How is it that we’ve reached a point where we no longer seem to even care about the most fundamental basics of the data we’re analyzing?
Most tellingly, many of the responses I received to my 2015 Twitter analysis did not come from researchers describing how they would adjust their analytic workflows to accommodate Twitter’s massive changes. Instead, they came from data scientists working at prominent companies and government agencies, and even leading academics, arguing that social media platforms were so influential that it no longer mattered whether our results were actually “right”; what mattered was merely that an analysis had the word “Twitter” or “Facebook” somewhere in the title.
The response thus far to this week’s study suggests little has changed.
In the end, it seems we no longer actually care what our data says or whether our results are actually right.