Calling bullshit in the name of big data!

This article first appeared in DNA India on July 27th, 2017.

We are inundated with information and data. Every day. Wherever we go. Whichever source of news we tap into. This incessant stream of data hits us, with the result that we are often left bewildered as to whether what we are reading is true. Imagine a typical click-bait headline: “Studies show that 63.5% of people who eat 2 carrots before dinner lost, on average, 1.5 kg more than those who don’t”. I don’t know about you, but when I look at something like this my “spider-senses” start tingling all over the place. What does this even mean? Which study? The studies are rarely ever mentioned, and even if they are, they are probably published in “The Statistical Journal of Voodoo”, printed in Timbuktu or some such place. And even when they do appear in reputable journals, once you read the actual article, you realise that the headline has nothing at all to do with the actual results in the paper.

This, then, is going to be the bane of our current century: how does one make sense of all this data? There is a wonderful course that two professors at the University of Washington have put together, called “Calling Bullshit in the Age of Big Data”, which is meant to arm individuals with the ways and means by which numerical skullduggery can be identified quickly.

But, it is not skullduggery alone that is the danger when dealing with data analysis. We can come to erroneous conclusions even from purely legitimate motives when faced with massive amounts of data. My doctoral professor once told me that given enough data and variables, she could model her own mother-in-law. While I sincerely doubt that anyone would want to do that, it did drive home the point to a somewhat green student trying to come to grips with mathematical modelling.

One of my favourite case studies when I teach deals with just this issue. The case revolves around a bank that commissions a consulting firm to conduct a survey. The idea is to figure out whether employee perceptions (of the organisation’s adherence to quality standards and customer orientation) truly have an impact on how customers perceive the organisation, and whether this, in turn, has an impact on the organisation’s productivity metrics.

The consultants carry out two separate surveys, one for the employees and one for the customers, and then collate all the data. Students are asked to go through the case and work on the accompanying data: aggregated responses to the various questions across some 120 branches.

As I discuss the case in class, I find that the students have all carried out similar exercises: looking for correlations between the different variables, and even identifying specific sets of employee responses that are correlated with the customer responses. The catch is that none of the correlation values exceeds 0.3.

What the students should have concluded at the end of the exercise is that there is not enough signal in the data to say definitively that employee perceptions correlate with customer perceptions. But because they have been given an abundance of data, they feel compelled to come up with some sort of an answer, demonstrating relationships that are tenuous at best.
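The danger is easy to demonstrate. Here is a minimal sketch (in Python, using entirely made-up random data, not the actual case data) showing that if you correlate enough pairs of completely unrelated survey questions across roughly 120 branches, some pairs will look “interesting” purely by chance — the question counts here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)

n_branches = 120        # roughly the number of branches in the case
n_employee_qs = 20      # hypothetical number of employee survey questions
n_customer_qs = 20      # hypothetical number of customer survey questions

# Pure noise: by construction, there is no real relationship here.
employee = rng.normal(size=(n_branches, n_employee_qs))
customer = rng.normal(size=(n_branches, n_customer_qs))

# Correlate every employee question with every customer question.
corrs = np.array([[np.corrcoef(employee[:, i], customer[:, j])[0, 1]
                   for j in range(n_customer_qs)]
                  for i in range(n_employee_qs)])

print(f"largest |r| among {corrs.size} pure-noise pairs: "
      f"{np.abs(corrs).max():.2f}")
```

Run this a few times with different seeds and you will routinely find correlations in the 0.2–0.3 range among the 400 pairs — exactly the magnitudes the students latch onto — even though every variable is random noise.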

So what do we do when we have masses of data? Stop looking too hard for patterns to exist. Sometimes they exist only in our imagination.
