This article first appeared in DNA India on 3rd September, 2016.

One of the classic mistakes that people make when dealing with data is not asking the right questions. It doesn’t matter how much data one has if the right questions aren’t asked. Incorrect inferences drawn from data have sunk more companies and ships than perhaps any other cause. The Titanic received multiple messages from other ships in the vicinity about floating icebergs. Among the many ways people say the disaster could have been averted (post facto of course, hindsight being 20-20 and all) is for the Captain to have asked whether the risk would go down considerably were the ship simply to stop in the water and proceed come daylight.

Why am I raking this up now? During one of the conferences that I attended, I heard a senior HR leader of a large firm triumphantly announce that they were taking data-based decisions in his company. Intrigued, I asked him to explain how they were doing that. He mentioned that a recent analysis of the data had shown that over 60% of the poor performers in his organization came from college X and hence they had decided that they were not going to hire from that particular college. It seems fairly logical at first glance, doesn’t it? An organization using data to make decisions – the kind of decisions that make people feel and look good. However, there was one issue with this particular decision: even with the best of intentions, the organization had not asked the right question of the data that it did have.

The question that the organization had asked was: **What is the probability that the employees were from college X if their performances were poor?** However, the question they should have asked before deciding to stop recruiting from the given college was: **What is the probability that employees would perform poorly if they were from college X?** While it could sound slightly confusing to start with, a little bit of thought shows that these two are linked but different questions. These probabilities are known as conditional probabilities because, as the name suggests, they are dependent on another event happening. However, the probability of an event A occurring given that another event B has occurred is not necessarily the same as the probability of event B occurring given that event A has occurred. There is a whole branch of statistics, known as Bayesian statistics, built on the relationship between the two and named after the man whose theorem underpins it – the Reverend Thomas Bayes. The beauty of Bayesian statistics is that in the absence of information, we can start off with a prior estimate of the probabilities and then, as information becomes available, adjust our initial probabilities to better reflect the true probability.
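For readers who prefer symbols, the link between the two questions can be written with Bayes' theorem. The notation here is introduced only for illustration: $X$ stands for "the employee is from college X" and $\text{poor}$ for "the employee is rated a poor performer".

$$
P(\text{poor} \mid X) = \frac{P(X \mid \text{poor}) \, P(\text{poor})}{P(X)}
$$

The organization measured $P(X \mid \text{poor})$, but its hiring decision depends on $P(\text{poor} \mid X)$. In general the two differ unless the overall share of poor performers, $P(\text{poor})$, happens to equal the overall share of employees from college X, $P(X)$.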

Assuming that 5% of an organization are rated as poor performers at any given point in time, the answer to the question originally posed by the manager, the probability that a person was from a given college provided they were a poor performer, works out to around 60%. That seems incredibly high and would appear to justify dropping that college from any future recruitment plans. However, without going into the gory statistical details, the probability that a person from that same college becomes a poor performer is roughly 7.3%. While this is still more than double the percentage of people from other colleges who become poor performers (3.4%), it is nowhere near as alarming as the original scenario made it out to be.
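For the curious, the arithmetic behind these figures can be sketched in a few lines of Python using Bayes' theorem. The 5% and 60% values come from the article. The share of employees who come from college X is not stated, so the value of 41% below is an assumption, back-solved so that the result matches the quoted 7.3%. The function and variable names are purely illustrative.

```python
def bayes(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# Figures taken from the article.
p_poor = 0.05           # share of employees rated as poor performers
p_x_given_poor = 0.60   # share of poor performers who came from college X

# ASSUMPTION: share of all employees who came from college X.
# Not given in the article; chosen to be consistent with the quoted 7.3%.
p_x = 0.41

# The question that matters for hiring: P(poor | from college X).
p_poor_given_x = bayes(p_x_given_poor, p_poor, p_x)

# The same question for everyone else: P(poor | not from college X).
p_poor_given_other = bayes(1 - p_x_given_poor, p_poor, 1 - p_x)

print(f"P(poor | college X)      = {p_poor_given_x:.1%}")      # 7.3%
print(f"P(poor | other colleges) = {p_poor_given_other:.1%}")  # 3.4%
```

Under this assumption, college X supplies about four in ten employees, which goes a long way towards explaining why it also supplies most of the poor performers. A different value of `p_x` would give different percentages, so the sketch is only as good as that input.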

Hence, it is imperative that organizations recognize that while data-driven decision making is good, correct analysis of the given data ought to be a precursor to it. And correct data analysis can only be done if we ask the right questions of the data. To quote Ronald Coase, “If you torture the data long enough, it will confess”.