Data Scientists are primarily Data Janitors

This article first appeared in DNA India on 1st October, 2016.

You know how it is. Data science is considered to be one of the sexiest occupations in the world right now. Nary a week goes by without some report or the other talking up data scientists as some sort of super breed. There is much talk about how data scientists command inordinately large sums of money. Institutes queue up to offer analytics courses. Participants queue up to get certificates, any certificates that have the word “analytics” attached to them. I am sure all of this must be stoking the egos of all the data scientists out there. But let me let you in on a little secret. A data scientist’s job is not all that glamorous. A data scientist’s job consists, for the most part, of what, for lack of a better description, I would call garbage collection.

“Eh, what?” you ask. “But we thought you just finished telling us that the world is the data scientist’s oyster!” True, that’s what it looks like from the outside, but as I mentioned, the inside story is radically different. What do you think a data scientist does with much of her time? If you thought that this involved running cool algorithms, generating awesome visualizations and running through Matrix-like screens of scrolling numbers to identify patterns, that really isn’t the case. Over 80% (and sometimes as much as 90%) of a data scientist’s time is spent on the boring but important job of cleaning up the data. Once we have clean data, the job of running the appropriate algorithms and coming up with a meaningful insight appears almost trivial in contrast.

Still think I am kidding? Look at it this way. In any assignment, the organization furnishes data pulled together from multiple data sources. This is even more so for employee-related information, where data such as leave, payroll, benefits, sign-in/sign-out times, rewards and recognition, personal information and performance data are all usually kept in separate databases. Some (most?) of these could even exist as Excel sheets on some HR manager’s laptop. This implies that all this data has to be pulled together in some meaningful way. This, however, is beyond the remit of the database person or the HR manager pulling this information out of the various data sources. So who gets to deal with this jumbled mass? The poor data scientist.

So what kinds of cleanup does this data require? The most common task is to deal with any and all missing information; and believe you me, there is a bunch of information that is missing. This, however, is the easy part. The most challenging janitorial task is to clean up columns containing categorical variables. What are they? Let’s assume there was a column like “Rewards and Recognitions won”. In an ideal world, there would be a controlled vocabulary such that every entry in that column would take one of a fixed set of values. However, this is rarely the case. There are spelling mistakes galore. Moreover, different people type the same information in different ways. A “Walk the Extra Mile” (WTEM) award could be spelt “WTEM”, “Walk the Extra Mile”, “Walk the Xtra Mile” and so on. You get the drift. The data scientist needs to go through every such categorical column and impose a standardized vocabulary.
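
To make the idea of a controlled vocabulary concrete, here is a minimal sketch in Python using only the standard library. The vocabulary, the second award name, the helper names and the 0.8 similarity cutoff are all my own assumptions for illustration; a real column would need a vocabulary agreed with the HR team.

```python
import difflib

# Hypothetical controlled vocabulary: canonical label -> accepted spellings.
VOCABULARY = {
    "Walk the Extra Mile": ["Walk the Extra Mile", "WTEM"],
    "Star Performer": ["Star Performer"],
}


def build_alias_index(vocabulary):
    """Map every lower-cased alias to its canonical label."""
    index = {}
    for canonical, aliases in vocabulary.items():
        for alias in aliases:
            index[alias.lower()] = canonical
    return index


def standardize(raw, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the canonical label for a raw entry, or None if nothing is close."""
    index = build_alias_index(vocabulary)
    key = " ".join(raw.split()).lower()  # collapse whitespace, ignore case
    if key in index:
        return index[key]
    # Fall back to fuzzy matching for misspellings; the cutoff is an assumed value.
    close = difflib.get_close_matches(key, list(index), n=1, cutoff=cutoff)
    return index[close[0]] if close else None


entries = ["WTEM", "walk the extra mile", "Walk the Xtra Mile", "Employee of the Month"]
cleaned = [standardize(entry) for entry in entries]
```

The first three entries all map to “Walk the Extra Mile”, while the last one comes back as None. That is deliberate: anything the vocabulary does not recognize is better handed to a human than silently guessed at.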

Once done with this herculean task, the poor sap then has to look for columns that give identical information. The reason for this is that having variables that provide the same information tends to confuse matters when identifying patterns. On the simpler side, this involves looking for correlations between variables and removing those with high correlations to primary variables. At the other end, it could involve using techniques like Principal Components Analysis to identify combinations of variables that carry more information than the variables by themselves, or feature selection algorithms that identify the set of features with the greatest information content.
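
The simpler, correlation-based end of that spectrum might be sketched as follows. This is only an illustration: the column names, the numbers and the 0.95 threshold are made up, and treating whichever column comes first as the “primary” one is my assumption, not a rule.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation; treat as 0
    return cov / sqrt(var_x * var_y)


def drop_redundant(columns, threshold=0.95):
    """Keep a column only if it is not highly correlated with one already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical employee data; minutes_logged is just hours_logged times 60.
columns = {
    "hours_logged": [160, 150, 170, 155, 165],
    "minutes_logged": [9600, 9000, 10200, 9300, 9900],
    "leave_days": [2, 5, 1, 3, 4],
}
kept = drop_redundant(columns)
```

Here `minutes_logged` is dropped because it says nothing that `hours_logged` does not already say, while `leave_days` survives. Principal Components Analysis and feature selection algorithms need heavier machinery than this and are not shown.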

Another necessary step is identifying outliers. This is not as simple a task as one might think. For example, suppose the goal is to come up with a model to predict employee performance. Let us assume that there are two individuals who have consistently exceeded expectations over the last five years and stand apart from the rest of the crowd. Should we include them in the dataset? If the goal is to predict how most of the crowd would perform, then these two are outliers and ought not to be taken into account. Similarly, if there were two others who were consistently poor performers, they might be outliers on the other side of the bell curve. This is often a judgment call that the data scientist has to make, and it requires domain knowledge in addition to data mining skills. Once all this is done, the dataset can then be said to be in a manageable state for further computation.
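
For the outlier question, one common rule of thumb (my choice here, not the only option) is to flag anything more than 1.5 interquartile ranges beyond the quartiles. The sketch below uses invented employee IDs and scores, and it only flags candidates; whether to drop them remains the judgment call described above.

```python
import statistics


def flag_outliers(scores, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] as 'low' or 'high'."""
    q1, _, q3 = statistics.quantiles(list(scores.values()), n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    flagged = {}
    for employee, score in scores.items():
        if score < low_fence:
            flagged[employee] = "low"
        elif score > high_fence:
            flagged[employee] = "high"
    return flagged


# Hypothetical five-year average ratings on a 1-5 scale.
scores = {
    "E01": 3.0, "E02": 3.1, "E03": 2.9, "E04": 3.2,
    "E05": 3.0, "E06": 2.8, "E07": 3.1, "E08": 3.0,
    "E09": 4.9, "E10": 5.0,  # the two consistent star performers
    "E11": 1.0, "E12": 1.1,  # the two consistent poor performers
}
flagged = flag_outliers(scores)
```

With this data the two stars and the two strugglers are flagged and everyone else is left alone. Returning a list of candidates rather than deleting rows keeps the final decision with someone who knows the domain.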

So, dear reader, now that you know what a data scientist does with her time, do you still look with envy on the data science wunderkind? Isn’t she more to be pitied than censured?