This article first appeared in DNA India on 13th October, 2016.
A recent Harvard Business Review (HBR) interview on the misuse of algorithms got me thinking about whether the algorithms we use can reinforce our biases. Before I explain why I believe they can, I need to describe how machine learning algorithms classify datapoints into multiple classes. And to do that, we must first step back and understand what classification means.
Classification, as the name implies, is the task of separating a collection of objects into distinct groups. Let's say, for example, that an organization would like to build a prediction engine that can separate excellent performers from average and unacceptable ones. This is a classification task, since the algorithm must go through the list of employees and put each one in one of three buckets: excellent, average or unacceptable. To do this, data scientists would use a machine learning approach. Simply put, machine learning algorithms look for patterns within datasets and learn the patterns corresponding to each of the buckets into which they need to classify items.
Hence, using our example, the algorithm would learn the patterns for excellent, average and unacceptable performers. To do this, these algorithms require a lot of historical data: essentially, we need to “show” the algorithm examples of excellent, average and unacceptable performers. This process is known as “training” the algorithm. It is much like how we teach the alphabet to young children: we show them the letters again and again and correct them when they get one wrong, until they are able to recognize each character. This is exactly what happens when we “train” our algorithm, and it allows the algorithm to identify a “signature” for each of these buckets. When the algorithm is then presented with a new datapoint, it compares the new data against the learnt patterns, and if there is a lot of similarity between the two, it assigns the new datapoint to the corresponding bucket.
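For readers who like to see things concretely, the training-and-prediction loop described above can be sketched in a few lines of code. This is a toy illustration, not a production method: the nearest-centroid approach, the two hypothetical features (say, an output-quality score and an attendance score) and all the numbers are invented purely for the example.

```python
# A minimal sketch of "training" a three-bucket classifier using a
# nearest-centroid approach in plain Python. All feature values and
# labels below are hypothetical, for illustration only.

def train(examples):
    """Learn one "signature" (the average feature vector) per bucket."""
    sums, counts = {}, {}
    for features, label in examples:
        counts[label] = counts.get(label, 0) + 1
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}

def classify(signatures, features):
    """Assign a new datapoint to the bucket whose signature it most resembles."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(features, signatures[label]))
    return min(signatures, key=distance)

# Historical examples: (features, bucket) pairs the algorithm learns from.
history = [
    ([9.0, 9.5], "excellent"), ([8.5, 9.0], "excellent"),
    ([6.0, 7.0], "average"), ([5.5, 6.5], "average"),
    ([2.0, 3.0], "unacceptable"), ([3.0, 2.5], "unacceptable"),
]
signatures = train(history)
print(classify(signatures, [8.8, 9.2]))  # → excellent
```

Real systems use far richer features and more sophisticated algorithms, but the shape is the same: learn patterns from labelled history, then match new datapoints against those patterns.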
So far so good. So what can go wrong, you ask? Remember I told you that in order to “train” the algorithm, we need to provide it with a lot of examples to learn from? Well, think about where those examples come from. Let us assume that we are looking at ABC Inc., a traditional, conservative business based out of Bangalore. Most of the people employed at ABC come from the southern part of the country. Moreover, given the conservative attitudes of their managers, women are under-represented in the organization in general and at higher levels in particular. ABC now wants to build a model of the best people to hire, and for this they have made available to us their historical employee database. And this is where the problem lies. All their biases are part of their historical dataset, so any model trained on this dataset will inherit those biases. As an extreme case, if we now have a woman applicant from Delhi, she might well get filtered out by the algorithm, since she doesn't fit the typical profile the model has been trained on. While this is a gross simplification of the issue, it does illustrate the pitfalls of building models on historical data without taking into account the biases inherent in that data.
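The ABC Inc. scenario can be made concrete with a deliberately naive model. Everything here is invented for illustration: the crude 0/1 encodings for region and gender, the skill scores and the past hiring decisions are all hypothetical, and a real HR model would be far more complex, but the mechanism by which bias is inherited is the same.

```python
# A deliberately naive 1-nearest-neighbour hiring model, showing how a
# model trained on a biased history inherits that bias. All records and
# encodings are hypothetical, matching the ABC Inc. example in the text.

def nearest_label(history, candidate):
    """Return the outcome recorded for the most similar past applicant."""
    def distance(record):
        features, _ = record
        return sum((a - b) ** 2 for a, b in zip(features, candidate))
    _, label = min(history, key=distance)
    return label

# Features: (skill_score, is_from_south, is_male) - crude 0/1 encodings.
# The history reflects ABC's biases: past hires are overwhelmingly
# southern men, and strong outsiders were rejected anyway.
history = [
    ((7.0, 1, 1), "hire"), ((8.0, 1, 1), "hire"), ((6.5, 1, 1), "hire"),
    ((7.5, 0, 0), "reject"), ((8.5, 0, 0), "reject"),
    ((4.0, 1, 1), "reject"),
]

# A highly skilled woman from Delhi: skill 9.0, not from the south, not male.
print(nearest_label(history, (9.0, 0, 0)))  # → reject
```

Note that her high skill score is not what sinks her: her nearest neighbours in the data are other well-qualified outsiders who were rejected, so the model faithfully reproduces the historical prejudice.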
Given that most organizations will have biases of one sort or the other, be they about gender, age or educational institutions, how can they build predictive models if these biases are implicitly baked into any model built on their data? There are no easy answers here. One approach is to use publicly available datasets, or to pool data with other organizations in the space, to obtain datasets that normalize the biases. Whatever the approach, though, any data scientist worth her salt ought to be aware of, and actively look for, these built-in biases before undertaking any model building.
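As a starting point for that kind of vigilance, a data scientist might run a simple audit before any modelling: compare outcome rates across groups in the training data itself. The field names and records below are hypothetical; this is one crude check among many, not a complete fairness analysis.

```python
# A crude pre-modelling bias audit: the fraction of positive outcomes
# per group in the training data. Field names and records are invented
# for illustration.
from collections import defaultdict

def outcome_rates(records, group_key, outcome_key, positive="hire"):
    """Fraction of positive outcomes per group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        if record[outcome_key] == positive:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}

employees = [
    {"gender": "M", "decision": "hire"}, {"gender": "M", "decision": "hire"},
    {"gender": "M", "decision": "hire"}, {"gender": "M", "decision": "reject"},
    {"gender": "F", "decision": "hire"}, {"gender": "F", "decision": "reject"},
    {"gender": "F", "decision": "reject"}, {"gender": "F", "decision": "reject"},
]
print(outcome_rates(employees, "gender", "decision"))
# A large gap between groups (here 0.75 vs 0.25) is a red flag worth
# investigating before any model is trained on this data.
```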