Five Types of Biased Data
Machine learning feels like black magic to most outsiders (myself included). Because the inner workings are difficult to understand, the models are hard to critique and assess. That’s problematic when so much of our world is governed by complex models built on opaque data sets.
Thanks to [this tweet], I came across a [great paper] that presents a framework for identifying different types of data bias present in machine learning. While the layperson won’t be able to grasp the complexities of a given model, they can understand the weaknesses of the data used to build and measure the model. And they can learn enough to ask intelligent questions of model builders.
The framework presents 5 different types of bias: Historical, Representation, Measurement, Aggregation, and Evaluation.
Historical bias arises when the world as it is leads a model to produce unwanted outcomes, even if the data is perfectly measured and sampled.
Historical bias comes from the state of the world, and cannot be addressed through better measurement or sampling. For example, an image search for the term “CEO” could return primarily male faces, since most Fortune 500 CEOs are men. Without intervention, stereotypes can be perpetuated and reinforced. Fortunately, in this example Google has modified their search algorithm to show more female faces in the results.
Representation bias occurs when a subgroup of the population being examined is underrepresented. This is sometimes called selection bias. There are several causes of this type of bias:
- Sampling methods are insufficient to reach the whole population (e.g. data from smartphones might exclude certain age groups).
- The population being sampled is different from the population being examined (e.g. extrapolating data from one city to a completely different city).
- The population has changed over time (i.e. data may no longer be representative of a population if enough time has passed).
This type of bias can cause problems for image classification models. If there is a bias towards a certain geographic area, the model may perform poorly when examining images sourced from underrepresented regions.
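One way to probe for representation bias is to compare subgroup shares in a data set against the shares you expect in the population being examined. A minimal sketch, using entirely made-up regional figures for illustration:

```python
from collections import Counter

# Hypothetical population mix the data set is supposed to represent
# (these shares are invented for illustration).
population_share = {"north_america": 0.30, "europe": 0.25, "asia": 0.35, "other": 0.10}

# Region labels attached to each training image (made-up sample).
sample_regions = (
    ["north_america"] * 70 + ["europe"] * 20 + ["asia"] * 8 + ["other"] * 2
)

counts = Counter(sample_regions)
total = len(sample_regions)

# Flag any subgroup whose sample share is less than half its population share.
for region, pop_share in population_share.items():
    sample_share = counts[region] / total
    flag = "UNDERREPRESENTED" if sample_share < 0.5 * pop_share else ""
    print(f"{region:15s} sample={sample_share:.2f} population={pop_share:.2f} {flag}")
```

The half-share threshold is arbitrary; the point is simply that the check is mechanical once you know which population the data is meant to stand in for.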
Certain types of data are difficult to gather, which leads to the use of proxies for ideal measures of a population (e.g. arrest rates standing in for crime rates). Measurement bias occurs when these proxies are created differently for different groups. It can arise when the granularity or quality of data varies across groups, or when the proxy is an oversimplification of the outcome it stands in for. For example, GPA is a poor approximation of student success.
Aggregation bias is caused by using the same model across a variety of different subgroups, specifically ones that behave differently with regards to what is being measured. A drug with side-effects that predominantly impact a certain ethnic group is a hypothetical example. Poor performance in a single subgroup may be masked by the performance in aggregate.
Evaluation bias occurs when the data used to evaluate a model does not represent the population. Compounding this is the desire for aggregate measures of model performance, which may hide a model’s underperformance for a particular subgroup. According to the paper, facial recognition algorithms have historically performed poorly when identifying dark-skinned women. In aggregate these algorithms may show acceptable performance, but for that subgroup performance is unacceptably low.
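Both aggregation and evaluation bias hide behind a single headline number, so the basic defense is the same: disaggregate the metric by subgroup. A minimal sketch with invented labels and predictions, where a respectable overall accuracy conceals a subgroup the model gets entirely wrong:

```python
# Made-up (subgroup, true_label, predicted_label) records for illustration.
records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 0, 1),
]

def accuracy(rows):
    """Fraction of rows where the prediction matches the true label."""
    return sum(t == p for _, t, p in rows) / len(rows)

overall = accuracy(records)  # looks acceptable in aggregate
by_group = {
    g: accuracy([r for r in records if r[0] == g])
    for g in {r[0] for r in records}
}

print(f"overall accuracy: {overall:.2f}")
for g, acc in sorted(by_group.items()):
    print(f"{g}: {acc:.2f}")
```

Here the aggregate accuracy is 0.80, but the model is perfect for one group and wrong on every example from the other. Only the per-group breakdown reveals it.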
The next time you encounter a complex model and want to probe for where it might break down, use this simple framework.