HR, Big Data and the trap of simplification

January 14, 2019

How disregarding uncertainty in your data can lead to costly mistakes

These days, we can observe a general trend towards more data-driven decision-making in firms. This trend also applies to areas and functions that have traditionally been less “quantitative” such as human resources.

Picture of a laptop that shows different statistics
More and more decisions in companies are made on the basis of data. A trend that continues. Photo © CC0 Licence

However, the HR function is catching up quickly. A number of big data and analytics applications are available, and firms are also experimenting with recruiting robots based on Facebook Messenger and with automated screening of candidates for training. Yet I regularly observe widespread knowledge gaps among HR professionals with regard to some basic statistical principles. From my perspective, one of the most important is how to deal with uncertainty in (HR) data. In particular, I observe that many managers, who often only have data from a sample and not from the entire population, make the mistake of treating their sample as if it were the population. Overlooking this point can lead to wrong decisions and potentially costly mistakes.

The “uncertainty problem” of samples

Usually, HR managers are interested in data on their whole organization, for example, data on all project teams of the firm, all employees, or all current expats. However, while we might want to collect and analyze data on all of those people, teams, or expats, data collection can be costly and time-consuming. Therefore, HR managers often have to collect data from only a subset of these groups, i.e. a “sample”. Put simply, sampling means that you do not use all available information.

Picture of a single token standing alone next to a group of tokens
In a sample analysis, not all available information is used. Photo © CC0 Licence

The common approach is to run some statistics on the available data and to use the results as good “estimates” for the whole population. For example, let’s assume a firm has 300 expats and has collected data on a sample of 100 of them. This year, the average satisfaction with the firm’s pre-appointment expat preparation courses was rated 3.8 out of 5 points (5 representing “very satisfied” and 1 representing “very unsatisfied”). During the last two years, when the firm collected data on all expats, it was 4.2 in both years. The HR manager might conclude that she has to change the prep courses because “it’s getting worse”! This is how it shows up in the chart:

[Chart: average expat satisfaction with the preparation courses, 4.2 in each of the last two years and 3.8 in this year’s sample]

Yet is this the right interpretation?

The problem with this conclusion is that it is based on the assumption that 3.8 in the sample represents 3.8 in the total population. This is not correct! If a different sample had been taken, the average satisfaction might have been 3.5 or 4.0 or even 4.5, simply because the HR manager happened, entirely by chance, to draw more or fewer disgruntled expats into the sample, and these influenced the average course rating. Thus, the sample only gives us an indication of how all 300 expats evaluate the course (it tells us “3.8” this year), and there is some uncertainty with regard to this 3.8: the true value could very well be different.
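To make this sampling variation tangible, here is a minimal simulation sketch. It assumes a made-up population of 300 ratings whose true average is exactly 4.2 (the individual ratings are invented for illustration and do not come from any real survey) and repeatedly draws random samples of 100 from it. The function name `simulate_sample_means` is hypothetical.

```python
import random
import statistics

def simulate_sample_means(population, sample_size, n_draws, seed=0):
    """Draw n_draws random samples (without replacement) and return each sample's mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_draws)
    ]

# Hypothetical population of 300 expat ratings on the 1-5 scale.
# The counts are an assumption, chosen so that the true average is 4.2.
population = [5] * 150 + [4] * 90 + [3] * 40 + [2] * 10 + [1] * 10

sample_means = simulate_sample_means(population, sample_size=100, n_draws=1000)

print("True population average:", statistics.mean(population))
print("Lowest sample average:  ", min(sample_means))
print("Highest sample average: ", max(sample_means))
```

Every sample comes from the same population, yet the sample averages differ from draw to draw. How widely they spread depends on how varied the individual ratings are and on the sample size.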

If managers do not take this uncertainty into account, they might over-react or under-react. For example, let’s assume that in reality all 300 expats have an average satisfaction level of 4.2, while our sample, entirely by chance, tells us it’s 3.8 because it happens to contain more unhappy expats. What would the conclusion be if we knew the true value? We would conclude that the course is still fine: it has the same satisfaction level as the entire group of expats in the last two years. However, if all 300 expats have an average satisfaction level of 3.6 (and our sample still shows 3.8), then we should be really worried that there is something wrong with the courses.

4 people hold a sign with a green symbol in front of their faces
If the satisfaction in the sample is high, this does not mean that it also applies to the entire population. Photo © CC0 Licence

So if a manager simply assumes that a statistic such as an average from a sample is the same in the total population, he or she might make mistakes. These mistakes can be under-reactions (the manager should change the prep course but does not) or over-reactions (the manager changes a well-functioning course although there is no clear evidence of a trend towards more dissatisfaction). Both are potentially costly and time-consuming mistakes.

The solution

In statistics, we call a result from a sample a “point estimate”. A point estimate by itself might be a good starting point for thinking about the total population (it is a good first guess), but it does not provide any information about how “good” the estimate is: it does not take uncertainty into account.

The good news is: if your sample was taken randomly, statistics can help us understand the error caused by using a sample instead of the full population. We will never know for sure what the true population value (the average satisfaction of the entire group) is until we actually collect data from the entire population. Yet, we can still deal with this issue using confidence intervals.

Confidence intervals can also be called “range estimates”. Unlike a point estimate, a range estimate provides a whole range of plausible values for the population. For our example above, instead of assuming that the 3.8 average of the sample is also 3.8 in the total population, we would compute the confidence interval. Then we would base our decision-making on a statement such as: we can be 95% confident that the true population average lies somewhere in the range between 3.6 and 4.0.
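As a rough sketch of how such a range estimate can be computed, the snippet below uses the common normal approximation: sample mean plus or minus 1.96 standard errors. The 100 individual ratings are invented for illustration (chosen so that their average is 3.8), and the function name `mean_confidence_interval` is hypothetical. The sketch also ignores the finite population correction, which would narrow the interval somewhat when 100 out of only 300 expats are sampled.

```python
import math
import statistics

def mean_confidence_interval(ratings, z=1.96):
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    n = len(ratings)
    mean = statistics.mean(ratings)
    # Standard error: sample standard deviation divided by the square root of n.
    standard_error = statistics.stdev(ratings) / math.sqrt(n)
    return mean - z * standard_error, mean + z * standard_error

# Hypothetical sample of 100 expat ratings (counts are an assumption; average is 3.8).
sample = [5] * 30 + [4] * 36 + [3] * 22 + [2] * 8 + [1] * 4

low, high = mean_confidence_interval(sample)
print(f"Point estimate: {statistics.mean(sample):.1f}")
print(f"Range estimate: {low:.1f} to {high:.1f}")
```

With these made-up ratings, the interval comes out at roughly 3.6 to 4.0. A sample with more varied ratings, or with fewer respondents, would produce a wider range.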

A man is showing thumbs up
Confidence intervals can be used to calculate the range in which a value is most likely to be located. Photo © CC0 Licence

The point about the confidence interval is that our conclusions from the data become very different: we moved from a simple point estimate (the satisfaction of all expats is 3.8) to a range estimate (it is quite likely that expat satisfaction lies somewhere in the range between 3.6 and 4.0) and, therefore, we might make a different decision. In this case, we could conclude that the difference between 4.2 and the still plausible 4.0 of this year is not big enough to justify redesigning the course.

To sum it up, by taking random samples and computing range estimates instead of point estimates, we acknowledge that our estimate of the population is to some degree uncertain and we are better equipped to avoid costly under- or overreactions.

Prof. Phillip Nell also wrote an article about how politics can be a risk factor for businesses. Read it here.
