
The Misleading Power of Correlation

Have you ever heard a phrase like this? “A new Nicolas Cage movie just came out, so the number of people who drown in swimming pools is about to rise.” Probably not, and if you did hear it from a friend… well, you might have asked yourself a few questions about their mental state. Yet, looking at the graph below, based on real data, your friend might actually be right.

[Chart: number of movies featuring Nicolas Cage per year vs. number of swimming-pool drownings, correlation 67%]

What does this graph tell us? Let’s start with the number highlighted in the title, known as the correlation index. The linear correlation index measures how closely two variables move together: an index of 100% means that as one variable increases, the other increases in exact proportion. For the two variables shown in the graph (the number of people who drown in swimming pools and the number of movies featuring Nicolas Cage), the index is 67%, a rather high value. This means the two variables have moved almost in sync over time, and we can say there is a strong correlation between them.
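As a quick sketch of how such an index is computed, here is the Pearson correlation coefficient calculated on two made-up yearly series (the numbers are illustrative, not the actual data behind the chart):

```python
import numpy as np

# Hypothetical yearly counts (illustrative only; not the data behind the chart):
# films featuring Nicolas Cage and swimming-pool drownings over 11 years.
cage_films = np.array([2, 2, 2, 3, 1, 1, 2, 3, 4, 1, 4])
drownings = np.array([109, 102, 102, 98, 85, 95, 96, 98, 123, 94, 102])

# Pearson linear correlation coefficient: np.corrcoef returns a 2x2 matrix,
# and the off-diagonal entry is the correlation between the two series.
r = np.corrcoef(cage_films, drownings)[0, 1]
print(f"correlation index: {r:.0%}")
```

The coefficient always lies between −100% and +100%; values near either extreme indicate that the two series move almost in lockstep.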

So, where’s the mistake? The mistake in the initial statement lies in assuming a cause-and-effect relationship (causation) between the two variables. The old adage that every statistics student has read at least once in their life states: “Correlation does not imply causation.” Or, in other words, correlation is a necessary but not sufficient condition for causality.

It sounds like a trivial statement, but it’s not as obvious as we might think. In the example above, only a madman would imagine a cause-and-effect connection between the two variables, but the point applies to more realistic cases too: no correlation measure in any statistical study ever assumes a causal hypothesis. Every correlation index provides a mere numerical result, and it’s up to us to establish a cause-and-effect relationship based on the logic of the facts or on explicit assumptions.

As incredible as it may sound, there are many pairs of variables that are almost perfectly correlated (correlation coefficients above 90%) yet have no logical connection. Here are some real examples from the United States in recent years:

  • Per capita mozzarella consumption and PhDs in civil engineering (correlation 95%)
  • Per capita margarine consumption and divorce rate in Maine (correlation 99%)
  • Barrels of crude oil imported to the USA from Norway and drivers dying in car crashes with trains (correlation 95%)

In all these cases, the two variables are so completely disconnected that the high correlation is undoubtedly due to pure chance, and we can confidently rule out any cause-and-effect relationship.

But what about other cases? In other cases, we face situations where seemingly inexplicable correlations are actually phenomena of indirect correlations, often difficult to interpret. Consider the following pairs of variables:

  • Ice cream consumption and shark attacks
  • Air traffic density and spending on cultural activities
  • Number of bars in a city and the number of children enrolled in school

Obviously, these are not direct correlations: in none of these cases does the first variable (A) cause the second (B), or vice versa. On closer inspection, though, the two variables are not entirely unrelated: both are linked to a third, latent or unmeasured variable (variable C), which gives rise to what is called a “spurious correlation.” Any ideas? Think of these variables:

  • Average temperature (both phenomena are more frequent in summer)
  • Per capita income (both phenomena are more likely in cities with a higher average income)
  • Population size (both phenomena are related to the number of people in the city)

With these “hidden” variables, we can solve the mystery of the inexplicable correlations: even though A and B aren’t directly connected, variable A is linked to C (the latent variable), and variable C is linked to B.
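This mechanism is easy to reproduce in a simulation. In the sketch below (all numbers made up), A and B are each driven by a latent variable C plus independent noise, and never by each other; they still come out strongly correlated:

```python
import numpy as np

rng = np.random.default_rng(42)

# Latent variable C: daily average temperature over one year (made up).
temperature = rng.uniform(5, 35, size=365)

# A and B each depend on C plus independent noise, but never on each other.
ice_cream_sales = 10 * temperature + rng.normal(0, 30, size=365)
shark_attacks = 0.3 * temperature + rng.normal(0, 1.5, size=365)

# Despite having no direct link, A and B come out strongly correlated,
# because hot days push both series up at the same time.
r = np.corrcoef(ice_cream_sales, shark_attacks)[0, 1]
print(f"corr(A, B): {r:.2f}")
```

Controlling for C (for example, by computing the correlation at a fixed temperature) would make the apparent link between A and B largely disappear.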

Obviously, depending on the complexity of the study, it can be very difficult to understand whether a high correlation index is due to a cause-and-effect phenomenon, a spurious correlation, or neither. The key point is the need to carefully interpret every correlation index to avoid drawing totally wrong conclusions.

So, what’s the takeaway from these examples? I’d say it’s a kind of demystification of “numbercracy,” the idea that numbers have the power to explain reality uncritically. Numbers and statistical indices are useful, indeed incredibly useful, for understanding real-world phenomena, but they always require interpretation and critical judgment before being accepted as dogma and conveying a potentially incorrect meaning.

But just to be safe, the next time a Nicholas Cage movie comes out… stay away from the pools! 🙂

The Cake Paradox

A few days ago at work, I found myself in one of those situations where numbers behave in a counterintuitive way. A situation where an apparently logical reasoning leads to incorrect conclusions—what I call the cake paradox.

I brought up cakes because I used them to explain (with difficulty) to my colleagues where the trick was.

Let’s go over everything with a similar example (made-up data for illustrative purposes):

  • “Leggo” is an Italian publishing company active in two markets: Italy and China.
  • In Italy, “Leggo” is the market leader, with a 50% market share in 2015 (meaning, for every 100 books sold, 50 are published by “Leggo”).
  • In China, “Leggo” has a 5% market share in 2015.
  • Given the size differences of the markets, “Leggo” has a total market share of 9.1% (aggregating data from Italy and China, with calculations available in the table below).
  • In 2016, “Leggo” increases its sales and manages to steal market share from its competitors in both Italy and China.
    • In Italy, it goes from 50% to 52% market share.
    • In China, it goes from 5% to 6%.

Even though “Leggo” performed better than its competitors in both markets, the overall market share (China + Italy) decreased from 9.1% to 8.8%.
Does something not add up?

How is it possible that even though “Leggo” is doing very well, increasing its sales in both markets, stealing market share both in Italy and China, its total market share is decreasing?

Try it for yourself; all the calculations are in the table:

|                      | Italy | China   | Total   |
|----------------------|-------|---------|---------|
| 2015 book market     | 100 m | 1,000 m | 1,100 m |
| 2015 “Leggo” sales   | 50 m  | 50 m    | 100 m   |
| 2015 market share    | 50%   | 5%      | 9.1%    |
| 2016 book market     | 102 m | 1,600 m | 1,702 m |
| 2016 market growth   | 2%    | 60%     | 54.7%   |
| 2016 “Leggo” sales   | 53 m  | 96 m    | 149 m   |
| 2016 market share    | 52%   | 6.0%    | 8.8%    |
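The arithmetic behind the table can be checked in a few lines. This is a minimal sketch using the same made-up numbers (sales and market sizes in millions of books):

```python
# (sales, market size) in millions of books; same made-up numbers as the table.
markets_2015 = {"Italy": (50, 100), "China": (50, 1000)}
markets_2016 = {"Italy": (53, 102), "China": (96, 1600)}

def total_share(markets):
    # Aggregate share: total sales divided by total market size.
    sales = sum(s for s, _ in markets.values())
    size = sum(m for _, m in markets.values())
    return sales / size

share_2015 = total_share(markets_2015)  # 100 / 1100 -> about 9.1%
share_2016 = total_share(markets_2016)  # 149 / 1702 -> about 8.8%

# Share rises in each market individually...
assert 53 / 102 > 50 / 100 and 96 / 1600 > 50 / 1000
# ...yet the aggregate share falls.
assert share_2016 < share_2015
print(f"{share_2015:.1%} -> {share_2016:.1%}")
```

The aggregate share is a weighted average of the two market shares, and between 2015 and 2016 the weights shift heavily toward China, where “Leggo” is weakest.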

The paradox is explained by the two different markets: Italy and China. Italy is much smaller than China, and China is growing at a much higher rate than Italy.

Forget the example of books with realistic numbers; let’s abstract the concept by thinking of the cakes mentioned earlier.

We have two cakes of similar size, one white and one black. Initially, we own 80% of the white cake and only 20% of the black cake.
In total, we own the equivalent of one entire cake (meaning our market share will be 50%, one cake out of two).

Suddenly, the black cake rises and becomes 100 times the size of the white cake.
Our total market share drastically drops from the previous 50% to a number close to 20%, the market share of the huge cake (exactly 20.6%).

Now it should be easier to understand how our market share, previously the simple average of 20% and 80% (since the cakes were the same size), will now sit much closer to the 20% we hold of the black cake, which is 100 times larger than the white one.

In practice, our dominant position in the white cake, which used to balance out the black cake (since they were of similar size), now barely matters in a situation where the black cake represents almost the entire market.
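The cake version of the paradox reduces to a single formula: the total share is a size-weighted average of the per-cake shares. A minimal sketch:

```python
def blended_share(shares_and_sizes):
    # Total share is a size-weighted average of the per-cake shares:
    # sum of (share * size) divided by the sum of sizes.
    owned = sum(share * size for share, size in shares_and_sizes)
    total = sum(size for _, size in shares_and_sizes)
    return owned / total

# Two equal cakes: 80% of one, 20% of the other -> 50% overall.
before = blended_share([(0.80, 1), (0.20, 1)])

# The black cake grows to 100x the white one -> overall share near 20.6%.
after = blended_share([(0.80, 1), (0.20, 100)])

print(f"{before:.1%} -> {after:.1%}")
```

As the black cake's weight grows, the blended share converges toward the 20% we hold of it, no matter how dominant we are on the white cake.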

The Cake Paradox shows us how it can be detrimental for a company to hold a leadership position in a slow-growth market. Even if it outperforms its competitors in emerging markets, it can still lose market share overall!

The Uselessness of Absolute Numbers

A few days ago, I was reading an article about accidents involving cyclists. Being an avid cyclist and dealing with data and numbers of all kinds every day, I immediately noticed this sentence: “The regions most affected by accidents are those where bicycles are a real tradition: Lombardy, Veneto, Emilia Romagna, and Tuscany. Incidents tend to occur on Saturdays and Sundays, between 10 AM and 12 PM, during the months of May to October, with a peak in August.”

What seems odd to you?

After reflecting for a moment, it’s clear that the regions, days, and times when accidents are most frequent are simply those when cyclists are most frequently on the road. This is a type of error or oversight that’s fairly common among journalists, who, being less familiar with numbers, often report data without critically analyzing it.

Saying that accidents happen more frequently on Saturdays and Sundays doesn’t provide any useful information, because those are the two days when cyclists are on the roads the most, and thus are also the days with the highest risk of accidents (the same logic applies to the most popular months and times of day). In this case, the raw data doesn’t make any sense unless it’s somehow “cleaned up.”

What needs to be done in such cases is to find a benchmark against which to compare the results. In the case of our article, a simple benchmark could be the ratio between the number of accidents and the number of cyclists on the road that day. In other words, instead of looking at the absolute number of accidents, we look at the relative number. By doing this, we give each day of the week the same “probability” of being the most dangerous day, removing the natural advantage that days like Saturday or Sunday have due to the higher number of cyclists.

Let’s take a look at the table below (note that the numbers are fictional):

| Day | Number of accidents | Number of cyclists | Ratio |
|-----|---------------------|--------------------|-------|
| Mon | 10                  | 1,000              | 1.0%  |
| Tue | 15                  | 2,000              | 0.8%  |
| Wed | 10                  | 1,500              | 0.7%  |
| Thu | 15                  | 1,000              | 1.5%  |
| Fri | 20                  | 3,000              | 0.7%  |
| Sat | 40                  | 8,000              | 0.5%  |
| Sun | 60                  | 10,000             | 0.6%  |

If we consider the absolute number of accidents, Sunday is the most dangerous day with 60 accidents. However, if we use the correct benchmark, dividing the number of accidents by the number of cyclists on the road, the most dangerous day in relative terms becomes Thursday, with a ratio of 1.5%.
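The same comparison can be written as a short script (same fictional numbers as the table):

```python
# Fictional numbers from the table above.
accidents = {"Mon": 10, "Tue": 15, "Wed": 10, "Thu": 15,
             "Fri": 20, "Sat": 40, "Sun": 60}
cyclists = {"Mon": 1_000, "Tue": 2_000, "Wed": 1_500, "Thu": 1_000,
            "Fri": 3_000, "Sat": 8_000, "Sun": 10_000}

# Benchmark: accidents per cyclist on the road that day.
ratios = {day: accidents[day] / cyclists[day] for day in accidents}

most_absolute = max(accidents, key=accidents.get)  # day with most accidents
most_relative = max(ratios, key=ratios.get)        # day with highest risk
print(most_absolute, most_relative)  # Sun Thu
```

Ranking by the raw count picks Sunday; ranking by the benchmarked ratio picks Thursday, which is the day an individual cyclist actually runs the highest risk.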

A similar example of this concept is found in marketing: the effectiveness of an advertising campaign targeted at a certain group of people is evaluated not only by looking at the absolute value, but by comparing it to the results of a “control group” – a group of customers who are not exposed to the campaign. Only by “relativizing” the absolute results can we determine whether the campaign was effective or not.

So, when you come across statistics and conclusions like this, pay attention to the data and ask yourself whether they’ve been properly analyzed or not, because otherwise, they might not make any sense.

And anyway, to stay safe and avoid ending up in articles like this, be careful around cars when you go out cycling!

The Sad Story of the Inductivist Turkey

It’s Christmas dinner, an allegory of abundance and a stage for opulence. Your neighbor at the table, probably a fourth cousin whose name you barely remember, is starting to show signs of giving up and is desperately seeking your complicit gaze. But with feigned nonchalance and reckless boldness, you act as if you’re still hungry, even though the amount of food you’ve just consumed could satisfy the caloric needs of the entire province of Isernia. Then, as the third hour of dinner strikes, a new, succulent course is brought out: a stuffed turkey.

At that moment, in a fleeting pang of consciousness – typically left at home during such occasions (otherwise, how else could one explain such an absurd amount of food?) – you wonder about the story behind the turkey in front of you.

This turkey lived on a farm where, from day one, it was fed regularly. The turkey noticed that food was brought every day at the same time, regardless of the season, weather, or other external factors.

Over time, it began to derive a general rule based on repeated observation of reality. It began to embrace an inductivist worldview, collecting so many observations that it eventually made the following assertion:

“Every day at the same time, they will bring me food.”

Satisfied and convinced by its inductivist reasoning, the turkey continued to live this way for several months. Unfortunately for the turkey, its assertion was spectacularly disproven on Christmas Eve when its owner approached as usual, but instead of bringing food, he slaughtered it to serve at the very Christmas dinner you are attending.

The Turkey and Inductivism

This sad story is actually a famous metaphor developed by the British philosopher Bertrand Russell in the early 20th century. It clearly and simply refutes the idea that repeated observation of a phenomenon can lead to a general assertion with absolute certainty. The story of the inductivist turkey is often linked to Russell’s opposition to the ideas of the Vienna Circle’s neopositivists, who placed unconditional trust in science (particularly inductivism) and regarded it as the only possible means of acquiring knowledge.

The turkey’s example was later adopted by Austrian philosopher Karl Popper, who used it to support his principle of falsifiability. According to this theory—one of the 20th century’s most brilliant—science progresses through deductions that are never definitive and can always be falsified, meaning disproven by reality. There is no science if the truths it produces are immutable and unfalsifiable. Without falsifiability, there can be no progress, stimulation, or debate.

What Does This Mean for the Turkey?

Returning to the turkey’s situation, does this mean it’s impossible to draw conclusions based on experience? Of course not. The study of specific cases helps us understand the general phenomenon we’re investigating and can lay the groundwork for developing general laws. However, the truth of any conclusion we reach is never guaranteed. In simpler terms, if a flock of sheep passes by and we see 100 white sheep in a row, that doesn’t mean the next one will also be white. From an even more pragmatic perspective, no number of observations can guarantee absolute conclusions about the phenomenon in question.

Implications for Statistics and Inference

Statistics, and particularly inference—a core component of statistics—derive their philosophical foundations from this concept. The purpose of inference is to draw general conclusions based on partial observations of reality, or a sample.

For example, let’s say we want to estimate the average number of guests at a Christmas dinner. How would we do that? Let’s set aside the turkey for a moment, put down our forks and knives, and imagine we have a sample of 100 Christmas dinners where we count the number of guests. Thanks to the fundamental theorems of statistics (the Law of Large Numbers, together with the Central Limit Theorem), we can assert that the average number of guests observed in our sample is a correct estimate of the true population mean (provided the sample is representative and unbiased, but that’s a topic for another day). Moreover, the error in this estimate decreases as the sample size increases. In other words, the more dinners we include in our sample, the more robust and accurate the estimate becomes. Logical, right?

But how certain are we that our estimate is correct? Suppose we’ve determined that the average number of guests across 100 dinners is 10. From this observation, we can also calculate an interval within which the true average is likely to fall. With a sample of 100 units, we can assert with a certain level of confidence (typically 95%) that the true average number of guests is between 7 and 13. With a sample of 200 units, our estimate becomes more precise, narrowing the interval to 8 and 12. The larger the sample, the more accurate the estimate.
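The interval calculation can be sketched with the standard large-sample formula, mean ± 1.96 · s/√n. The standard deviation of 15 below is an assumption, chosen so that the intervals roughly match the ones quoted above:

```python
import math

def ci95(mean, std, n):
    # 95% confidence interval for a mean: mean +/- 1.96 * std / sqrt(n).
    half = 1.96 * std / math.sqrt(n)
    return mean - half, mean + half

# Assumed sample statistics: mean of 10 guests, standard deviation of 15.
lo100, hi100 = ci95(10, 15, 100)  # about (7.1, 12.9)
lo200, hi200 = ci95(10, 15, 200)  # about (7.9, 12.1) -- narrower
```

Doubling the sample size shrinks the interval by a factor of √2, which is why the estimate tightens but never collapses to a single certain value.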

Absolute Confidence? The Turkey’s Warning

These estimates are valid with a 95% confidence level. But what if we wanted 100% confidence? Would it be possible? Here’s where our inductivist turkey makes its comeback. If we wanted 100% confidence, we would fall into the same trap as the turkey—attempting to draw conclusions with absolute certainty from a series of observations. As we’ve seen, at the turkey’s expense, this is impossible. The explanation is simple: even with a large and representative sample, it’s never possible to completely eliminate the influence of chance. There’s always a small probability that we’ll encounter an observation—like a Christmas dinner with more or fewer guests than our confidence interval predicts—that contradicts our estimates.

Thus, what statistics can offer in such cases is a very robust estimate of the parameter we’re studying (instead of the number of dinner guests, think about something more critical, like the average income of a population, the efficacy of a drug, or election polls). However, it can never provide absolute certainty about a phenomenon. This is because the world we live in is not deterministic but is partly governed by chance. In this sense, statistics is a science that demonstrates the “non-absoluteness” of other sciences, which is perhaps why it is often feared or disliked.

After all, statistics reached its peak development in the 20th century, the century of relativism—think of Einstein’s theory of relativity, Heisenberg’s uncertainty principle, or Popper’s criterion of falsifiability.

Now, it’s time to eat the turkey before it gets cold!

© 2025 Datastory.it
