We’re back!

After 8 long years, Datastory.it begins its second life. Then again, if David Lynch could wait over 25 years between seasons of Twin Peaks, our 8 years don’t seem so long.

Many things have changed during this time, but not our passion for crafting stories through data. We hope you’re just as eager to read them as we are to share them!

Our manifesto remains the same, driven by the same enthusiasm as in the beginning, but now enriched with even more stories, anecdotes, and experiences to tell. Plus, starting today, the site features an English section with all articles translated.

Sit back, relax and enjoy the read!

The Misleading Power of Correlation

Have you ever heard a phrase like this? “A new Nicolas Cage movie just came out, so the number of people who drown in swimming pools is about to rise.” Probably not, and if you did hear it from a friend… well, you might have asked yourself a few questions about their mental state. However, looking at the graph below – based on real data – your friend might actually be right.

[Chart: number of people who drown in swimming pools vs. number of films featuring Nicolas Cage, correlation 67%]

What does this graph tell us? Let’s start with that number highlighted in the title, known as the correlation index. The linear correlation index is a measure of how closely two variables move together: an index of 100% means that as one variable increases, the other increases in exact proportion. The index for the two variables shown in the graph (the number of people who drown in swimming pools and the number of movies featuring Nicolas Cage) is 67%, a rather high value. This means that the two variables have moved almost in sync over time, and we can say that there is a strong correlation between them.
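For readers who like to see the number come out of the data, here is a minimal sketch of how such a correlation index is computed in Python (the yearly figures below are invented placeholders, not the real series behind the chart):

```python
# Minimal sketch: computing the linear (Pearson) correlation index.
# The yearly figures are invented placeholders, only meant to show the mechanics.
import numpy as np

cage_films = np.array([2, 2, 2, 3, 1, 1, 2, 3, 4, 1, 4])               # films per year
pool_drownings = np.array([109, 102, 102, 98, 85, 95, 96, 98, 123, 94, 102])

r = np.corrcoef(cage_films, pool_drownings)[0, 1]
print(f"Linear correlation index: {r:.0%}")
```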

So, where’s the mistake? The mistake in the initial statement lies in assuming a cause-and-effect relationship (causality) between the two variables. The old adage that every statistics student has likely read at least once in their life states: “Correlation does not imply causality.” Or, in other words, correlation is a necessary but not sufficient condition for causality.

It seems like a trivial statement, but in reality, it’s not as obvious as we might think. In the example above, it’s clear that only a madman would imagine a cause-and-effect connection between the two variables, but the point holds in more realistic cases too: no correlation measure in any statistical study ever carries a causal hypothesis by itself. Every correlation index provides a mere numerical result, and it’s up to us to establish a cause-and-effect relationship based on the logic of the facts or certain assumptions.

As incredible as it may sound, there are many pairs of variables that are almost perfectly correlated (correlation coefficients above 90%) but have no logical connection. Here are some real examples from the United States in recent years:

  • Per capita mozzarella consumption and PhDs in civil engineering (correlation 95%)
  • Per capita margarine consumption and divorce rate in Maine (correlation 99%)
  • Barrels of crude oil imported to the USA from Norway and drivers dying in car crashes with trains (correlation 95%)

In all these cases, the two variables are so completely disconnected that the high correlation is undoubtedly due to the irony of chance, and we can confidently rule out any cause-and-effect relationship.

But what about other cases? In other cases, we face situations where seemingly inexplicable correlations are actually phenomena of indirect correlations, often difficult to interpret. Consider the following pairs of variables:

  • Ice cream consumption and shark attacks
  • Air traffic density and spending on cultural activities
  • Number of bars in a city and the number of children enrolled in school

Obviously, these are not direct correlations, since in all these cases, the first variable (A) and the second variable (B) are not directly related to each other. But upon closer inspection, we realize that the two variables are not entirely unrelated, but are both linked to a third, latent or unmeasured variable (variable C), which causes a phenomenon called “spurious correlation.” Any ideas? Think of these variables:

  • Average temperature (both phenomena are more frequent in summer)
  • Per capita income (both phenomena are more likely in cities with a higher average income)
  • Population size (both phenomena are related to the number of people in the city)

With these “hidden” variables, we can solve the mystery of the inexplicable correlations: even though A and B aren’t directly connected, variable A is linked to C (the latent variable), and variable C is linked to B.
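To make the mechanism concrete, here is a small simulation sketch (invented numbers, assuming numpy is available): A and B never influence each other, yet both depend on a hidden variable C, and a strong correlation between A and B appears anyway.

```python
# Spurious correlation sketch: A and B are generated independently of each other,
# but both are driven by the latent variable C (think of it as average temperature).
import numpy as np

rng = np.random.default_rng(0)
c = rng.normal(25, 8, size=365)                  # hidden variable C
a = 100 + 10 * c + rng.normal(0, 30, size=365)   # variable A (e.g. ice cream sales)
b = 2 + 0.3 * c + rng.normal(0, 2, size=365)     # variable B (e.g. shark attacks)

print(np.corrcoef(a, b)[0, 1])  # strongly positive, even though A never touches B
```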

Obviously, depending on the complexity of the study, it can be very difficult to understand whether a high correlation index is due to a cause-and-effect phenomenon, a spurious correlation, or neither. The key point is the need to carefully interpret every correlation index to avoid drawing totally wrong conclusions.

So, what’s the takeaway from these examples? I’d say it’s a kind of demystification of “numbercracy,” the idea that numbers have the power to explain reality uncritically. Numbers and statistical indices are useful, indeed incredibly useful, for understanding real-world phenomena, but they always require interpretation and critical judgment before being accepted as dogma and conveying a potentially incorrect meaning.

But just to be safe, the next time a Nicolas Cage movie comes out… stay away from the pools! 🙂

The cake paradox

A few days ago at work, I found myself in one of those situations where numbers behave in a counterintuitive way. A situation where apparently logical reasoning leads to incorrect conclusions—what I call the cake paradox.

I brought up cakes because I used them to explain (with difficulty) to my colleagues where the trick was.

Let’s go over everything with a similar example (made-up data for illustrative purposes):

  • “Leggo” is an Italian publishing company active in two markets: Italy and China.
  • In Italy, “Leggo” is the market leader, with a 50% market share in 2015 (meaning, for every 100 books sold, 50 are published by “Leggo”).
  • In China, “Leggo” has a 5% market share in 2015.
  • Given the size differences of the markets, “Leggo” has a total market share of 9.1% (aggregating data from Italy and China, with calculations available in the table below).
  • In 2016, “Leggo” increases its sales and manages to steal market share from its competitors in both Italy and China.
    • In Italy, it goes from 50% to 52% market share.
    • In China, it goes from 5% to 6%.

Even though “Leggo” performed better than its competitors in both markets, the overall market share (China + Italy) decreased from 9.1% to 8.8%.
Does something not add up?

How is it possible that even though “Leggo” is doing very well, increasing its sales in both markets, stealing market share both in Italy and China, its total market share is decreasing?

Try it for yourself; all the calculations are in the table:

                        Italy      China      Total
2015 book market        100 m      1,000 m    1,100 m
2015 “Leggo” sales      50 m       50 m       100 m
2015 market share       50%        5%         9.1%
2016 book market        102 m      1,600 m    1,702 m
2016 market growth      2%         60%        55%
2016 “Leggo” sales      53 m       96 m       149 m
2016 market share       52%        6.0%       8.8%
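If you prefer to let the computer redo the arithmetic, here is a minimal sketch that reproduces the aggregate market shares from the made-up figures in the table:

```python
# Reproducing the aggregate market shares from the made-up figures above.
italy_2015, china_2015 = 100, 1_000            # market sizes, millions of books
leggo_italy_2015, leggo_china_2015 = 50, 50    # "Leggo" sales, millions

italy_2016, china_2016 = 102, 1_600
leggo_italy_2016, leggo_china_2016 = 53, 96

share_2015 = (leggo_italy_2015 + leggo_china_2015) / (italy_2015 + china_2015)
share_2016 = (leggo_italy_2016 + leggo_china_2016) / (italy_2016 + china_2016)
print(f"2015: {share_2015:.1%}  ->  2016: {share_2016:.1%}")   # 9.1% -> 8.8%
```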

The paradox is explained by the two different markets: Italy and China. Italy is much smaller than China, and China is growing at a much higher rate than Italy.

Forget the example of books with realistic numbers; let’s abstract the concept by thinking of the cakes mentioned earlier.

We have two cakes of similar size, one white and one black. Initially, we own 80% of the white cake and only 20% of the black cake.
In total, we own the equivalent of one entire cake (meaning our market share will be 50%, one cake out of two).

Suddenly, the black cake rises and becomes 100 times the size of the white cake.
Our total market share drastically drops from the previous 50% to a number close to 20%, the market share of the huge cake (exactly 20.6%).
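Where does 20.6% come from? We own 0.8 of the small cake plus 0.2 of the big one, i.e. 0.8 × 1 + 0.2 × 100 = 20.8 “cakes” out of a total of 1 + 100 = 101, and 20.8 / 101 ≈ 20.6%.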

Now, it should be easier to understand how our market share, which was previously an average of 20% and 80% (since the cakes were the same size), will now be much closer to the 20% we hold of the black cake, which is 100 times larger than the white cake.

In practice, our dominant position in the white cake, which used to balance out the black cake (since they were of similar size), now barely matters in a situation where the black cake represents almost the entire market.

The Cake Paradox shows us how it can be detrimental for a company to hold a leadership position in a slow-growth market. Even if it outperforms its competitors in emerging markets, it can still lose market share overall!

The uselessness of absolute numbers

A few days ago, I was reading an article about accidents involving cyclists. Being an avid cyclist and dealing with data and numbers of all kinds every day, I immediately noticed this sentence: “The regions most affected by accidents are those where bicycles are a real tradition: Lombardy, Veneto, Emilia Romagna, and Tuscany. Incidents tend to occur on Saturdays and Sundays, between 10 AM and 12 PM, during the months of May to October, with a peak in August.”

What seems odd to you?

After reflecting for a moment, it’s clear that the regions, days, and times when accidents are most frequent are simply those when cyclists are most frequently on the road. This is a type of error or oversight that’s fairly common among journalists, who, being less familiar with numbers, often report data without critically analyzing it.

Saying that accidents happen more frequently on Saturdays and Sundays doesn’t provide any useful information, because those are the two days when cyclists are on the roads the most, and thus are also the days with the highest risk of accidents (the same logic applies to the most popular months and times of day). In this case, the raw data doesn’t make any sense unless it’s somehow “cleaned up.”

What needs to be done in such cases is to compare the results against a reference value (a “benchmark”). In the case of our article, a simple benchmark could be the ratio between the number of accidents and the number of cyclists on the road that day. This means that, instead of looking at the absolute number of accidents, we look at the relative number. By doing this, we give each day of the week the same “probability” of being the most dangerous day, removing the natural advantage that days like Saturday or Sunday have due to the higher number of cyclists.

Let’s take a look at the table below (note that the numbers are fictional):

Day    Number of accidents    Number of cyclists    Ratio
Mon    10                     1,000                 1.0%
Tue    15                     2,000                 0.8%
Wed    10                     1,500                 0.7%
Thu    15                     1,000                 1.5%
Fri    20                     3,000                 0.7%
Sat    40                     8,000                 0.5%
Sun    60                     10,000                0.6%

If we consider the absolute number of accidents, Sunday is the most dangerous day with 60 accidents. However, if we use the correct benchmark, dividing the number of accidents by the number of cyclists on the road, the most dangerous day in relative terms becomes Thursday, with a ratio of 1.5%.
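As a quick sanity check, here is a minimal sketch (same fictional numbers) that ranks the days both ways:

```python
# Absolute vs. relative ranking of the most dangerous day (fictional data).
accidents = {"Mon": 10, "Tue": 15, "Wed": 10, "Thu": 15, "Fri": 20, "Sat": 40, "Sun": 60}
cyclists  = {"Mon": 1_000, "Tue": 2_000, "Wed": 1_500, "Thu": 1_000,
             "Fri": 3_000, "Sat": 8_000, "Sun": 10_000}

ratios = {day: accidents[day] / cyclists[day] for day in accidents}
print(max(accidents, key=accidents.get))  # Sun: most accidents in absolute terms
print(max(ratios, key=ratios.get))        # Thu: most dangerous relative to traffic
```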

A similar example of this concept is found in marketing: the effectiveness of an advertising campaign targeted at a certain group of people is evaluated not only by looking at the absolute value, but by comparing it to the results of a “control group” – a group of customers who are not exposed to the campaign. Only by “relativizing” the absolute results can we determine whether the campaign was effective or not.

So, when you come across statistics and conclusions like this, pay attention to the data and ask yourself whether they’ve been properly analyzed or not, because otherwise, they might not make any sense.

And anyway, to stay safe and avoid articles like this, when you go out cycling, be careful around cars!

The Gambler’s Ruin Theorem

I’ve always been fascinated by gambling: the adrenaline generated by risk, the dream of a big win.

Since I first understood how roulette works, I’ve been struck by how “fair” this game is.

When you do the simple math, betting €1 on any number gives you a 1/37 chance of winning, and in the case of a win, the house pays 36 times the stake. Playing repeatedly, the house wins on average only 2.7% of the total amount bet (1/37). Blackjack is even more “fair.” Played by the rules, the house edge drops to as low as 0.5%.
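(In expected-value terms: every €1 bet on a single number returns 36 × 1/37 ≈ €0.973 on average, so the house keeps roughly 2.7 cents per euro wagered.)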

When we think of games more familiar to us, like the Lotto, a single number has a 1 in 18 chance of being drawn and pays 11.23 times the stake; since a fair payout would be 18 times, the house margin is about 38%. This margin doesn’t change whether the number is overdue or not, as Giovanni explained to us here.

The “Cinquina” (five-number match), however, is a real theft, with a margin of 86%. That is, for every €100 wagered, the house pays out only €14 in winnings on average (compared to €97.3 for roulette!). Similarly popular sports betting can have a margin of up to 40–50% when played as a parlay.

House Edge by Game Type

Game                                    House edge
Blackjack                               0.5-1.5%
Roulette                                2.7%
Slot machines                           3-10%
Sports betting* (single bet)            3-10%
Sports betting* (parlay)                20-50%
Lotto - single number (estratto)        38%
Lotto - cinquina (five-number match)    86%

*Indicative margins for sports betting.

After appreciating the low margins of casino games, I discovered something else that decisively tilts the odds in the house’s favor: the so-called “Gambler’s Ruin Theorem,” which highlights the significant difference between a casino’s capital and that of the player.

Using the chart, it’s easy to see how, even in a fair game, a player spending an evening at the casino is almost doomed to lose.

Let’s assume the player is willing to lose €15. The house, with far greater resources, only needs to wait for fortune’s fluctuations to push the player to the point of no return (in the chart, this happens on the 430th round). While the player’s balance oscillates around zero, the game ends when the player runs out of funds, whereas the house essentially has no loss limit.

The chart demonstrates this with a zero-margin game (perfect fairness). In the presence of even a small house edge, the player will reach the point of no return even faster.

The theorem can be summarized by this formula, calculating the probability of the player’s ruin in a fair game (zero margin), assuming play continues until either the house or the player runs out of capital:

Player’s Probability of Ruin = House Capital / (Player Capital + House Capital)

Even with modestly different amounts (Player = €15, House = €40), the odds are heavily skewed, with the player going broke in roughly 73% of cases (40/55). Considering the house usually has vastly greater capital than the player, the probability of ruin approaches 100%.
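If you want to check the formula empirically, here is a minimal Monte Carlo sketch (fair coin-flip bets of €1 per round, using the capitals above):

```python
# Gambler's ruin, simulated: fair game (zero margin), €1 per round,
# player starts with €15 and the house with €40; play until one side is broke.
import random

def player_is_ruined(player=15, house=40):
    while player > 0 and house > 0:
        if random.random() < 0.5:            # fair game: 50/50 each round
            player, house = player + 1, house - 1
        else:
            player, house = player - 1, house + 1
    return player == 0

trials = 20_000
ruin_rate = sum(player_is_ruined() for _ in range(trials)) / trials
print(f"Estimated probability of ruin: {ruin_rate:.1%}")   # close to 40/55 ≈ 73%
```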

At this point, it should be clear that the low house edge in casino games does not limit the house’s profitability.

In light of this, the next time you spend an evening at the casino, either bet all your capital in one go (and nearly break even) or simply enjoy the price of an entertaining evening!

Don’t trust the latecomers

When creating a blog, one of the first things to do is to come up with a name. Before choosing datastory.it, we considered several options, but some of them were already taken. On one of these sites, we came across a phrase that made our few remaining hairs stand on end. It went something like this:
“This site contains an algorithm capable of generating Lotto numbers that are more likely to be drawn than others.”

Such words sound to a statistician like a blasphemy sounds to a priest. Have you ever heard of “hot numbers” or “overdue numbers”? Surely you have. Well, we can guarantee you that these numbers are meaningless, and there is no algorithm capable of generating numbers more likely to be drawn than others. Let’s explore why.

The Lotto, or any similar game, consists of 90 balls, each with an equal probability of being drawn: 1/90, or about 1.1%. So far, so good. But what if you were told that after 100 draws, all the numbers from 1 to 90 had been drawn except one, say 27? Would it change your strategy? Would you bet on 27 because it’s overdue? The answer is an emphatic no because the probability remains the same for all numbers. There are at least two reasons for this.

Intuition – The balls in the drum are not influenced by previous draws. There is no reason why a ball should become larger, smaller, hotter, or colder depending on how many times it has been drawn (or not drawn) in the past. Each ball’s probability is always equal to that of the others, even if it hasn’t been drawn for a thousand consecutive rounds.

Probability Theory – The waiting time before a given number is drawn can be described by a random variable following a geometric distribution, which measures the probability that the first success occurs after a certain number of trials (each with the same probability of success—in this case, 1/90). The geometric distribution is memoryless: the probability that the number comes up on the next draw is always 1/90, no matter how many draws have already gone by without it. Thus, we arrive at the same conclusion: the drum has no memory.
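Here is a minimal simulation sketch of that memorylessness, treating a single number as a sequence of independent 1/90 trials (an assumption that mirrors the article’s setup, not the full five-number Lotto draw):

```python
# Memorylessness check: the chance of being drawn next is the same whether or not
# the number is "overdue" (here, not seen in the previous 100 draws).
import random

p = 1 / 90
draws = [random.random() < p for _ in range(500_000)]   # independent 1/90 trials

last_seen, hits, trials = -10**9, 0, 0
for i, drawn in enumerate(draws):
    if i - last_seen > 100:      # the number is overdue at this point
        trials += 1
        hits += drawn
    if drawn:
        last_seen = i

print(hits / trials)   # stays around 1/90 ≈ 0.011: being overdue changes nothing
```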

The “Hot Numbers” Misconception – Where do these so-called Lotto experts go wrong with their theories on overdue numbers? They often invoke the nebulous dogma known as the Law of Large Numbers. While this is a complex topic that will be discussed separately on this blog, the law essentially states that as the number of trials approaches infinity, the frequency of each outcome converges to its theoretical probability—in this case, 1/90. For example, after 90 million draws, each number will have been drawn approximately 1 million times.

However, this law does not imply that after a certain number of draws, the probability of a number increases because it has been drawn less often than others. The reasoning of those who promote overdue numbers would only hold true if there were a fixed, finite number of draws in which every number had to come out equally often. In such a case, if by the second-to-last draw one number had been drawn less frequently than the others, it would logically be the one drawn in the final round. But the Law of Large Numbers refers to an infinite number of trials, so with every draw, it’s as if everything resets: from that point on, the expected frequency of every number remains 1/90 for all future draws.

What we’ve discussed so far doesn’t just apply to the Lotto but extends to all situations involving independent repetitions of the same event, such as rolling dice or betting on numbers and colors in roulette. For example, if after 10 spins of a roulette wheel, red has come up every time, what do you expect on the next spin? If you think black is more likely, you’re deeply mistaken: the probability doesn’t change—it remains 50-50 for red and black (ignoring the green zero).

So, if you decide to play the Lotto or any similar game, don’t trust anyone urging you to bet on overdue numbers. Doing so would be no different than choosing numbers based on your birthdate or the last digits of your license plate.

The Sad Story of the Inductivist Turkey

It’s Christmas dinner, an allegory of abundance and a stage for opulence. Your neighbor at the table, probably a fourth cousin whose name you barely remember, is starting to show signs of giving up and is desperately seeking your complicit gaze. But with feigned nonchalance and reckless boldness, you act as if you’re still hungry, even though the amount of food you’ve just consumed could satisfy the caloric needs of the entire province of Isernia. Then, as the third hour of dinner strikes, a new, succulent course is brought out: a stuffed turkey.

At that moment, in a fleeting pang of consciousness – typically left at home during such occasions (otherwise, how else could one explain such an absurd amount of food?) – you wonder about the story behind the turkey in front of you.

This turkey lived on a farm where, from day one, it was fed regularly. The turkey noticed that food was brought every day at the same time, regardless of the season, weather, or other external factors.

Over time, it began to derive a general rule based on repeated observation of reality. It began to embrace an inductivist worldview, collecting so many observations that it eventually made the following assertion:

“Every day at the same time, they will bring me food.”

Satisfied and convinced by its inductivist reasoning, the turkey continued to live this way for several months. Unfortunately for the turkey, its assertion was spectacularly disproven on Christmas Eve when its owner approached as usual, but instead of bringing food, he slaughtered it to serve at the very Christmas dinner you are attending.

The Turkey and Inductivism

This sad story is actually a famous metaphor developed by Welsh philosopher Bertrand Russell in the early 20th century. It clearly and simply refutes the idea that repeated observation of a phenomenon can lead to a general assertion with absolute certainty. The story of the inductivist turkey dates back to a time when Russell opposed the ideas of the Vienna Circle’s neopositivists, who placed unconditional trust in science—particularly inductivism—and regarded it as the only possible means of acquiring knowledge.

The turkey’s example was later adopted by Austrian philosopher Karl Popper, who used it to support his principle of falsifiability. According to this theory—one of the 20th century’s most brilliant—science progresses through deductions that are never definitive and can always be falsified, meaning disproven by reality. There is no science if the truths it produces are immutable and unfalsifiable. Without falsifiability, there can be no progress, stimulation, or debate.

What Does This Mean for the Turkey?

Returning to the turkey’s situation, does this mean it’s impossible to draw conclusions based on experience? Of course not. The study of specific cases helps us understand the general phenomenon we’re investigating and can lay the groundwork for developing general laws. However, the truth of any conclusion we reach is never guaranteed. In simpler terms, if a flock of sheep passes by and we see 100 white sheep in a row, that doesn’t mean the next one will also be white. From an even more pragmatic perspective, no number of observations can guarantee absolute conclusions about the phenomenon in question.

Implications for Statistics and Inference

Statistics, and particularly inference—a core component of statistics—derive their philosophical foundations from this concept. The purpose of inference is to draw general conclusions based on partial observations of reality, or a sample.

For example, let’s say we want to estimate the average number of guests at a Christmas dinner. How would we do that? Let’s set aside the turkey for a moment, put down our forks and knives, and imagine we have a sample of 100 Christmas dinners where we count the number of guests. Based on a fundamental theorem of statistics known as the Central Limit Theorem, we can assert that the average number of guests observed in our sample is a correct estimate of the true population mean (provided the sample is representative and unbiased, but that’s a topic for another day). Moreover, the error in this estimate decreases as the sample size increases. In other words, the more dinners we include in our sample, the more robust and accurate the estimate becomes. Logical, right?

But how certain are we that our estimate is correct? Suppose we’ve determined that the average number of guests across 100 dinners is 10. From this observation, we can also calculate an interval within which the true average is likely to fall. With a sample of 100 units, we can assert with a certain level of confidence (typically 95%) that the true average number of guests is between 7 and 13. With a sample of 200 units, our estimate becomes more precise, narrowing the interval to 8 and 12. The larger the sample, the more accurate the estimate.
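Here is a minimal sketch of that calculation, using the normal approximation that the Central Limit Theorem justifies (the guest counts below are invented, so the intervals will not match the 7–13 example exactly):

```python
# 95% confidence interval for the mean number of guests, via the CLT's
# normal approximation. The dinner data is invented for illustration.
import math, random

random.seed(42)

def mean_confidence_interval(sample, z=1.96):
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    half = z * sd / math.sqrt(n)
    return round(mean - half, 1), round(mean + half, 1)

dinners_100 = [random.randint(4, 16) for _ in range(100)]               # 100 dinners
dinners_200 = dinners_100 + [random.randint(4, 16) for _ in range(100)]

print(mean_confidence_interval(dinners_100))  # wider interval
print(mean_confidence_interval(dinners_200))  # narrower, by roughly a factor 1/sqrt(2)
```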

Absolute Confidence? The Turkey’s Warning

These estimates are valid with a 95% confidence level. But what if we wanted 100% confidence? Would it be possible? Here’s where our inductivist turkey makes its comeback. If we wanted 100% confidence, we would fall into the same trap as the turkey—attempting to draw conclusions with absolute certainty from a series of observations. As we’ve seen, at the turkey’s expense, this is impossible. The explanation is simple: even with a large and representative sample, it’s never possible to completely eliminate the influence of chance. There’s always a small probability that we’ll encounter an observation—like a Christmas dinner with more or fewer guests than our confidence interval predicts—that contradicts our estimates.

Thus, what statistics can offer in such cases is a very robust estimate of the parameter we’re studying (instead of the number of dinner guests, think about something more critical, like the average income of a population, the efficacy of a drug, or election polls). However, it can never provide absolute certainty about a phenomenon. This is because the world we live in is not deterministic but is partly governed by chance. In this sense, statistics is a science that demonstrates the “non-absoluteness” of other sciences, which is perhaps why it is often feared or disliked.

After all, statistics reached its peak development in the 20th century, the century of relativism—think of Einstein’s theory of relativity, Heisenberg’s uncertainty principle, or Popper’s criterion of falsifiability.

Now, it’s time to eat the turkey before it gets cold!

A statistical approach to terrorism

Datastory.it is also about current events, and following the attacks in Paris, we want to share our opinion on the matter.

The series of attacks that struck the French capital on November 13, 2015, seems to have shaken public opinion and mobilized European governments. In newspapers, parliaments, and international forums, the primary focus is how to ensure safety and prevent the horrific events in Paris from happening again. Many hypotheses are being considered: stricter border controls, revising the Schengen Agreement, increased surveillance in high-risk areas, and the installation of cameras in major cities.

Additionally, there are discussions about allocating more personnel and resources to security (the press mentions €400 million in Belgium, €120 million in Italy). And then there are the bombings in Iraq and Syria, with the United States, Russia, and France taking the lead. Some estimates suggest the U.S. spends $10 million daily on these operations, while Russia spends about a third of that amount.

The caliphate in Iraq and Syria is undoubtedly a threat to the Western world. More broadly, in recent years, Islamist terrorism has caused deaths and suffering even in Europe.

From the Madrid bombings on March 11, 2004, to the Paris attacks on November 13, 2015, 411 people have died in seven Islamist terrorist attacks. The number rises to 488 if we include the July 22, 2011, attack in Norway by Breivik (which had an anti-Islamic motivation).

But how much does it cost us to protect ourselves from terrorist attacks? How far are we willing to go to prevent more lives from being lost to fanatics killing in the name of their religion?

And in a world where resources are limited, what are we willing to give up to increase our security?

Let’s reflect on these questions using some concrete examples. In Italy alone, around 1,000 people die each year in workplace accidents (source: Osservatorio sui morti del lavoro), 3,385 people in road accidents (2013, source: ISTAT), 12,004 women from breast cancer (2012, source: ISTAT), and an astonishing 83,000 deaths are attributable to smoking!

How many lives could be saved if the significant resources budgeted for defense against terrorist attacks were instead used for anti-smoking campaigns? If millions were invested in securing the road network and law enforcement increased traffic and alcohol checks? How many of the 12,000 women who die each year from breast cancer could be saved if we doubled free mammograms?

It’s difficult to answer these questions (though we will delve deeper into this topic), but let’s make a few assumptions. Let’s take the €120 million the government has declared it will allocate to protect us from terrorist attacks following the Paris events. We could decide to allocate it in three different ways, as outlined below:

  • Investment: an information campaign on the dangers of smoking in all schools, plus a free copy of the book “The Easy Way to Stop Smoking” for every smoker who requests it. Assumed impact: a 1% reduction in smoking-related deaths, about 830 lives saved per year.
  • Investment: doubling free mammograms (every year instead of every two years). Assumed impact: a 10% reduction in breast cancer deaths, about 1,200 lives saved per year.
  • Investment: installing speed enforcement systems on all highways and tripling alcohol-related checks. Assumed impact: a 5% reduction in road deaths, about 170 lives saved per year.

How should we decide how to allocate the €120 million? Is it wise to invest it in protection against terrorism (which has caused 488 deaths across Europe in the past 15 years) rather than in breast cancer prevention, which could save 1,200 women per year in Italy alone?

It’s legitimate to think that the Paris attacks did not just cause 136 deaths but also created a sense of insecurity. But how do we determine the value of a life lost to terrorism compared to one lost in a road accident?

How can we allocate our €120 million wisely, rather than being emotionally swayed by recent events?

What do you think? Share your thoughts in the comments!

What’s Datastory.it?

Datastory.it is a forge of numbers, information, and impressions about the reality that surrounds us.

Not just a container, but a workshop full of tools where raw data is analyzed and refined until an essential essence of information emerges. Like artisans of numbers, we will shape data, breathe life into it, and make it a valuable aid in interpreting reality.

Scientific data will be the guiding star that leads us through the events of the world around us. But the paths to reach our destination can be many and vastly different from one another.

Data is unique yet contradictory, unequivocal yet ambiguous – a fundamental pillar of one theory and the cornerstone of its exact opposite. Those who work with numbers know that what truly matters is not the data itself, but the interpretation given to it, and consequently the “story” built around it.

Our goal is to go beyond the first impression of a number, to avoid taking the easiest path simply because it seems straightforward and free of pitfalls. Instead, we will strive to analyze data in all its myriad facets and interpret reality in unconventional ways—sometimes provocative or irreverent.

But we won’t bore you with just numbers and stories about numbers. We’ll also delve into the world of those who work with numbers (a world we’re part of). And finally, we’ll use this platform to share our stories and ideas—please forgive us if some posts stray a bit off-topic.

Welcome aboard, and enjoy the journey.

“…few people will appreciate the music if I just show them the notes. Most of us need to listen to the music to understand how beautiful it is. But often that’s how we present statistics; we just show the notes, we don’t play the music.”
—Hans Rosling
