The Uselessness of Absolute Numbers

A few days ago, I was reading an article about accidents involving cyclists. Being an avid cyclist and dealing with data and numbers of all kinds every day, I immediately noticed this sentence: “The regions most affected by accidents are those where bicycles are a real tradition: Lombardy, Veneto, Emilia Romagna, and Tuscany. Incidents tend to occur on Saturdays and Sundays, between 10 AM and 12 PM, during the months of May to October, with a peak in August.”

What seems odd to you?

After reflecting for a moment, it’s clear that the regions, days, and times when accidents are most frequent are simply those when cyclists are most frequently on the road. This is a type of error or oversight that’s fairly common among journalists, who, being less familiar with numbers, often report data without critically analyzing it.

Saying that accidents happen more frequently on Saturdays and Sundays doesn’t provide any useful information, because those are the two days when cyclists are on the roads the most, and thus are also the days with the highest risk of accidents (the same logic applies to the most popular months and times of day). In this case, the raw data doesn’t make any sense unless it’s somehow “cleaned up.”

What we need in such cases is a benchmark against which to compare the results. In the case of our article, a simple benchmark could be the ratio between the number of accidents and the number of cyclists on the road that day. This means that, instead of looking at the absolute number of accidents, we look at the relative number. By doing this, we give each day of the week the same “probability” of being the most dangerous day, removing the natural advantage that days like Saturday or Sunday have due to the higher number of cyclists.

Let’s take a look at the table below (note that the numbers are fictional):

Day   Number of accidents   Number of cyclists   Ratio
Mon   10                    1,000                1.0%
Tue   15                    2,000                0.8%
Wed   10                    1,500                0.7%
Thu   15                    1,000                1.5%
Fri   20                    3,000                0.7%
Sat   40                    8,000                0.5%
Sun   60                    10,000               0.6%

If we consider the absolute number of accidents, Sunday is the most dangerous day with 60 accidents. However, if we use the correct benchmark, dividing the number of accidents by the number of cyclists on the road, the most dangerous day in relative terms becomes Thursday, with a ratio of 1.5%.
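
To make the comparison concrete, here is a minimal Python sketch (using only the fictional figures from the table above) that contrasts the day with the most accidents against the day with the highest accidents-per-cyclist ratio:

```python
# Minimal sketch: recomputing the (fictional) table above and contrasting
# absolute accident counts with the accidents-per-cyclist ratio.

accidents = {"Mon": 10, "Tue": 15, "Wed": 10, "Thu": 15, "Fri": 20, "Sat": 40, "Sun": 60}
cyclists = {"Mon": 1000, "Tue": 2000, "Wed": 1500, "Thu": 1000, "Fri": 3000, "Sat": 8000, "Sun": 10000}

# Relative risk: accidents per cyclist on the road that day
ratio = {day: accidents[day] / cyclists[day] for day in accidents}

most_accidents = max(accidents, key=accidents.get)  # highest absolute count -> Sun
most_dangerous = max(ratio, key=ratio.get)          # highest relative risk  -> Thu

print(f"Most accidents (absolute): {most_accidents}, {accidents[most_accidents]} accidents")
print(f"Most dangerous (relative): {most_dangerous}, {ratio[most_dangerous]:.1%} accidents per cyclist")
```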

A similar example of this concept is found in marketing: the effectiveness of an advertising campaign targeted at a certain group of people is evaluated not only by looking at the absolute value, but by comparing it to the results of a “control group” – a group of customers who are not exposed to the campaign. Only by “relativizing” the absolute results can we determine whether the campaign was effective or not.
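
As a rough illustration of the same idea, here is a hedged sketch in Python: the customer counts and conversions below are invented, and the point is simply that the campaign is judged by its lift over the control group rather than by its absolute result:

```python
# Minimal sketch: judging a campaign against a control group rather than by
# its absolute results. All figures are invented for illustration.

campaign_customers, campaign_buyers = 5000, 400   # exposed to the campaign
control_customers, control_buyers = 5000, 350     # not exposed

campaign_rate = campaign_buyers / campaign_customers   # 8.0%
control_rate = control_buyers / control_customers      # 7.0%

# The absolute number of buyers (400) says little on its own;
# the lift over the control group is what measures effectiveness.
lift = campaign_rate - control_rate
print(f"Campaign: {campaign_rate:.1%}, Control: {control_rate:.1%}, Lift: {lift:+.1%}")
```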

So, when you come across statistics and conclusions like this, pay attention to the data and ask yourself whether they’ve been properly analyzed or not, because otherwise, they might not make any sense.

And in any case, to stay safe and keep yourself out of articles like this one, be careful around cars when you go out cycling!

The Sad Story of the Inductivist Turkey

It’s Christmas dinner, an allegory of abundance and a stage for opulence. Your neighbor at the table, probably a fourth cousin whose name you barely remember, is starting to show signs of giving up and is desperately seeking your complicit gaze. But with feigned nonchalance and reckless boldness, you act as if you’re still hungry, even though the amount of food you’ve just consumed could satisfy the caloric needs of the entire province of Isernia. Then, as the third hour of dinner strikes, a new, succulent course is brought out: a stuffed turkey.

At that moment, in a fleeting pang of consciousness – typically left at home during such occasions (otherwise, how else could one explain such an absurd amount of food?) – you wonder about the story behind the turkey in front of you.

This turkey lived on a farm where, from day one, it was fed regularly. The turkey noticed that food was brought every day at the same time, regardless of the season, weather, or other external factors.

Over time, it began to derive a general rule based on repeated observation of reality. It began to embrace an inductivist worldview, collecting so many observations that it eventually made the following assertion:

“Every day at the same time, they will bring me food.”

Satisfied and convinced by its inductivist reasoning, the turkey continued to live this way for several months. Unfortunately for the turkey, its assertion was spectacularly disproven on Christmas Eve when its owner approached as usual, but instead of bringing food, he slaughtered it to serve at the very Christmas dinner you are attending.

The Turkey and Inductivism

This sad story is actually a famous metaphor developed by Welsh philosopher Bertrand Russell in the early 20th century. It clearly and simply refutes the idea that repeated observation of a phenomenon can lead to a general assertion with absolute certainty. The story of the inductivist turkey dates back to a time when Russell opposed the ideas of the Vienna Circle’s neopositivists, who placed unconditional trust in science—particularly inductivism—and regarded it as the only possible means of acquiring knowledge.

The turkey’s example was later adopted by Austrian philosopher Karl Popper, who used it to support his principle of falsifiability. According to this theory—one of the 20th century’s most brilliant—science progresses through conjectures that are never definitive and can always be falsified, meaning disproven by reality. There is no science if the truths it produces are immutable and unfalsifiable. Without falsifiability, there can be no progress, stimulation, or debate.

What Does This Mean for the Turkey?

Returning to the turkey’s situation, does this mean it’s impossible to draw conclusions based on experience? Of course not. The study of specific cases helps us understand the general phenomenon we’re investigating and can lay the groundwork for developing general laws. However, the truth of any conclusion we reach is never guaranteed. In simpler terms, if a flock of sheep passes by and we see 100 white sheep in a row, that doesn’t mean the next one will also be white. From an even more pragmatic perspective, no number of observations can guarantee absolute conclusions about the phenomenon in question.

Implications for Statistics and Inference

Statistics, and particularly inference—a core component of statistics—derive their philosophical foundations from this concept. The purpose of inference is to draw general conclusions based on partial observations of reality, or a sample.

For example, let’s say we want to estimate the average number of guests at a Christmas dinner. How would we do that? Let’s set aside the turkey for a moment, put down our forks and knives, and imagine we have a sample of 100 Christmas dinners where we count the number of guests. Thanks to two fundamental results of statistics, the Law of Large Numbers and the Central Limit Theorem, we can assert that the average number of guests observed in our sample is a sound estimate of the true population mean (provided the sample is representative and unbiased, but that’s a topic for another day), and that the error of this estimate decreases as the sample size increases. In other words, the more dinners we include in our sample, the more robust and accurate the estimate becomes. Logical, right?

But how certain are we that our estimate is correct? Suppose we’ve determined that the average number of guests across 100 dinners is 10. From this observation, we can also calculate an interval within which the true average is likely to fall. With a sample of 100 units, we can assert with a certain level of confidence (typically 95%) that the true average number of guests lies between 7 and 13. With a sample of 200 units, our estimate becomes more precise, narrowing the interval to between 8 and 12. The larger the sample, the more accurate the estimate.
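
As a minimal sketch of how such an interval can be computed with a normal approximation (the standard deviation of roughly 15 guests is chosen only so that the output lands near the intervals quoted above; it is not a real figure):

```python
import math

def mean_confidence_interval(sample_mean, sample_sd, n, z=1.96):
    """Normal-approximation confidence interval for a population mean (95% for z=1.96)."""
    margin = z * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Illustrative numbers only: a sample mean of 10 guests and a standard deviation
# of roughly 15 reproduce intervals close to those quoted in the text.
print(mean_confidence_interval(10, 15.3, 100))   # roughly (7, 13)
print(mean_confidence_interval(10, 15.3, 200))   # roughly (8, 12)
```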

Absolute Confidence? The Turkey’s Warning

These estimates are valid with a 95% confidence level. But what if we wanted 100% confidence? Would it be possible? Here’s where our inductivist turkey makes its comeback. If we wanted 100% confidence, we would fall into the same trap as the turkey—attempting to draw conclusions with absolute certainty from a series of observations. As we’ve seen, at the turkey’s expense, this is impossible. The explanation is simple: even with a large and representative sample, it’s never possible to completely eliminate the influence of chance. There’s always a small probability that we’ll encounter an observation—like a Christmas dinner with more or fewer guests than our confidence interval predicts—that contradicts our estimates.

Thus, what statistics can offer in such cases is a very robust estimate of the parameter we’re studying (instead of the number of dinner guests, think about something more critical, like the average income of a population, the efficacy of a drug, or election polls). However, it can never provide absolute certainty about a phenomenon. This is because the world we live in is not deterministic but is partly governed by chance. In this sense, statistics is a science that demonstrates the “non-absoluteness” of other sciences, which is perhaps why it is often feared or disliked.

After all, statistics reached its peak development in the 20th century, the century of relativism—think of Einstein’s theory of relativity, Heisenberg’s uncertainty principle, or Popper’s criterion of falsifiability.

Now, it’s time to eat the turkey before it gets cold!

Datastory.it at the KNIME Spring Summit 2016

From February 24 to 26, 2016, datastory.it will take part in the KNIME Spring Summit 2016 in Berlin, the annual conference of KNIME users. KNIME is a data analytics tool that is not very widespread but is highly appreciated by those who use it, and in 2016 it was once again ranked among the best software in its category by the prestigious research firm Gartner.

The best thing about KNIME is that you can download it directly from its website: it is completely free and open source.

Those who don’t work in the field may be wondering what KNIME is actually for. Here are a few examples:

Continue reading

Beware of Overdue Numbers

When you start a blog, one of the first things to do is find a name. Before settling on datastory.it, we considered several options, but some of them belonged to domains that were already taken. On one of those sites we found a sentence that made the few hairs left on our heads stand on end, and it read more or less like this: “This site contains an algorithm capable of generating numbers for the Lotto game that have a higher probability of being drawn than the others.”

Words like these sound to a statistician’s ears more or less the way blasphemy sounds to a priest’s. Have you ever heard of “hot numbers” or “overdue numbers”? Surely you have. Well, we can assure you that these numbers are meaningless, and there is therefore no algorithm capable of generating numbers that are more likely to be drawn than others. Let’s try to understand why.
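
In the meantime, here is a minimal sketch, assuming a simplified Lotto-style draw of 5 numbers out of 90 (the target number and the number of simulated draws are arbitrary), showing that a number’s long-run frequency matches the theoretical probability shared by every number, no matter how “overdue” it is:

```python
import random

# Minimal sketch, assuming a simplified Lotto-style draw: 5 numbers out of 1-90.
# Draws are independent, so how long a number has been "overdue" has no
# bearing on its chance of appearing in the next draw.

random.seed(42)

N_DRAWS = 200_000
TARGET = 7          # an arbitrary number; pretend it is a long-"overdue" one
hits = 0

for _ in range(N_DRAWS):
    draw = random.sample(range(1, 91), 5)   # one independent draw
    if TARGET in draw:
        hits += 1

print(f"Empirical frequency of {TARGET}: {hits / N_DRAWS:.4f}")
print(f"Theoretical probability:       {5 / 90:.4f}")   # ~0.0556, identical for every number
```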

Continue reading

A Statistical Approach to Terrorism

Datastory.it also covers current affairs, and in the wake of the Paris attacks we want to share our opinion on the matter.

The series of attacks that struck the French capital on November 13, 2015 seems to have shaken public opinion and mobilized European governments. In newspapers, in parliaments, and in international forums, the only topic of discussion is how to guarantee security and prevent the atrocious events of Paris from happening again. Many proposals are under consideration: we have heard talk of stricter border controls, a revision of the Schengen Treaty, increased surveillance of high-risk locations, and the installation of cameras in large cities.

And then there is the call for more personnel and resources for the security sector (the press mentions 400 million euros in Belgium and 120 million in Italy). And the bombings in Iraq and Syria, with the United States, Russia, and France among the main players. Some estimates put the cost at 10 million dollars a day spent by the United States, and about a third of that by Russia.

Continue reading

Here We Go! What Is datastory.it?

Datastory.it is a forge of numbers, information, and impressions about the world around us. Not a mere container, but a workshop full of tools where raw data is analyzed and filtered until its essential informational content is extracted. Like artisans of numbers, we will shape the data and breathe life into it, so that it becomes a valuable aid in interpreting reality.

Scientific data will be the North Star guiding our journey through the events of the world around us. But the routes we can take to reach that destination are many and very different from one another.

Data is in fact unambiguous yet contradictory, unequivocal yet at the same time ambiguous, the linchpin of one theory and the supporting pillar of its exact opposite. Anyone who works with numbers knows that what matters is not the data itself, but the interpretation it is given, and consequently the “story” built around it.

Our intention is not to stop at the first impression a number makes, and not to take the easiest road just because it seems immediate and free of traps. On the contrary, we will try to analyze data in all of its many facets and to interpret reality in unusual, sometimes even provocative or irreverent, ways.

But we won’t bore you only with numbers and stories about numbers; we will also try to recount the world of those who work with numbers (a world we are part of). And finally, we reserve the right to use this platform to share our own stories and ideas, so forgive us if the occasional post turns out to be a bit off topic.

Welcome aboard, and happy listening…

“…few people will appreciate the music if I just show them the notes. Most of us need to listen to the music to understand how beautiful it is. But often that’s how we present statistics; we just show the notes, we don’t play the music.”
Hans Rosling

 
