My Favourite Theorem (Spoiler: It’s Bayes)

Thirty years ago, I was invited to interview for a place at Oxford University. I travelled down from my hometown of Liverpool and at the allotted time entered a fusty room at a storied college. Inside, sitting around a horseshoe table, were seven men. As the interview progressed it became clear that, like the seven dwarves, they each displayed a defining characteristic. One was barely awake, one smiled continuously, one asked the hard questions and so on. It was good cop, bad cop cubed. (Minus one to be exact, because this was after all a maths interview.) To this day I recall with anguish the killer question, the one that tripped me up: “What’s your favourite theorem?”

I froze and after a few long seconds I blurted out something inconsequential. Happy’s smile turned to a smirk, Sleepy grunted and Doc moved the interview on.

It took many years for me to finally identify my favourite theorem.


Thomas Bayes was an eighteenth-century English minister with interests in theology and mathematics. On his death in 1761 a manuscript he’d written passed into the hands of his friend Richard Price. The manuscript included the basis for an essay that was published two years later. The essay was about probability. While the probabilistic link between cause and effect had been worked over by other mathematicians, until Bayes no-one had figured out how to solve the inverse question – what can we tell about the cause of something from its effect? Bayes’ idea was to harness observations about the world to say something about the past, to varying degrees of confidence. Conceptually his idea addresses how we should adjust our estimates of probabilities when we encounter new data that influence a situation.

In shorthand the idea is this: Initial Belief + New Data → Improved Belief.

Or, as the terms went on to become known: Prior + Likelihood of new observation given competing hypothesis → Posterior.

Framed like that, the idea is completely intuitive. Bayes’ friend Richard Price presented the example of a person who emerges into the world and sees the sun rise for the first time. Initially, he doesn’t know whether this is typical or some sort of freak occurrence. However, each day he witnesses the sun rise again his confidence increases that it is a permanent feature of nature. Gradually, through this purely statistical form of inference, the probability he assigns to his prediction that the sun will rise again tomorrow approaches 100 percent. It never reaches precisely 100 percent but it gets close.
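One way to put numbers on the sunrise story is Laplace's rule of succession, a later formalisation rather than Price's own calculation. The sketch below assumes a uniform prior over the unknown chance of a sunrise and that every day observed so far had one; the function name is hypothetical.

```python
from fractions import Fraction

def prob_sunrise_tomorrow(sunrises_seen: int) -> Fraction:
    """Rule of succession, assuming a uniform prior and no failures observed."""
    return Fraction(sunrises_seen + 1, sunrises_seen + 2)

# The estimate climbs towards 1 but never reaches it.
for days in (1, 10, 100, 10_000):
    p = prob_sunrise_tomorrow(days)
    print(f"after {days:>6} sunrises: {float(p):.4%}")
```

Under these assumptions the probability is always a fraction strictly below one, which matches the observation that confidence gets close to 100 percent without ever arriving.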

From philosophy to psychology

But not all priors are as plain as the sun rising, and sometimes we forget them. This is where Bayes shifts from philosophy into psychology. Frequently we are so wowed by new data, we forget to give it the context of the old. In his book Thinking, Fast and Slow, Daniel Kahneman cites a study in which he asks people to rank the subjects that a university student ‘Tom W’ is likely to be studying from a list that includes humanities, computer science, engineering and other degrees. Unsurprisingly, people rank humanities higher than computer science because it is generally a more popular degree. Kahneman then offers the following personality sketch of Tom W “written during Tom’s senior year in high school by a psychologist, on the basis of psychological tests of uncertain validity”:

Tom W is of high intelligence, although lacking in true creativity. He has a need for order and clarity, and for neat and tidy systems in which every detail finds its appropriate place. His writing is rather dull and mechanical, occasionally enlivened by somewhat corny puns and flashes of imagination of the sci-fi type. He has a strong drive for competence. He seems to have little feel and little sympathy for other people, and does not enjoy interacting with others. Self-centred, he nonetheless has a deep moral sense.

On seeing this, most people reach for the stereotypical computer science or engineering student they hold in their head and revise their guess of Tom W’s subject straight to one of those. In so doing they completely forget their prior assessment that humanities students outnumber computer science and engineering students. Even if it is more likely than before that Tom W is a computer science student, the chances are still stacked towards him being a humanities student. People tend to overweight the new information and underweight the old. Kahneman uses the example as a demonstration of his ‘representativeness heuristic’. But ignoring prior information can also be a feature of the ‘availability heuristic’, where we focus on the newest information that is at our fingertips or on screens in front of us. And given our innate love of stories, our deeper appreciation of narrative over statistics, a good story will typically command a much higher weighting in our mind than any statistical priors.
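A small sketch can show how a strong stereotype can still lose to a strong base rate. Every number below is invented for illustration and none comes from Kahneman's study: the priors stand in for how common each degree is, and the likelihoods for how well the sketch fits a typical student of each.

```python
# Hypothetical inputs: NOT taken from Kahneman's study.
prior = {"humanities": 0.20, "computer science": 0.03}       # assumed base rates
likelihood = {"humanities": 0.10, "computer science": 0.50}  # assumed fit of the sketch

# Bayes: posterior is proportional to prior times likelihood.
unnormalised = {subject: prior[subject] * likelihood[subject] for subject in prior}
total = sum(unnormalised.values())

# Normalised over just these two subjects, to keep the comparison simple.
posterior = {subject: weight / total for subject, weight in unnormalised.items()}

for subject, p in posterior.items():
    print(f"{subject}: {p:.0%}")
```

With these made-up inputs the sketch fits a computer scientist five times better, yet humanities remains the likelier answer because it starts so far ahead.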

Bayesian forecasting

The same holds in the field of investing. Kahneman generalises his observations in a framework that differentiates between the outside view of a situation and the inside view. The outside view considers a sweep of similar cases and the inside view considers the specifics of the case in question. In the absence of other information, the outside view can be considered a proxy for the priors. This framework is especially useful when thinking about company earnings forecasts. Michael Mauboussin has conducted extensive research on ‘base rates’ of company performance that provide reference classes on which to base an outside view. He gives the example of Amazon, where (back in 2015) he saw one analyst estimate 15% p.a. revenue growth out for a full ten years. A focus on Amazon itself may well yield such a forecast. Yet a look at all companies since 1950 with initial revenues of US$100bn+ (adjusted for inflation) shows that not a single one grew revenue by 15% p.a. over ten years. Seven companies, out of 313 in the sample, grew revenue by 10% p.a., but not one by 15%.

In any form of forecasting, Bayes has been shown to be an effective tool. Philip Tetlock in his book, Superforecasting, cites research showing that the best forecasters are markedly better ‘Bayesian thinkers’ than others. The research singles out the willingness of superforecasters to change their minds in response to new evidence. Tetlock trains forecasters to tackle problems by breaking them down into smaller questions and using Bayes to drive their forecasts.

Empirically a superforecaster’s ability hinges on them continually evaluating evidence and weighting it appropriately. So-called ‘experts’ rarely do this, sticking to their priors regardless of disconfirming evidence. In Bayes the updated belief is proportional to the prior multiplied by the likelihood of the evidence, and anything multiplied by zero is zero, so a prior of zero on the alternatives means their views can never change. Examples of this are prevalent in politics (“You turn if you want to. The lady’s not for turning”). But politics is not unique – there are many fields where, for various reasons, people have little incentive to shift their view.
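The effect of a zero prior can be seen in a minimal sketch. The `update` helper and the 0.9 and 0.1 likelihoods are assumptions made up for the illustration: two people see the same five pieces of evidence, one starting from a small but non-zero prior and one from a prior of exactly zero.

```python
def update(prior: float, p_evidence_if_true: float, p_evidence_if_false: float) -> float:
    """One Bayesian update for a single yes/no hypothesis."""
    numerator = prior * p_evidence_if_true
    return numerator / (numerator + (1 - prior) * p_evidence_if_false)

open_minded = 0.01  # sceptical, but persuadable
closed = 0.0        # certain the hypothesis is false

# Five observations, each assumed nine times likelier if the hypothesis is true.
for _ in range(5):
    open_minded = update(open_minded, 0.9, 0.1)
    closed = update(closed, 0.9, 0.1)

print(f"open-minded: {open_minded:.3f}, closed: {closed:.3f}")
```

The sceptic ends up nearly convinced, while the zero prior stays at zero however much evidence arrives.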

The other extreme is not helpful to forecasting either – pivoting completely towards the last thing someone said. That’s what sent those involved in Kahneman’s Tom W study off kilter. It’s also what sends sports fans off kilter (“But he’s in good form”) and it can send investors off kilter, too (“This time is different”). This is where filtering signal from noise is critical (yet hard). All the more so because the news doesn’t reflect the proper frequency of events, over-representing plane crashes and terrorist attacks for example. Indeed, the very best stories are edge cases from which we should protect our priors.

Spiky priors

Splitting a decision process into priors and observations sheds light on many other problems we face. In his book How Not To Be Wrong, Jordan Ellenberg asks why the series RBRRB looks random while the series RRRRR doesn’t. He suggests that we have a prior belief that the system that generated the series, in his case a roulette wheel, may be rigged and as observations come in, we update that belief. Even though our trusting selves may place a very low probability on a rigged system, when five reds in a row come up it increases our sense that the wheel may be rigged towards red and therefore that the series is not truly random. He points out that the wheel could equally be rigged to throw up RBRRB, but that is too nuanced a theory for us to entertain. He writes, “our priors are not flat, but spiky. We assign a lot of mental weight to a few theories, while others, like the RBRRB theory, get assigned a probability almost indistinguishable from zero.”
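The roulette intuition can be sketched with two competing hypotheses, "fair" and "rigged towards red". The numbers are assumptions chosen for illustration and do not come from Ellenberg: a 1% prior that the wheel is rigged, a rigged wheel that lands red 90% of the time, and a fair wheel treated as a 50/50 coin (ignoring the green zero).

```python
def posterior_rigged(spins: str, prior_rigged: float = 0.01,
                     p_red_if_rigged: float = 0.9) -> float:
    """Probability the wheel is rigged towards red after seeing `spins` (R or B)."""
    like_rigged = 1.0
    like_fair = 1.0
    for spin in spins:
        like_rigged *= p_red_if_rigged if spin == "R" else 1 - p_red_if_rigged
        like_fair *= 0.5  # simplification: a fair wheel is treated as a coin flip
    numerator = prior_rigged * like_rigged
    return numerator / (numerator + (1 - prior_rigged) * like_fair)

for series in ("RRRRR", "RBRRB"):
    print(series, f"{posterior_rigged(series):.1%}")
```

Both series are equally likely on a fair wheel, but only RRRRR is well explained by the one alternative theory we give any weight to, so only RRRRR raises our suspicion.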

The data informing priors can come from a range of sources. They can come from historic precedent, such as the number of days historically in which the sun has risen; they can come from a knowledge of the domain, such as popularity of different degrees taken at university. Or they can be completely subjective, a guess. It was such subjectivity that made Bayes controversial in some quarters. Practitioners in some branches of statistics viewed any element of subjectivity as anathema to the scientific process.

The richer the prior information we have, the more useful the information we can get out of Bayes. But even if we have nothing, we usually know something about the distribution in which the evidence lies. We know that the lifespan of dogs follows a normal distribution. So we can make a fairly good stab at how many more years a 1-year-old dog will survive, and how many more years a 10-year-old will survive. We know that movie box office takings follow a power law distribution. So if a movie takes in US$100m on its opening weekend, we can assume it will go on to take some multiple of that over its run. An innate understanding of whether the evidence forms a normal distribution, a power law distribution, or some other distribution allows us implicitly to use Bayes to come up with what can be fairly accurate guesses about things.

Navigating between hypotheses

Bayes’ simplicity is undermined by our natural tendency not to weight priors (or observations) appropriately. But there is another aspect to Bayes that elevates it above a simple statement of intuition. That’s when we are trying to consider more than one hypothesis simultaneously.

Several years after the publication of Bayes’ essay, French mathematician Pierre-Simon Laplace refined Bayes’ formulation into the mathematical construct we utilise today. The formula is an algebraic expression that relates three known variables and one unknown one.
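In standard modern notation the formula can be written as below. The symbols are labels chosen here rather than anything from Bayes or Laplace: H is a hypothesis and E is the evidence observed. The three known quantities are the prior P(H), the likelihood P(E | H) and the overall probability of the evidence P(E); the unknown is the posterior P(H | E).

```latex
P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)},
\qquad
P(E) = P(E \mid H)\, P(H) + P(E \mid \neg H)\, P(\neg H)
```

The second expression spells out P(E) for the case of just two competing hypotheses, H and not-H.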

That formula helps to solve many problems. In his book The Signal and the Noise, Nate Silver uses examples of terror attacks, breast cancer tests and being cheated on to illustrate how Bayes can be applied. Kahneman uses an example involving a green or blue taxi in a hit-and-run accident. Jordan Ellenberg looks at a Facebook algorithm for detecting terrorists.

Let’s take the breast cancer example. Assume we live in a country with 10 million women in their forties. The chance that a woman will develop breast cancer in her forties is 1.4%. Mammograms provide a good test for breast cancer. They detect it 75% of the time. But they also give out false positives, i.e. they say a woman has breast cancer when she actually doesn’t, about 10% of the time. At first, we might just focus on the result of the mammogram and assume that if it comes out positive, it’s bad news. But just like in the Tom W case above, it’s also important to look at the priors. It is even more important here, because the false positive gives us an alternative hypothesis to weigh.

Suppose all the women take the mammogram test. We know that 1.4% of the women have breast cancer so that’s 140,000 women. The test will correctly show up 75% of those, or 105,000, as positive. If 1.4% of the women have breast cancer, it means that 98.6% don’t, so 9.86 million women don’t have breast cancer. But because the false positive rate of the test is 10%, 986,000 of these women will falsely screen as having breast cancer. So, the test will throw up over a million cases (986,000 plus 105,000) of women screening positive for breast cancer, although of those only 105,000 will actually have the disease. Which means that even if a woman screens positive there’s a very high likelihood (~90%) that she doesn’t have breast cancer, and is falsely alarmed.
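The arithmetic above can be checked in a few lines. This is only a sketch that reuses the figures from the text; the variable names are mine, and exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction

women = 10_000_000
base_rate = Fraction(14, 1000)            # 1.4% have breast cancer
detection_rate = Fraction(75, 100)        # test catches 75% of real cases
false_positive_rate = Fraction(10, 100)   # 10% of healthy women test positive

with_cancer = women * base_rate                          # 140,000
without_cancer = women - with_cancer                     # 9,860,000
true_positives = with_cancer * detection_rate            # 105,000
false_positives = without_cancer * false_positive_rate   # 986,000

# Chance that a woman who tests positive does NOT have breast cancer.
p_healthy_given_positive = false_positives / (true_positives + false_positives)
print(f"{float(p_healthy_given_positive):.1%}")  # prints 90.4%
```

Counting people rather than multiplying probabilities is the same calculation as the formula, just easier to follow.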

There are two different questions to be asked here. One is: What’s the chance that a woman tests positive on a mammogram, given that she doesn’t have breast cancer? The other is: What’s the chance that she doesn’t have breast cancer, given that she tests positive? They sound the same, but they’re not. The answer to the first is the false positive rate – 10%. The answer to the second, as we saw above, is around 90%. Unfortunately, many doctors don’t know the difference. In his book Risk Savvy, Gerd Gigerenzer writes that when he presented the problem to a group of doctors at a continuing medical education session, 80% of them got it wrong.

The danger of confusing these two questions can become apparent in a court of law. This is the issue that lies at the heart of the prosecutor’s fallacy. The court may be tempted to ponder the question ‘How likely would an innocent person be to look guilty?’. But what they should be pondering is ‘How likely is this guilty-looking defendant to be innocent?’. In the case of Sally Clark, answering the wrong question had tragic consequences.

The ultimate learning algorithm

Bayes has multiple uses across a range of fields. It was used to defend Alfred Dreyfus on appeal, it was used by Alan Turing to crack the Enigma code and it has been used to search for shipwrecks. In the world around us, it drives how spam filters work, how self-driving cars behave and how search engines work. It is used in astronomy, in genetics, and in medicine.

Bayes is also a great algorithm for learning. It’s a system for turning data into knowledge. The rise of big data and improved computing power has unleashed its capabilities, bringing it into use more and more. From a personal perspective I like it for that meta-quality of being about learning. It’s a theorem I found out about many years ago. I learned its name and applied it in maths problems, but it took many years really to internalise it. As in business, where execution is more important than the idea, in learning, internalising an idea is more important than simply knowing it.


Several years after that interview in Oxford I found myself in another interview, this time for a graduate job on the derivatives desk of a major bank. The interviewer posed me a question, which I later found out was the Monty Hall problem. Three doors, there’s a goat behind two of them and a car behind one. I choose a door, the host then opens one of the other doors to reveal a goat. Question: do I stick with the door I’ve chosen, or do I switch? Had Bayes been my favourite theorem, the answer would have been obvious.
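For the curious, here is one way the Bayesian reasoning might be written down. The sketch assumes the usual rules, which the interview question left implicit: the host knows where the car is, always opens a goat door other than mine, and picks at random when he has a choice. The door numbers and function name are arbitrary.

```python
from fractions import Fraction

doors = (1, 2, 3)
my_pick, opened = 1, 3  # I chose door 1; the host opened door 3 to show a goat

# Prior: the car is equally likely to be behind any door.
prior = {car: Fraction(1, 3) for car in doors}

def p_host_opens(opened_door: int, car: int, pick: int) -> Fraction:
    """Likelihood that the host opens `opened_door`, given where the car is."""
    if opened_door in (car, pick):
        return Fraction(0)  # he never reveals the car or opens my door
    allowed = [d for d in doors if d not in (car, pick)]
    return Fraction(1, len(allowed))

unnormalised = {car: prior[car] * p_host_opens(opened, car, my_pick) for car in doors}
evidence = sum(unnormalised.values())
posterior = {car: weight / evidence for car, weight in unnormalised.items()}

for car, p in posterior.items():
    print(f"car behind door {car}: {p}")
```

Under those assumptions my original door stays at one in three and the other unopened door rises to two in three, so I should switch.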


Sources and Bibliography

  • The Theory That Would Not Die, Sharon Bertsch McGrayne
  • How Not To Be Wrong, Jordan Ellenberg
  • Thinking, Fast and Slow, Daniel Kahneman
  • The Signal and the Noise, Nate Silver
  • Algorithms to Live By, Brian Christian and Tom Griffiths
  • Superforecasting, Philip Tetlock
  • Ten Great Ideas About Chance, Persi Diaconis and Brian Skyrms
  • The Master Algorithm, Pedro Domingos
  • Risk Savvy, Gerd Gigerenzer
  • Picture credit: M Telfer, sciencenews.org

