The mathematical foundations of probability

Tivadar Danka

Understanding math will make you a better engineer.

So, I am writing the best and most comprehensive book about it.

Abstraction is there to hide irrelevant things and focus only on the essential details. Although it may seem scary sometimes, it is the best tool we have to manage complexity.

If you ask $n$ mathematicians to define what mathematics is about, you'll probably get $2n$ different answers. My definition would be that it is the science of abstracting things away until only the core is left, providing the ultimate framework for reasoning about anything.

Have you ever thought about what probability really is? You have already used it to reason about data, do statistical analysis, or even build algorithms to do the reasoning for you by statistical learning. In this post, we go deep into the rabbit hole and explore the theory of probability with a magnifying glass.


To follow through, you don't need any advanced mathematics. I am focusing on explaining everything from the ground up. However, it is beneficial if you know the following:

  • Sets and set operations such as union, intersection, and difference.
  • Limits and some basic calculus.

Sets and measures

Probability can be heuristically thought of as a function measuring the likelihood of an event happening. Mathematically speaking, it is not yet clear at all what events and measures are. Before we can properly discuss probability, we need to establish a solid footing. So, let's start with events.


What is the probability that I roll an odd number with a die?

This simple question often comes to mind as an example when talking about probabilities. Here, the event is rolling an odd number. To model this mathematically, we use sets. The "universe" - the base set containing the outcomes of this experiment - is simply $\Omega = \{1, 2, 3, 4, 5, 6\}$, and an event is a subset of $\Omega$. Rolling an odd number corresponds to the subset $A = \{1, 3, 5\}$.

So, to define probability, you need an underlying set $\Omega$ and a collection $\Sigma$ of its subsets, which we will call events. However, $\Sigma$ cannot be just any collection of subsets. There are three conditions that must be met.

  1. $\Omega$ is an event.
  2. If $X$ is an event, then its complement $\Omega \setminus X$ is also an event. That is, an event not happening is another event as well.
  3. The union of countably many events is an event as well. In other words, $\Sigma$ is closed under countable unions.

If these are satisfied, $\Sigma$ is called a σ-algebra. In proper mathematical terminology:

$$\begin{aligned} &1) \quad \Omega \in \Sigma, \\ &2) \quad X \in \Sigma \implies \Omega \setminus X \in \Sigma, \\ &3) \quad X_1, X_2, \dots \in \Sigma \implies \bigcup_{n=1}^{\infty} X_n \in \Sigma. \end{aligned}$$

In our case, we have $\Omega = \{1, 2, 3, 4, 5, 6\}$ and $\Sigma = 2^\Omega$, the collection of all subsets of $\Omega$.
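For a finite base set like the die, the three conditions can be checked by brute force. The following is a minimal sketch in Python; the helper names `power_set` and `is_sigma_algebra` are made up for this illustration and do not come from any library. On a finite set, checking pairwise unions is enough, since every countable union of events reduces to a finite one.

```python
from itertools import chain, combinations


def power_set(omega):
    """Return all subsets of omega as a set of frozensets."""
    items = list(omega)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return {frozenset(s) for s in subsets}


def is_sigma_algebra(omega, sigma):
    """Check the three sigma-algebra conditions on a finite base set."""
    omega = frozenset(omega)
    # 1) The base set itself is an event.
    if omega not in sigma:
        return False
    # 2) The complement of every event is an event.
    if any(omega - x not in sigma for x in sigma):
        return False
    # 3) The union of any two events is an event
    #    (sufficient for countable unions when sigma is finite).
    return all(x | y in sigma for x in sigma for y in sigma)


omega = {1, 2, 3, 4, 5, 6}
sigma = power_set(omega)
odd = frozenset({1, 3, 5})

print(len(sigma))                      # number of events in the power set
print(odd in sigma)                    # rolling an odd number is an event
print(is_sigma_algebra(omega, sigma))  # the power set passes all three checks

# A collection that forgets the complement of `odd` fails condition 2.
too_small = {frozenset(), odd, frozenset(omega)}
print(is_sigma_algebra(omega, too_small))
```

The last check fails because $\{2, 4, 6\}$, the complement of rolling an odd number, is missing from the collection.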

A more interesting case arises when $\Omega$ is the set of real numbers. Later we'll see that if all subsets of the real numbers are considered events, then strange things can happen.

Describing σ-algebras

These event spaces, which we define with σ-algebras, can be hard to describe. To have a meaningful event space on an infinite base set $\Omega$, we typically need an infinite number of events. For instance, suppose we are shooting bullets at a board and calculating the probability of hitting a specific region. In such cases, it is enough to specify some subsets and take the smallest σ-algebra containing them.

Let's suppose we are shooting at a rectangular board. If we say that our event space is the smallest σ-algebra containing all rectangular subsets of the board, then this event space

  1. has a straightforward description,
  2. contains all kinds of shapes (since σ-algebras are closed under countable unions).

A lot of sets can be described as the infinite union of rectangles, as we see below.

Figure: an arbitrary set as the union of rectangles.

We call the set of rectangles inside the board the generating set, while the smallest σ-algebra is the generated σ-algebra.

$$\begin{aligned} \Omega &= [0, 1] \times [0, 1], \\ \mathcal{D} &= \{ (a_1, b_1] \times (a_2, b_2] : 0 \leq a_i, b_i \leq 1 \}, \\ \Sigma &= \sigma(\mathcal{D}). \end{aligned}$$

You can think about this generating process as taking all elements of your generating set and taking unions and complements in all the possible ways.
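On a finite base set, this generating process can be carried out literally. The sketch below keeps adding complements and unions until nothing new appears; `generate_sigma_algebra` is a hypothetical helper written for this post, and the four-element base set is just a toy stand-in for the board.

```python
def generate_sigma_algebra(omega, generators):
    """Smallest sigma-algebra on a finite omega containing the generators."""
    omega = frozenset(omega)
    sigma = {frozenset(), omega} | {frozenset(g) for g in generators}
    while True:
        # Everything reachable in one step by complements and pairwise unions.
        new = {omega - x for x in sigma} | {x | y for x in sigma for y in sigma}
        if new <= sigma:
            # Nothing new was produced, so sigma is closed under both operations.
            return sigma
        sigma |= new


omega = {1, 2, 3, 4}
sigma = generate_sigma_algebra(omega, [{1}, {2}])

for event in sorted(sigma, key=lambda e: (len(e), sorted(e))):
    print(sorted(event))
```

Note that $\{3, 4\}$ ends up in the generated σ-algebra, but $\{3\}$ does not: the generators $\{1\}$ and $\{2\}$ cannot tell 3 and 4 apart.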

Now that we have a mathematical framework to work with events, we shall focus on measures.


Although measuring something is intuitively clear, it is challenging to formalize properly. A measure is a function mapping sets to numbers. To consider a simple example, measuring the volume of a three-dimensional object seems simple enough, but even here, we have serious problems. Can you think of an object in space whose volume you cannot measure?

You probably can't think of one right away, but such objects do exist. We can show that if every subset of space had a well-defined volume, you could take a solid ball of unit volume, cut it up into finitely many pieces, and reassemble them into two balls of unit volume each.

The Banach-Tarski paradox. Source: Wikipedia

This is called the Banach-Tarski paradox. Since moving pieces around cannot double the total volume, it follows that you cannot measure the volume of every subset of space.

But in this case, what are measures anyway? We only have three requirements:

  • the measure of any set should never be negative,
  • the measure of the empty set should be zero,
  • and if you sum up the measures of disjoint sets, you get the measure of their union.

To define them properly, we need a base set $\Omega$ and a σ-algebra $\Sigma$ of its subsets. The function

$$\mu: \Sigma \to [0, \infty)$$

is a measure if

$$\begin{aligned} &1) \quad \mu(E) \geq 0 \text{ for all } E \in \Sigma, \\ &2) \quad \mu(\emptyset) = 0, \\ &3) \quad \mu\left(\bigcup_{n=1}^{\infty} E_n\right) = \sum_{n=1}^{\infty} \mu(E_n) \text{ if } E_n \in \Sigma \text{ are disjoint sets}. \end{aligned}$$

Property 3. is called σ-additivity. If we only have a finite number of sets, we will simply refer to it as the additivity of the measure.

This definition is simply the abstraction of the volume measure. It might seem strange, but these three properties are all that matters. Everything else follows from them. For instance, we have

$$\mu(A \setminus B) = \mu(A) - \mu(B), \quad B \subseteq A,$$

which follows from the fact that $A \setminus B$ and $B$ are disjoint, and their union is $A$.
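To see these properties on something concrete, here is a small sketch using the counting measure on finite sets, which simply counts the elements of a set. The function name `counting_measure` and the example sets are chosen only for this illustration.

```python
def counting_measure(event):
    """The counting measure: the number of elements in a finite set."""
    return len(event)


A = frozenset({1, 2, 3, 4, 5})
B = frozenset({2, 4})        # B is a subset of A
C = frozenset({6, 7})        # C is disjoint from A

# The measure of the empty set is zero.
assert counting_measure(frozenset()) == 0

# Additivity: for disjoint sets, the measure of the union is the sum.
assert counting_measure(A | C) == counting_measure(A) + counting_measure(C)

# The difference rule, valid because B is a subset of A.
assert counting_measure(A - B) == counting_measure(A) - counting_measure(B)

print(counting_measure(A - B))
```

The difference rule is just additivity in disguise: `A - B` and `B` are disjoint and their union is `A`, so their measures add up to the measure of `A`.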

Another important property is the continuity of measures. This says that

$$\begin{aligned} &1) \quad \mu\left(\bigcup_{k=1}^{\infty} E_k\right) = \lim_{n \to \infty} \mu\left(\bigcup_{k=1}^{n} E_k\right) \quad \text{if } E_n \subseteq E_{n + 1}, \\ &2) \quad \mu\left(\bigcap_{k=1}^{\infty} E_k\right) = \lim_{n \to \infty} \mu\left(\bigcap_{k=1}^{n} E_k\right) \quad \text{if } E_n \supseteq E_{n + 1}. \end{aligned}$$

Describing measures

As we have seen for σ-algebras, you only have to give a generating set instead of the full σ-algebra. This is very useful when working with measures. Although measures are defined on σ-algebras, it is usually enough to define them on a generating set. Under mild conditions on the generating set (for instance, when it is closed under intersection), σ-additivity then determines the measure on every element of the σ-algebra.

The definition of probability

Now everything is set to define probability mathematically. A probability space is defined by the tuple $(\Omega, \Sigma, P)$, where $\Omega$ is the base set, $\Sigma$ is a σ-algebra of its subsets, and $P$ is a measure such that $P(\Omega) = 1$.

So, probability is strongly related to quantities like area and volume. Area, volume, and probability are all measures in their own spaces. However, this is quite an abstract concept, so let's give a few examples.

Coin tossing

Tossing a coin gives the simplest probability space. If we encode heads as 0 and tails as 1, we have

$$\begin{aligned} \Omega &= \{ 0, 1 \}, \\ \Sigma &= \{ \emptyset, \{0\}, \{1\}, \{0, 1\} \}, \\ P(\{0\}) &= P(\{1\}) = \frac{1}{2}. \end{aligned}$$

Due to the properties of the σ-algebra and the measure, you only need to define the probability for the event $\{0\}$ (heads) and the event $\{1\}$ (tails). This determines the probability measure entirely.
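Here is a minimal sketch of how the two elementary probabilities determine everything else. The names `atoms`, `P`, and `events` are picked for this example, and exact fractions are used to avoid floating point noise.

```python
from fractions import Fraction
from itertools import combinations

# Probabilities of the elementary events: 0 = heads, 1 = tails.
atoms = {0: Fraction(1, 2), 1: Fraction(1, 2)}
omega = frozenset(atoms)


def P(event):
    """Probability of an event, obtained by additivity from the atoms."""
    return sum((atoms[outcome] for outcome in event), Fraction(0))


# All four events of the sigma-algebra: {}, {0}, {1}, {0, 1}.
events = [
    frozenset(subset)
    for r in range(len(omega) + 1)
    for subset in combinations(sorted(omega), r)
]

for event in events:
    print(sorted(event), P(event))
```

The empty event gets probability 0 and the full set $\{0, 1\}$ gets probability 1 without us ever specifying them directly.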

Random numbers

A more interesting example is connected to random number generation. If you are familiar with Python, you have probably used the random.random() function, which gives you a random number between 0 and 1. Although this might seem mysterious, it is pretty simple to describe it with a probability space.

$$\begin{aligned} \Omega &= [0, 1], \\ \Sigma &= \sigma(\{ (a, b]: 0 \leq a, b \leq 1 \}), \\ P((a, b]) &= b - a. \end{aligned}$$

Again, notice that it is enough to give the probabilities on the elements of the generating set. For example, we have

$$\begin{aligned} P((0, 0.2] \cup (0.7, 1]) &= P((0, 0.2]) + P((0.7, 1]) \\ &= 0.5. \end{aligned}$$
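We can sanity-check this value by simulation. The sketch below draws many random numbers and counts how often they land in the event; the seed and the sample size are arbitrary choices made here so that the run is reproducible, and `in_event` is a helper defined just for this example.

```python
import random


def in_event(x):
    """Indicator of the event (0, 0.2] united with (0.7, 1]."""
    return 0 < x <= 0.2 or 0.7 < x <= 1


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
n_samples = 100_000

hits = sum(in_event(rng.random()) for _ in range(n_samples))
estimate = hits / n_samples

print(estimate)  # should land close to the theoretical value 0.5
```

The estimate is a relative frequency, so it will not be exactly 0.5, but it should get closer as the number of samples grows.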

To see a more complicated example, what is $P(\{0.5\})$? How can we calculate the probability of picking 0.5? (Or any other number between zero and one.) For this, we need to rely on the properties of measures. We have

$$\begin{aligned} 0 \leq P(\{0.5\}) &\leq P(\{0.5\}) + P((0.5 - \varepsilon, 0.5)) \\ &= P((0.5 - \varepsilon, 0.5]), \end{aligned}$$

which holds for all $\varepsilon > 0$. Here, we have used the additivity of the probability measure. Since $P((0.5 - \varepsilon, 0.5]) = \varepsilon$, it follows that

$$0 \leq P(\{0.5\}) \leq \varepsilon.$$

Again, since it holds for all $\varepsilon > 0$, this means that the probability is smaller than any positive real number, so it must be zero.

A similar argument works for any $0 \leq x \leq 1$. It might be surprising to see that picking a particular number has zero probability. So, after you have generated the random number and observed the result, know that it had exactly 0 probability of happening. Yet, you still have the result right in front of you. So, events with zero probability can happen.

Distributions and densities

We have gone a long way. Still, working with measures and σ-algebras is not very convenient from a practical standpoint. Fortunately, this is not the only way of working with probabilities.

For simplicity, let's suppose that our base set is the set of real numbers. Specifically, we have the probability space $(\Omega, \Sigma, P)$, where

$$\begin{aligned} \Omega &= \mathbb{R}, \\ \Sigma &= \sigma(\{ (a, b]: a, b \in \mathbb{R} \}), \end{aligned}$$

and $P$ is any probability measure on this space. We have seen before that the probabilities of the events $(a, b]$ determine the probability for the rest of the events in the event space. However, we can compress this information even further. The function

$$F(x) = P((-\infty, x]), \quad x \in \mathbb{R}$$

contains all information we have to know about the probability measure. Think about it: we have

$$\begin{aligned} P((a, b]) &= P((-\infty, b]) - P((-\infty, a]) \\ &= F(b) - F(a) \end{aligned}$$

for all $a \leq b$. The function $F$ is called the distribution function of $P$. For all probability measures, the distribution function satisfies the following properties:

$$\begin{aligned} &1) \quad F(x) \geq 0 \text{ for all } x \in \mathbb{R}, \\ &2) \quad F(x_1) \leq F(x_2) \text{ for all } x_1 \leq x_2, \\ &3) \quad \lim_{x \to -\infty} F(x) = 0 \text{ and } \lim_{x \to \infty} F(x) = 1, \\ &4) \quad \lim_{x \to x_0, \, x > x_0} F(x) = F(x_0). \end{aligned}$$

(The 4th one is called right continuity. It holds because $F(x) = P((-\infty, x])$ includes the endpoint $x$. Don't stress if you are not familiar with the definition of continuity. It is not essential now.)

Again, if this is too abstract, let's consider an example. For the previous example of random number generation, we have

$$F(x) = \begin{cases} 0 & \text{if } x < 0, \\ x & \text{if } 0 \leq x \leq 1, \\ 1 & \text{if } x > 1. \end{cases}$$

This is called the uniform distribution on $[0, 1]$.
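As a quick illustration, the distribution function above translates directly into code, and interval probabilities follow from $P((a, b]) = F(b) - F(a)$. The names `uniform_cdf` and `prob_interval` are made up for this sketch.

```python
def uniform_cdf(x):
    """Distribution function of the uniform distribution on [0, 1]."""
    if x < 0:
        return 0.0
    if x > 1:
        return 1.0
    return float(x)


def prob_interval(a, b):
    """P((a, b]) = F(b) - F(a), assuming a <= b."""
    return uniform_cdf(b) - uniform_cdf(a)


print(prob_interval(0.2, 0.7))    # an interval inside [0, 1]
print(prob_interval(-3.0, 0.25))  # the part below 0 carries no probability
print(prob_interval(0.5, 2.0))    # the part above 1 carries no probability
```

Notice that the intervals sticking out of $[0, 1]$ are handled automatically, because the distribution function is constant outside $[0, 1]$.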

Figure: the uniform distribution.

To summarize, if you give me a probability measure, I'll give you a distribution function describing it. However, this is not the best thing about distribution functions. It is also true that if you give me a function satisfying the properties 1)–4) above, I can construct a probability measure from it. Moreover, if two distribution functions are equal everywhere, then their corresponding probability measures are also identical. So, from a mathematical perspective, distribution functions and probability measures are equivalent. This is extremely useful for us.

Density functions

As we have seen, a distribution function takes all information from a probability measure and essentially compresses it. It is a great tool, but sometimes it is not convenient. For instance, calculating expected values is hard when we only have a distribution function. (Don't worry if you don't know what an expected value is, we won't use it right now.)

For many practical purposes, we describe probability measures with density functions. A function $f: \mathbb{R} \to \mathbb{R}$ is the density function for the probability measure $P$ if

$$P(E) = \int_{E} f(x) dx, \quad E \in \Sigma$$

holds for all $E$ in the σ-algebra $\Sigma$. That is, heuristically, the probability for a given set is determined by the area under the curve of $f(x)$. This definition might seem simple, but there are many details hidden here, which I won't go into. For instance, it is not trivial how to integrate a function over an arbitrary set $E$.

You are probably familiar with the famous Newton-Leibniz rule from calculus. Here, this says

$$\begin{aligned} \int_{a}^{b} f(x)dx &= P((a, b]) \\ &= F(b) - F(a), \end{aligned}$$

which implies that if the distribution function is differentiable, its derivative is the density function.

There are certain probability distributions for which only the density function is known in closed form. (Having a closed form means expressing it with a finite number of standard operations and elementary functions.) One of the most famous distributions is like this: the Gaussian. It is defined by

$$f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{- \frac{(x - \mu)^2}{2 \sigma^2}},$$

where $\mu$ (the mean) and $\sigma$ (the standard deviation) are its parameters.
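Since $P(\mathbb{R}) = 1$, a density function has to integrate to one over the whole real line. Here is a rough numerical check for the Gaussian density, written as a sketch: `gaussian_pdf` and `integrate` are helper names invented for this example, the trapezoidal rule is just one simple way to approximate the integral, and the interval $[-10, 10]$ is an assumption that works for the default parameters because the tails beyond it are negligible.

```python
import math


def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of the Gaussian distribution with parameters mu and sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma**2))


def integrate(f, a, b, n=10_000):
    """Approximate the integral of f over [a, b] with the trapezoidal rule."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return total * h


total_mass = integrate(gaussian_pdf, -10.0, 10.0)
print(total_mass)  # should be very close to 1
```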

Probability density function of the Gaussian distribution. Source: Wikipedia

Probability distribution function of the Gaussian distribution. Source: Wikipedia

However surprising it may seem, we can't express the distribution function of the Gaussian in closed form. It is not that mathematicians just haven't figured it out. It is proven to be impossible. (Proving that something is impossible to do in mathematics is usually extremely hard.)
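Having no closed form does not stop us from computing the distribution function numerically. A common way to write it is through the error function, which is itself not an elementary function but is available in Python as `math.erf`. The sketch below, for the standard choice of $\mu = 0$ and $\sigma = 1$ unless specified otherwise, compares this with a direct numerical integration of the density and checks that the slope of the distribution function matches the density. The helper names, the interval, and the step size are choices made only for this illustration.

```python
import math


def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of the Gaussian distribution."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma**2))


def gaussian_cdf(x, mu=0.0, sigma=1.0):
    """Distribution function of the Gaussian, written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def integrate(f, a, b, n=10_000):
    """Approximate the integral of f over [a, b] with the trapezoidal rule."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return total * h


a, b = -1.0, 2.0

# P((a, b]) computed in two ways: area under the density, and F(b) - F(a).
by_integration = integrate(gaussian_pdf, a, b)
by_cdf = gaussian_cdf(b) - gaussian_cdf(a)
print(by_integration, by_cdf)

# The derivative of the distribution function is the density.
x0, step = 0.5, 1e-5
slope = (gaussian_cdf(x0 + step) - gaussian_cdf(x0 - step)) / (2 * step)
print(slope, gaussian_pdf(x0))
```

Both pairs of numbers should agree up to small numerical errors.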

Where to go from here?

What we have seen so far is only the tip of the iceberg. (Come to think of it, this can be said at the end of every discussion about mathematics.) Here, we have only defined what probability is in a mathematically (semi-)precise way.

The truly interesting stuff, like machine learning, is still before us.

If you would like to start, I wrote a detailed article on how we can formulate machine learning in terms of probability theory. Check it out!
