The Roadmap of Mathematics for Machine Learning
Knowing the mathematics behind machine learning algorithms is a superpower. If you have ever built a model for a real-life problem, you probably experienced that being familiar with the details can go a long way if you want to move beyond baseline performance. This is especially true when you want to push the boundaries of state-of-the-art.
However, most of this knowledge is hidden behind layers of advanced mathematics. Understanding methods like stochastic gradient descent might seem challenging since it is built on top of multivariable calculus and probability theory.
If you are a beginner and don't necessarily have formal education in higher mathematics, creating a curriculum for yourself is hard. With proper foundations, though, most ideas can be seen as quite natural. This post aims to present a roadmap, taking you from absolute zero to a deep understanding of how neural networks work.
To keep things simple, the aim is not to cover everything. Instead, we will focus on getting our directions. This way, you will be able to study other topics without difficulties, if need be.
Instead of reading through it in one sitting, I recommend using this article as a reference point through your studies. Go deep into a concept that is introduced, then check the roadmap and move on. I firmly believe that this is the best way to study: I will show you the road, but you must walk it.
Most of machine learning is built upon three pillars: linear algebra, calculus, and probability theory. Since the last one builds on the first two, we should start with them. Calculus and linear algebra can be studied independently, as is usually the case in a standard curriculum.
Calculus is the study of differentiation and integration of functions. Essentially, a neural network is a differentiable function, so calculus will be a fundamental tool to train neural networks, as we will see.
To familiarize yourself with the concepts, you should make things simple and study functions of a single variable for the first time. By definition, the derivative of a function is defined by
where the ratio for a given h is the slope of the line between the points (x, f(x)) and (x+h, f(x+h)).
This is essentially the slope of the tangent line at the point x in the limit. The figure below illustrates the concept.
We can use differentiation to optimize functions: the derivative is zero at local maxima or minima. (However, this is not true in the other direction; see f(x) = x³ at 0.) Points where the derivative is zero are called critical points. Whether a critical point is minima or maxima can be decided by looking at the second derivative:
There are several essential rules regarding differentiation, but probably the most important is the so-called chain rule:
that tells us how to calculate the derivative of composed functions.
Integration is often called the inverse of differentiation. This is true because
which holds for any integrable function f(x). The integral of a function can also be thought of as the signed area under the curve. For instance,
because when the function is negative, the area there also has a negative sign.
Integration itself plays a role in understanding the concept of expected value. For instance, quantities like entropy and Kullback-Leibler divergence are defined in terms of integrals.
I would recommend the Single Variable Calculus course from MIT. (In general, online courses from MIT are always excellent learning resources.) If you are more of a book person, there are many textbooks available. The Calculus book by Gilbert Strang accompanying the previously mentioned course is again a great resource, completely free of charge.
As I mentioned, neural networks are essentially functions, which are trained using the tools of calculus. However, they are described with linear algebraic concepts like matrix multiplication.
Linear algebra is a vast subject with many essential aspects of machine learning, so this will be a significant segment.
To have a good understanding of linear algebra, I would suggest starting with vector spaces. It is better if we talk about a special case first. You can think of each point in the plane as a tuple
These are essentially vectors pointing from zero to (x₁, x₂). You can add these vectors together and multiply them with scalars:
This is the prototypical model of a vector space. In general, a set of vectors V is a vector space over the real numbers if you can add vectors together and multiply a vector with a real number, such that the following properties hold:
Don't panic! I know that this looks very scary (at least it looked that to me when I was a freshman student with a math major), but it isn't. These properties guarantee that vectors can be added together and scaled just as you would expect. When thinking about vector spaces, it helps if you mentally model them as the Euclidean plane.
If you feel like you have a good understanding of vector spaces, the next step would be to understand how to measure a vector's magnitude. By default, a vector space in itself gives no tools for this. How would you do this on the plane? You have probably learned that there, we have
This is a special case of a norm. In general, a vector space V is normed if there is a function
called norm such that
Again, this might be scary, but this is a simple and essential concept. There are a bunch of norms out there, but the most important is the p-norm family
(with p = 2 we obtain the special case mentioned above) and the supremum norm
Sometimes, like for p = 2, the norm comes from a so-called inner product, which is a bilinear function
A vector space with an inner product is called inner product space. An example is the classical Euclidean product
Every inner product can be turned into a norm by
When the inner product for two vectors is zero, we say that the vectors are orthogonal to each other. (Try to come up with some concrete examples on the plane to understand the concept deeper.)
Basis and orthogonal/orthonormal basis
Although vector spaces are infinite (in our case), you can find a finite set of vectors that can be used to express all vectors in the space. For example, on the plane, we have
This is a special case of basis and orthonormal basis. In general, a basis is a minimal set of vectors
such that their linear combinations span the vector space:
A basis always exists for any vector space. (It may not be a finite set, but that shouldn't concern us now.) Without a doubt, a basis simplifies things greatly when talking about linear spaces.
When the vectors in a basis are orthogonal to each other, we call it an orthogonal basis. If each basis vector's norm is 1 for an orthogonal basis, we say it is orthonormal.
The key objects related to vector spaces are linear transformations. If you have seen a neural network before, you know that one of the fundamental building blocks are layers of the form
where A is a matrix, b and x are vectors, and σ is the sigmoid function. (Or any activation function, really.) Well, the part Ax is a linear transformation. In general, the function
is a linear transformation between vector spaces V and W if
holds for all x, y in V, and all real number a.
To give a concrete example, rotations around the origin in the plane are linear transformations. Undoubtedly, the most crucial fact about linear transformations is that they can be represented with matrices, as you'll see next in your studies.
Matrices and their operations
If linear transformations are clear, you can turn to the study of matrices. (Linear algebra courses often start with matrices, but I would recommend it this way for reasons to be explained later.)
The most important operation for matrices is the matrix product. In general, if A and B are matrices defined by
then their product can be obtained by
This might seem difficult to comprehend, but it is pretty straightforward. Take a look at the figure below, demonstrating how to calculate the element in the 2nd row, 1 st column of the product.
The reason why matrix multiplication is defined the way it is because matrices represent linear transformations between vector spaces. Matrix multiplication is the composition of linear transformations.
In my opinion, determinants are hands down one of the most challenging concepts to grasp in linear algebra. Depending on your learning resource, it is usually defined by either a recursive definition or a sum that iterates through all permutations. None of them are tractable without significant experience in mathematics.
To understand this concept, watch this video. Trust me, it is magic.
To summarize, the determinant of a matrix describes how the volume of an object scales under the corresponding linear transformation. If the transformation changes orientations, the sign of the determinant is negative. You will eventually need to understand how to calculate the determinant, but I wouldn't worry about it now.
Eigenvalues, eigenvectors, and matrix decompositions
A standard first linear algebra course usually ends with eigenvalues/eigenvectors and some special matrix decompositions like the Singular Value Decomposition.
Let's suppose that we have a matrix A. The number λ is an eigenvalue of A if there is a vector x (called eigenvector) such that
holds. In other words, the linear transformation represented by A is a scaling by λ for the vector x. This concept plays an essential role in linear algebra. (And practically in every field which uses linear algebra extensively.)
At this point, you are ready to familiarize yourself with a few matrix decompositions. If you think about it for a second, what type of matrices are the best from a computational perspective? Diagonal matrices! If a linear transformation has a diagonal matrix, it is trivial to compute its value on an arbitrary vector.
Most special forms aim to decompose a matrix A to a product of matrices, where preferably at least one is diagonal. Singular Value Decomposition, or SVD in short, the most famous one, states that there are special matrices U, V, and a diagonal matrix Σ such that
holds. (U and V are so-called unitary matrices, which I don't define here, suffice to know that it is a special family of matrices.)
SVD is also used to perform Principal Component Analysis, one of the simplest and most well-known dimensionality reduction methods.
Linear algebra can be taught in many ways. The path I outlined here was inspired by the textbook Linear Algebra Done Right by Sheldon Axler. For an online lecture, I would recommend the Linear Algebra course from MIT OpenCourseWare, an excellent resource.
This is the part where linear algebra and calculus come together, laying the foundations for the primary tool to train neural networks: gradient descent. Mathematically speaking, a neural network is simply a function of multiple variables. (Although, the number of variables can be in the millions.)
Similar to univariable calculus, the two main topics here are differentiation and integration. Let's suppose that we have a function
mapping vectors to real numbers. In two dimensions (that is, for n = 2), you can imagine its plot as a surface. (Since humans don't see higher than three dimensions, it is hard to visualize functions with more than two real variables.)
Differentiation in multiple variables
In a single variable, the derivative was the slope of the tangent line. How would you define tangents here? A point on the surface has several tangents, not just one. However, there are two special tangents: one is parallel to the x-z plane, while the other is to the y-z plane. Their slope is determined by the partial derivatives, defined by
That is, you take the derivative of the functions obtained by fixing all but one variable. (The formal definition is identical for ≥ 3 variables, just more complicated notation.) Tangents in these special directions span the tangent plane.
There is another special direction: the gradient, which is the vector defined by
The gradient always points to the direction of the largest increase! So, if you would take a tiny step in this direction, your elevation would be the maximal among all the other paths you could have chosen. This is the basic idea of gradient descent, which is an algorithm to maximize functions. Its steps are the following.
- Calculate the gradient in the point x₀, where you currently are.
- Take a small step in the direction of the gradient to arrive at the point x₁. (The step size is called the learning rate.)
- Go back to Step 1. and repeat the process until convergence.
Of course, there are several flaws in this basic algorithm, which was improved several times over the years. Modern gradient descent-based optimizers employ many tricks like adaptive step size, momentum, and other methods, which we will not detail here.
Calculating the gradient in practice is difficult. Functions are often described by the composition of other functions, for instance, the familiar linear layer
where A is a matrix, b and x are vectors, and σ is the sigmoid function. (Of course, there can be are other activations, but we shall stick to this for simplicity.) How would you calculate this gradient? At this point, it is unclear how to define the gradient for vector-vector functions such as this, so let's discuss! A function
can always be written in terms of vector-scalar functions like
The gradient of g is defined by the matrix whose k-th row is the k-th component's gradient. That is,
This matrix is called the total derivative of g.
In our example
things become a bit more complicated because it is the composition of two functions:
defined by applying the univariate sigmoid componentwise. We can decompose the function l further to m functions mapping from the n-dimensional vector space to the space of real numbers:
If you calculate the total derivative, you'll see that
This is the chain rule for multivariate functions in its full generality. Without it, there would be no easy way to calculate the gradient of a neural network, which is ultimately a composition of many functions.
Similar to the univariate case, the gradient and the derivatives play a role in determining whether a given point in space is local minima or maxima. (Or neither.) To provide a concrete example, training a neural network is equivalent to minimizing the loss function on the parameters' training data. It is all about finding the optimal parameter configuration w for which the minimum is attained:
are the neural network and the loss function, respectively. For a general differentiable vector-scalar function of say n variables, there are n² second derivatives, forming the Hessian matrix
In multiple variables, the determinant of the Hessian takes the role of the second derivative. Similarly, we can use it to tell whether a critical point (that is, where all of the derivatives are zero) is minima, maxima, or just a saddle point.
There are a lot of awesome online courses on multivariable calculus. I have two specific recommendations:
Now we are ready to take on the final subject: probability theory!
Probability theory is the mathematically rigorous study of chance, fundamental to all fields of science.
Setting exact definitions aside, let's ponder a bit about what probability represents. Let's say I toss a coin, with a 50% chance (or 0.5 probability) of it being heads. After repeating the experiment 10 times, how many heads did I get?
If you have answered 5, you are wrong. Heads being 0.5 probability doesn't guarantee that every second throw is heads. Instead, it means that if you repeat the experiment n times where n is a very large number, the number of heads will be very close to n/2.
Besides the basics, there are some advanced things you need to understand, first and foremost, expected value and entropy.
Suppose that you play a game with your friend. You toss a classical six-sided dice, and if the outcome is 1 or 2, you win 300 bucks. Otherwise, you lose 200. What are your average earnings per round if you play this game long enough? Should you be even playing this game?
Well, you win 100 bucks with probability 1/3, and you lose 200 with probability 2/3. That is, if X is the random variable encoding the result of the dice throw, then
This is the expected value, that is, the average amount of money you will receive per round in the long run. Since this is negative, you will lose money, so you should never play this game.
Generally speaking, the expected value is defined by
for discrete random variables and
for real-valued continuous random variables.
In machine learning, loss functions for training neural networks are expected values in one way or another.
Law of large numbers
People often falsely attribute certain phenomena to the law of large numbers. For instance, gamblers on a losing streak believe that they should soon win because of the law of large numbers. This is wrong. Let's see what this is really!
are random variables representing the independent repetitions of the same experiment. (Say, rolling a dice or tossing a coin.)
Essentially, the law of large numbers states that
That is, the average of the outcomes in the long run is equal to the expected value.
An interpretation is that if a random event is repeated enough times, individual results might not matter. So, if you are playing in a casino with a game that has negative expected value (as they all do), it doesn't matter that you win occasionally. The law of large numbers implies that you will lose money.
To get a little bit ahead, LLN is going to be essential for stochastic gradient descent.
Let's play a game. I have thought of a number between 1 and 1024, and you have to guess it. You can ask questions, but your goal is to use as few questions as possible. How much do you need?
If you play it smart, you will perform a binary search with your questions. First, you may ask: is the number between 1 and 512? With this, you have cut the search space in half. Using this strategy, you can figure out the answer in
But what if I didn't use the uniform distribution when picking the number? For instance, I could have used a Poisson distribution. (Illustrated below, source: Wikipedia.)
Here, you would probably need fewer questions because you know that the distribution tends to concentrate around specific points. (Which depends on the parameter.)
In the extreme case, when the distribution is concentrated on a single number, you need zero questions to guess it correctly. Generally, the number of questions depends on the information carried by the distribution. The uniform distribution contains the least amount of information, while singular ones are pure information.
Entropy is a way to quantify this. It is defined by
for discrete random variables and
for continuous, real-valued ones. (The logarithm base is usually 2, e, or 10, but it doesn't really matter.)
If you have ever worked with classification models before, you probably encountered the cross-entropy loss, defined by
where P is the ground truth (a distribution concentrated to a single class), while the hatted version represents the class predictions. This measures how much "information" the predictions have compared to the ground truth. When the predictions match, the cross-entropy loss is zero.
Another frequently used quantity is the Kullback-Leibler divergence, defined by
where P and Q are two probability distributions. This is essentially cross-entropy minus the entropy, which can be thought of as quantifying how different the two distributions are. This quantity is useful, for instance, when training generative adversarial networks. Minimizing the Kullback-Leibler divergence guarantees that the two distributions are similar.
Here, I would recommend two books for you:
- Pattern Recognition and Machine Learning by Christopher Bishop,
- The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
These are the two fundamental textbooks, and they teach you much more than probability theory. Both of them go way beyond the basics, but the corresponding chapters provide an excellent introduction.
Beyond the mathematical foundations
With this, we reviewed the necessary mathematics for understanding neural networks. Now, you are ready for the fun part: machine learning! To understand how neural networks work, you still have to learn some optimization and mathematical statistics. These subjects build upon the foundations we set. I won't go into details since this is beyond the scope of the article, but I have prepared a study roadmap to guide you.