What You Should Know About Differentiation

Tivadar Danka

Understanding math will make you a better engineer.

So, I am writing the best and most comprehensive book about it.

In the history of science, few milestones are as significant as inventing the wheel. Even among these, differentiation is a highlight: with calculus, Newton essentially created modern mechanics as we know it. Differentiation underpins space travel, function optimization, and even epidemiological models. In machine learning, derivatives are key for training deep neural networks.

However, its importance is not obvious from the mathematical definition: if $f: \mathbb{R} \to \mathbb{R}$ is an arbitrary function, it is said to be differentiable at $x$ if the limit

$$f^\prime(x) = \lim_{y \to x} \frac{f(x) - f(y)}{x - y}$$

exists, in which case the limit is called the derivative of $f$ at $x$. This definition is well known but loaded with concepts that are often left unexplained. In this post, our goal is to understand what the derivative really is, how to extend it to multiple variables, and how it allows us to build models of the world around us. Let's get to it!

Differentiation as the rate of change

Instead of jumping straight back into the mathematical definition, let's start our discussion with a straightforward example: a point-like object moving along a straight line. The straight line can be modelled with the real numbers $\mathbb{R}$, so it makes sense to describe the motion of our object with the function $f: \mathbb{R} \to \mathbb{R}$, mapping a point in time $t$ to a position $f(t)$. Something like the graph below.

The time-distance graph of a moving object

Our goal is to calculate the object's speed at a given time. In high school, we learned that

$$\text{average speed} = \frac{\text{distance}}{\text{time}}.$$

To put this into a quantitative form, if $t_1 < t_2$ are two arbitrary points in time, then

$$\text{average speed between } t_1 \text{ and } t_2 = \frac{f(t_2) - f(t_1)}{t_2 - t_1}.$$

Expressions like $\frac{f(t_2) - f(t_1)}{t_2 - t_1}$ are called difference quotients. Note that if the object moves backwards, the average speed is negative.
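To make the difference quotient concrete, here is a minimal Python sketch. The position functions `forward` and `backward` are made up purely for illustration; any function of time would do.

```python
def average_speed(f, t1, t2):
    """Difference quotient of f between t1 and t2 (requires t1 != t2)."""
    return (f(t2) - f(t1)) / (t2 - t1)


# Hypothetical position functions, chosen only for illustration.
def forward(t):
    return t ** 2      # position increases for t > 0


def backward(t):
    return -3.0 * t    # position decreases at a constant rate


print(average_speed(forward, 1.0, 3.0))   # (9 - 1) / (3 - 1) = 4.0
print(average_speed(backward, 0.0, 2.0))  # (-6 - 0) / (2 - 0) = -3.0
```

The second call shows the sign convention from the text: an object moving backwards has a negative average speed.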

The average speed has a simple geometric interpretation. If you replace the object's motion with a constant-velocity motion at its average speed, you'll end up at the same place. In graphical terms, this is equivalent to connecting $(t_1, f(t_1))$ and $(t_2, f(t_2))$ with a single line. The average speed is just the slope of this line, as you can see below.

Average speed of a moving object, visualized in the time-distance plot

Given this, we can calculate the exact speed at a single time point $t$, which we'll denote with $v(t)$. ($v$ is short for velocity.) The idea is simple: the average speed in the small time interval between $t$ and $t + \Delta t$ should get closer and closer to $v(t)$ as $\Delta t$ shrinks to zero. ($\Delta t$ can be negative as well.) That is,

$$v(t) = \lim_{\Delta t \to 0} \frac{f(t + \Delta t) - f(t)}{\Delta t},$$

if the above limit exists.
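We can watch this limit at work numerically. The sketch below assumes a made-up position function, `position(t) = sin(t)`, whose exact speed at time `t` is known to be `cos(t)`, and evaluates the average speed for smaller and smaller positive and negative values of `dt`.

```python
import math


def position(t):
    # Hypothetical position function, used only for illustration.
    return math.sin(t)


def average_speed(f, t, dt):
    """Average speed of f over the interval between t and t + dt (dt != 0)."""
    return (f(t + dt) - f(t)) / dt


t = 1.0
exact = math.cos(t)  # known derivative of sin at t

for dt in (0.1, 0.01, 0.001, -0.001, -0.01, -0.1):
    approx = average_speed(position, t, dt)
    error = abs(approx - exact)
    print(f"dt = {dt:7.3f}   average speed = {approx:.6f}   error = {error:.6f}")
```

The error shrinks roughly in proportion to the size of `dt`, from either side, which is what the existence of the limit predicts.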

Following our geometric intuition, we can notice that $v(t)$ is simply the slope of the tangent line of $f$ at $t$. This can be beautifully illustrated when we visualize the ratio $\frac{f(t + \Delta t) - f(t)}{\Delta t}$ for a few values of $\Delta t$.

The derivative as the tangent line

Keeping this in mind, we are ready to introduce the formal definition. (The one that we mentioned earlier, but it actually makes sense this time.)

Definition. (Differentiability.) Let $f: \mathbb{R} \to \mathbb{R}$ be an arbitrary function. We say that $f$ is differentiable at $x_0 \in \mathbb{R}$ if the limit

$$\frac{df}{dx}(x_0) = \lim_{x \to x_0} \frac{f(x_0) - f(x)}{x_0 - x}$$

exists. If so, $\frac{df}{dx}(x_0)$ is called the derivative of $f$ at $x_0$.

In plain English, if $f$ describes a time-distance function of a moving object, then the derivative is simply its speed. In other words, the derivative quantifies the rate of change. Note that differentiability is a property of the function $f$ and the point $x_0$ together. As we shall see later, some functions are differentiable at some points but not differentiable at others.

Don't let the change in notation from $t$ and $t + \Delta t$ to $x_0$ and $x$ confuse you: this means exactly the same as before. Speaking of confusion, the multiple notations for differentiation can sometimes be difficult to interpret. For instance, $x$ can denote both the variable of $f$ and the exact point where the derivative is taken. To clear this up, here is a quick glossary clarifying the difference between the derivative and the derivative function.

  • $\frac{df}{dx}(x_0)$: derivative of $f$ with respect to the variable $x$ at the point $x_0$. This is a scalar, also denoted with $f^\prime(x_0)$.
  • $\frac{df}{dx}$: derivative function of $f$ with respect to the variable $x$. This is a function, also denoted with $f^\prime$.
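The scalar-versus-function distinction maps nicely onto code. In the sketch below, `derivative_at` returns a number while `derivative_function` returns a new function. Both names are hypothetical helpers, and they approximate the derivative with a central difference quotient with a small fixed step `h`, which is a numerical stand-in for the limit rather than the definition itself.

```python
def derivative_at(f, x0, h=1e-6):
    """Approximate df/dx(x0): a single number (central difference quotient)."""
    return (f(x0 + h) - f(x0 - h)) / (2 * h)


def derivative_function(f, h=1e-6):
    """Approximate df/dx: a new function that maps each x to the slope at x."""
    def df(x):
        return derivative_at(f, x, h)
    return df


def square(x):
    return x ** 2


slope = derivative_at(square, 3.0)      # a scalar, close to 6
d_square = derivative_function(square)  # a function, close to x -> 2x

print(slope)
print(d_square(-1.0), d_square(0.5))
```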

Differentiability is smoothness

Now that we understand how derivatives express the rate of change, we'll look at things from a more abstract viewpoint: what does differentiability mean? Mind you, I am not talking about the value of the derivative itself but the fact that it exists. To make my point clear, let's consider two examples.

Example 1. $f(x) = x^2$. Here, we have

$$\begin{aligned} \lim_{y \to x} \frac{f(x) - f(y)}{x - y} &= \lim_{y \to x} \frac{x^2 - y^2}{x - y} \\ &= \lim_{y \to x} \frac{(x - y)(x + y)}{x - y} \\ &= \lim_{y \to x} (x + y) \\ &= 2x. \end{aligned}$$

So, $f(x) = x^2$ is differentiable everywhere and $f^\prime(x) = 2x$. No surprise here. If you are a visual person, this is how the tangents look.

Derivatives of the square function
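If you prefer numbers to pictures, the cancellation in Example 1 is easy to check. The sketch below evaluates the difference quotient of the square function at the arbitrarily chosen point `x = 1.5` and compares it with `x + y` and with the limit `2x`.

```python
import math


def quotient(x, y):
    """Difference quotient of f(x) = x**2 between x and y (requires x != y)."""
    return (x ** 2 - y ** 2) / (x - y)


x = 1.5
for y in (2.5, 1.6, 1.51, 1.501):
    q = quotient(x, y)
    # After cancelling (x - y), the quotient is x + y, which tends to 2x.
    print(f"y = {y:<6}  quotient = {q:.6f}  x + y = {x + y:.6f}  2x = {2 * x}")
```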

The graph of the $x^2$ function is smooth everywhere. However, this is not always the case, leading us to the second example.

Example 2. $f(x) = |x|$ at $x = 0$. For this, we have

$$\lim_{y \to 0} \frac{f(0) - f(y)}{0 - y} = \lim_{y \to 0} \frac{|y|}{y}.$$

Since

$$\frac{|y|}{y} = \begin{cases} 1 & \text{if } y > 0, \\ -1 & \text{if } y < 0, \end{cases}$$

this limit does not exist. Thus, $|x|$ is not differentiable at $0$.
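A quick numerical experiment shows the same thing. In this small sketch, the difference quotients of the absolute value function at zero are computed for points approaching zero from the right and from the left.

```python
def quotient_at_zero(y):
    """Difference quotient of f(x) = |x| between 0 and y (requires y != 0)."""
    return (abs(0.0) - abs(y)) / (0.0 - y)


from_right = [quotient_at_zero(y) for y in (0.1, 0.01, 0.001)]
from_left = [quotient_at_zero(y) for y in (-0.1, -0.01, -0.001)]

print(from_right)  # every entry is 1.0
print(from_left)   # every entry is -1.0
```

No matter how close we get to zero, the two sides never agree, so there is no single value for the quotients to converge to.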

It is worth drawing a picture here to enhance our understanding of differentiability. Recall that the value of the derivative at a given point equals the slope of the tangent line to the function's graph. Since $|x|$ has a sharp corner at $0$, the tangent line is not well-defined, as multiple possibilities exist.

The non-differentiability of the absolute value function

In other words, differentiability means no sharp corners in the graph. This is why differentiable functions are often called smooth.

From this perspective, differentiability means manageable behavior: no wrinkles, corners, or sharp changes in value. Next, we'll see an equivalent definition of differentiability involving local approximation with a linear function.

Differentiation as the best local linear approximation

Do you recall how we introduced the definition of the derivative? Essentially, we approximated the dynamics of a moving point-like object with a constant velocity motion on smaller and smaller time intervals, eventually shrinking down the gap to zero. From the perspective of mechanics, differentiation is the same as swapping the motion with a constant velocity one in a given instant.

We can make this idea mathematically precise with the following theorem. (Yes, a theorem. Don't be scared. Theorems and proofs are just crystallized forms of logically correct statements.)

Theorem. (Differentiation as a local linear approximation.) Let $f: \mathbb{R} \to \mathbb{R}$ be an arbitrary function. The following are equivalent.

(a) $f$ is differentiable at $x_0$.

(b) There is an $\alpha \in \mathbb{R}$ such that

$$f(x) = f(x_0) + \alpha (x - x_0) + o(|x - x_0|)$$

holds as $x \to x_0$.

(Recall that the little-o notation means that the remainder term is an order of magnitude smaller around $x_0$ than the function $|x - x_0|$. That is,

$$\lim_{x \to x_0} \frac{o(|x - x_0|)}{|x - x_0|} = 0.$$

The $\alpha$ in the above theorem is going to be the derivative $f^\prime(x_0)$. In other words, $f(x)$ is locally approximated with the linear function $f(x_0) + f^\prime(x_0) (x - x_0)$.)
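Before the proof, here is a quick numerical sanity check of statement (b), using the square function from Example 1 at the arbitrarily chosen point `x0 = 1.5`. The helper `relative_error` is a hypothetical name. It divides the approximation error by the distance from `x0`, so a result that tends to zero is exactly the little-o condition.

```python
def f(x):
    # Example function from earlier in the post; its derivative is 2x.
    return x ** 2


x0 = 1.5


def relative_error(h, slope):
    """|f(x0 + h) - (f(x0) + slope * h)| divided by |h| (requires h != 0)."""
    linear = f(x0) + slope * h
    return abs(f(x0 + h) - linear) / abs(h)


for h in (0.1, 0.01, 0.001):
    right = relative_error(h, 2 * x0)  # slope equal to the derivative
    wrong = relative_error(h, 2.5)     # some other slope
    print(f"h = {h:<6}  derivative slope: {right:.6f}   other slope: {wrong:.6f}")
```

With the derivative as the slope, the relative error goes to zero. With any other slope, it settles at a nonzero value, which is why the tangent is called the best local linear approximation.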

Proof. To show the equivalence of two statements, we have to prove that differentiation implies the desired property and vice versa. Although this might seem complicated, it is straightforward and entirely depends on how functions can be written as their limit plus an error term.

(a) $\implies$ (b). The existence of the limit

$$\lim_{x \to x_0} \frac{f(x) - f(x_0)}{x - x_0} = f^\prime(x_0)$$

implies that we can write the slope of the approximating tangent in the form

$$\frac{f(x) - f(x_0)}{x - x_0} = f^\prime(x_0) + \mathrm{error}(x),$$

where $\lim_{x \to x_0} \mathrm{error}(x) = 0$. With some simple algebra, we obtain

$$f(x) = f(x_0) + f^\prime(x_0)(x - x_0) + \mathrm{error}(x)(x - x_0).$$

Since the error term tends to zero as $x$ goes to $x_0$, we have $\mathrm{error}(x)(x - x_0) = o(|x - x_0|)$, which is what we wanted to show.

(b) $\implies$ (a). Now, repeat what we did in the previous part, just in reverse order. We can rewrite

$$f(x) = f(x_0) + \alpha (x - x_0) + o(|x - x_0|)$$

in the form

$$\frac{f(x) - f(x_0)}{x - x_0} = \alpha + o(1),$$

which, according to what we have used before, implies that

$$\lim_{x \to x_0} \frac{f(x) - f(x_0)}{x - x_0} = \alpha.$$

So, $f$ is differentiable at $x_0$ and its derivative is $f^\prime(x_0) = \alpha$. $\square$

Notice that in the variable $x$, the expression $f(x_0) + f^\prime(x_0) (x - x_0)$ defines a linear function. In fact, this is the equation of the tangent line! The expression

$$f(x) = f(x_0) + \alpha (x - x_0) + o(|x - x_0|)$$

tells us that around $x_0$, $f$ equals a linear function plus a small error. You might ask, why is this good for us? For one, this form will work in higher dimensions, as opposed to the limit of difference quotients. Let's take a look!

Derivatives of multivariable functions

For a single variable function, we defined the derivative as the limit of difference quotients

$$\lim_{y \to x} \frac{f(x) - f(y)}{x - y},$$

where $x$ and $y$ are real numbers. For a multivariable function $f: \mathbb{R}^n \to \mathbb{R}$, the difference quotients are not defined. Why? Because division by the vector $x - y$ doesn't make sense.

To see what we can do here, let's build our intuition using functions of two variables. (That is, those that are defined on the plane.) In this case, the graph is a surface. For example,

$$f(x, y) = \cos(3x + 2y) + \cos(2x + 4y) - 2\sin(x + y)$$

looks like this below.

Surface of a function of two variables

We immediately see that the concept of the tangent line is not well defined, since there are many tangent lines at a given point on the surface. In fact, they fill a whole plane, which is called the tangent plane. (More on it later.)

The tangent plane

However, this tangent plane contains two special directions. Suppose we are looking at the tangent plane at $(0, 0)$. For every multivariable function, fixing all but one variable gives a function of a single variable. In our case, we would have

$$f(x, 0) = \cos(3x) + \cos(2x) - 2\sin(x)$$

and

$$f(0, y) = \cos(2y) + \cos(4y) - 2\sin(y).$$

We can visualize these functions by slicing the surface with a vertical plane perpendicular to one of the axes. Where the plane and the surface meet is the graph of $f(x, 0)$ or $f(0, y)$, depending on which plane you use. This is how it looks.

Direction of the partial derivative

For these functions, we can define the derivatives just as we did in the previous section. These are called partial derivatives, and they play an essential role in generalizing our peak finding algorithm. Formally, they are defined by

$$\begin{aligned} \frac{\partial f(x, y)}{\partial x} &= f_x(x, y) = \lim_{x_0 \to x} \frac{f(x, y) - f(x_0, y)}{x - x_0}, \\ \frac{\partial f(x, y)}{\partial y} &= f_y(x, y) = \lim_{y_0 \to y} \frac{f(x, y) - f(x, y_0)}{y - y_0}. \end{aligned}$$

The values of the partial derivatives are the slopes of the tangent plane in the directions parallel to the $x$ and $y$ axes. The direction of the steepest ascent is given by the gradient, defined by

$$\nabla f(x, y) = \bigg( \frac{\partial f(x, y)}{\partial x}, \frac{\partial f(x, y)}{\partial y} \bigg).$$

(If you are familiar with the famous gradient descent optimization algorithm, this is why the gradient determines the direction of the step.)
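To make partial derivatives and the gradient tangible, here is a small sketch for the example surface above. It varies one variable at a time with a central difference quotient (a numerical approximation with an assumed step `h`, not the limit itself) and compares the result with partial derivatives worked out by hand using the chain rule. The function names are hypothetical.

```python
import math


def f(x, y):
    # The example surface from the post.
    return math.cos(3 * x + 2 * y) + math.cos(2 * x + 4 * y) - 2 * math.sin(x + y)


def numerical_gradient(func, x, y, h=1e-6):
    """Approximate (df/dx, df/dy) by varying one variable at a time."""
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)  # y held fixed
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)  # x held fixed
    return df_dx, df_dy


def analytic_gradient(x, y):
    """Partial derivatives of f worked out by hand with the chain rule."""
    df_dx = -3 * math.sin(3 * x + 2 * y) - 2 * math.sin(2 * x + 4 * y) - 2 * math.cos(x + y)
    df_dy = -2 * math.sin(3 * x + 2 * y) - 4 * math.sin(2 * x + 4 * y) - 2 * math.cos(x + y)
    return df_dx, df_dy


print(numerical_gradient(f, 0.0, 0.0))
print(analytic_gradient(0.0, 0.0))  # (-2.0, -2.0)
```

At the origin both partial derivatives are negative, so the surface slopes downward along the positive direction of both axes there.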

Differentiation as a local linear approximation, revisited

So, instead of having a single derivative, we have one for each variable. Can we find a pattern that meaningfully relates all of these partial derivatives to each other? Yes, and this is where the already familiar linear approximations come into the picture. Recall that for a differentiable univariate function $f: \mathbb{R} \to \mathbb{R}$, we have

$$f(x) = f(x_0) + \alpha (x - x_0) + o(|x - x_0|),$$

and this is going to be the key to defining the analogue of differentiability.

Definition. (Differentiability in multiple variables.) Let $f: \mathbb{R}^n \to \mathbb{R}$ be an arbitrary multivariable function. $f$ is differentiable at $x_0 \in \mathbb{R}^n$ if there exists a $\nabla f(x_0) \in \mathbb{R}^n$ such that

$$f(x) = f(x_0) + \nabla f(x_0) \cdot (x - x_0) + o(| x - x_0 |)$$

holds, where $x \cdot y$ denotes the dot product of the vectors $x, y \in \mathbb{R}^n$. $\nabla f(x_0)$ is called the gradient of $f$ at $x_0$.
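As a sanity check of this definition, the sketch below takes the two-variable example surface from earlier at the origin, where the gradient works out by hand to `(-2, -2)`. It measures the error of the linear approximation relative to the Euclidean distance from the origin. The approach direction `(h, 0.5 * h)` is an arbitrary choice for illustration.

```python
import math


def f(x, y):
    # The example surface from the post.
    return math.cos(3 * x + 2 * y) + math.cos(2 * x + 4 * y) - 2 * math.sin(x + y)


x0, y0 = 0.0, 0.0
grad = (-2.0, -2.0)  # gradient of f at (0, 0), computed by hand


def relative_error(dx, dy):
    """|f(x) - linear approximation| / |x - x0| for the displacement (dx, dy)."""
    linear = f(x0, y0) + grad[0] * dx + grad[1] * dy  # dot product written out
    return abs(f(x0 + dx, y0 + dy) - linear) / math.hypot(dx, dy)


for h in (0.1, 0.01, 0.001):
    print(f"h = {h:<6}  relative error = {relative_error(h, 0.5 * h):.6f}")
```

The relative error shrinks as the point approaches the origin, which is the little-o condition in the definition, seen along one particular direction.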

This example shows the importance of looking at mathematical objects from several different directions. Sometimes, an alternate viewpoint can help to extend the scope of definitions significantly. Differentiation and the best linear approximation are a case in point.
