How the dot product measures similarity

Tivadar Danka

The dot product is one of the most fundamental concepts in machine learning, making appearances almost everywhere. In introductory linear algebra classes, we learn that in the vector space $\mathbb{R}^n$, it is defined by the formula

$$\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i, \quad x = (x_1, \dots, x_n), \quad y = (y_1, \dots, y_n).$$
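As a quick sanity check, here is a minimal NumPy sketch that computes the sum above by hand and compares it with the built-in dot product. The vectors `x` and `y` are made-up examples, not anything from the text.

```python
import numpy as np

# Two made-up feature vectors in R^3.
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, -1.0, 0.5])

# The dot product, computed directly from the definition above.
manual = sum(x_i * y_i for x_i, y_i in zip(x, y))

# NumPy's built-in dot product gives the same number.
print(manual, np.dot(x, y))  # 3.5 3.5
```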

One of its most important applications is to measure similarity between feature vectors.

But how are similarity and inner product related? The definition doesn't reveal much. In this post, our goal is to unravel the dot product and provide a simple geometric explanation!

The fundamental properties of the dot product

To see what the dot product has to do with similarity, we need three key observations. First, the dot product is linear in both variables. This property is called bilinearity:

$$\begin{aligned} \langle ax + by, z \rangle &= a \langle x, z \rangle + b \langle y, z \rangle, \\ \langle x, ay + bz \rangle &= a \langle x, y \rangle + b \langle x, z \rangle, \end{aligned} \qquad x, y, z \in \mathbb{R}^n, \quad a, b \in \mathbb{R}.$$

Second, the dot product of orthogonal vectors is zero.

The definition of orthogonality

Third, the dot product of a vector with itself equals the square of its magnitude:

$$\langle x, x \rangle = \sum_{i=1}^{n} x_i^2 = | x |^2.$$
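Before moving on, here is a small numerical check of the three properties, again with made-up vectors and coefficients; it is only a sketch, not part of the argument itself.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, -1.0, 0.5])
z = np.array([-2.0, 0.0, 1.0])
a, b = 2.0, -3.0

# Bilinearity in the first argument: <ax + by, z> = a<x, z> + b<y, z>.
print(np.allclose(np.dot(a * x + b * y, z),
                  a * np.dot(x, z) + b * np.dot(y, z)))  # True

# Orthogonal vectors have zero dot product.
e1, e2 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
print(np.dot(e1, e2))  # 0.0

# <x, x> equals the squared magnitude of x.
print(np.allclose(np.dot(x, x), np.linalg.norm(x) ** 2))  # True
```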

Armed with these, we are ready to explore how similarity is measured!

Dot product as similarity

Suppose that we have two vectors, $x$ and $y$. To see the geometric interpretation of their dot product, we first note that $x$ can be decomposed into the sum of two components: one, denoted by $x_y$, is parallel to $y$, while the other is orthogonal to it.

Decomposition of vectors into orthogonal and parallel components

So, the dot product $\langle x, y \rangle$ equals $\langle x_y, y \rangle$, since the orthogonal component's dot product with $y$ vanishes. If we write $x_y$ as $x_y = ry$, a scalar multiple of $y$, we can simplify the dot product:

$$\begin{aligned} \langle x, y \rangle &= \langle x_y, y \rangle \\ &= \langle ry, y \rangle \\ &= r \langle y, y \rangle \\ &= r | y |^2. \end{aligned}$$
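To make the decomposition concrete, here is a hedged sketch. The scaling factor is computed as $r = \langle x, y \rangle / \langle y, y \rangle$, which simply rearranges the identity above; the vectors are again made-up examples.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, -1.0, 0.5])

# Scaling factor of the parallel component, from <x, y> = r |y|^2.
r = np.dot(x, y) / np.dot(y, y)

x_parallel = r * y        # component of x along y
x_orth = x - x_parallel   # the remaining, orthogonal component

print(np.isclose(np.dot(x_orth, y), 0.0))                    # True: x_orth is orthogonal to y
print(np.isclose(np.dot(x, y), r * np.linalg.norm(y) ** 2))  # True: <x, y> = r |y|^2
```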

We can go one step further. If we assume that both $x$ and $y$ have a magnitude of one, the dot product equals the scaling factor itself:

$$\begin{aligned} \langle x, y \rangle &= \langle x_y, y \rangle \\ &= \langle ry, y \rangle \\ &= r. \end{aligned}$$

Note that this scaling factor lies in the interval $[-1, 1]$. It is negative when $x_y$ and $y$ point in opposite directions.
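In code, this is just the previous sketch with both made-up vectors normalized to unit length first; the dot product then lands in $[-1, 1]$, and flipping $y$ makes it negative.

```python
import numpy as np

# Normalize two made-up vectors to unit length.
x = np.array([1.0, 2.0, 3.0]);  x = x / np.linalg.norm(x)
y = np.array([4.0, -1.0, 0.5]); y = y / np.linalg.norm(y)

r = np.dot(x, y)            # scaling factor of the parallel component
print(r, -1.0 <= r <= 1.0)  # a value in [-1, 1], so the check prints True

# Flipping y reverses the parallel component, making r negative.
print(np.dot(x, -y))        # -r
```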

Now comes the really interesting part! The scalar $r$ has a simple geometric meaning. To see this, let's illustrate what is happening. (Recall that we assumed $x$ and $y$ both have a magnitude of one.)

Cosine similarity

Since the cosine of an angle is the ratio of the adjacent side to the hypotenuse, and our hypotenuse $x$ has a magnitude of one, the scaling factor $r$ also equals the cosine of the angle between $x$ and $y$.

This is the reason why cosine similarity is defined this way:

$$\cos \alpha = \bigg\langle \frac{x}{|x|}, \frac{y}{|y|} \bigg\rangle.$$
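A minimal helper along these lines, with the illustrative name `cosine_similarity` (not a library function), checked against the angle recovered by `arccos`:

```python
import numpy as np

def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    """Dot product of the unit-length versions of x and y."""
    return float(np.dot(x / np.linalg.norm(x), y / np.linalg.norm(y)))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, -1.0, 0.5])

cos_alpha = cosine_similarity(x, y)
alpha = np.arccos(cos_alpha)   # the angle between x and y, in radians
print(cos_alpha, np.degrees(alpha))
```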

I hope that this short post helps you make sense of this concept, and armed with this knowledge, you'll be more confident when dealing with it!

Understanding math is a superpower in machine learning.

I am writing a book about it to help you go from high school mathematics to neural networks.
Join me on this journey and let's do this together!