# Universality and orthogonal polynomials

There is a phenomenon in physics and mathematics which has been captivating the minds of scientists for a long time and it is called simply “universality”. In brief, a phenomenon exhibits universality if, be as wild and diverse in microscopic scale as possible, a clear pattern emerges if one looks at it from macrosopic scale. I first encountered this phenomena from a technical point of view while studying asymptotic behavior of orthogonal polynomials, and it got me too. In this post we will see some examples of systems exhibiting universal behavior and we shall see how do orthogonal polynomials enter the picture.

First let’s see some examples. I’ve drawn these from [Deift] and [Tao]. For more details and examples, check out these awesome articles.

Examples from physics and mathematics

1. Nontrivial zeros of the Riemann zeta function. The Riemann zeta function ($\zeta(z)$ from now on) is one of the most interesting objects of mathematics for almost two hundred years. The Riemann hypothesis says that if $z_0$ is a zero of $\zeta(z)$, then either $z_0$ is an even negative integer or $\Re(z_0) = \frac{1}{2}$. As most of you know, this hypothesis is one of the Millenium Price Problems stated by the Clay Institute, and many fine mathematicians attempted to solve this conjecture. Although it is an interesting topic which deserves an entire post, I won’t discuss it in detail. There are many books and articles about this, I recommend this one. (I recommend this even if you solved Riemann conjecture and kept it as a secret.) For now, let’s just study the zeros on the line $\frac{1}{2} + iy$. (This line is called the critical line.) The plot of the absolute value of the Riemann zeta function in the critical line

What we can observe here is that the zeros of $\zeta(z)$ in the critical line are not too far nor too close to each other. No clustering and no large gaps can be seen here. As it turns out, in some sense, it is true in general. The number theorist Hugh Montgomery discovered that if $0 \leq z_1 \leq z_2 \leq \dots$ denotes the imaginary part of the zeros on the critical line, then after scaling them as

$$\widehat{z_j} := \frac{z_j \log z_j}{2\pi}$$,

we obtain that

$$R(a,b) := \lim_{n \to \infty} \frac{1}{n} |\{ (k_1, k_2): k_1 \neq k_2, 1 \leq k_1, k_2 \leq n, \widehat{z}_{k_1} - \widehat{z}_{k_2} \in (a,b) \}| \\ = \int_{a}^{b} 1 - \big(\frac{\sin \pi x}{\pi x} \big)^2 dx.$$

In other words, the distance of zeros follows a specific pattern. This formula, as we shall see later, will appear very unexpectedly in a different area of mathematics.

2. Scattering theory. Suppose that a nucleus is being shot with neutrons. Depending on the energy level of the neutron, it can bounce back, bounce off or go straight through the nucleus. If we count that how many neutrons pass through the nucleus and plot it against the energy level, we obtain this. Scattering plot for Gadolinium 156 nucleus. (The source of the image is , but the original one can be found in the article C. Coceva and M. Stefanon, Experimental aspects of the statistical theory of nuclear spectra fluctuations, Nuclear Physics A, 1979, vol. 315.)

Looking at the energy levels where the count peaks - the so-called scattering resonances - we can observe the same behavior as in the zeros of the Riemann zeta function. They are again not too far nor too close to each other in average. This phenomenon was successfully modelled by Eugene Wigner using random matrix theory. We will see later that the model proposed by Wigner can be used to describe a large number of phenomena in the nature.

3. Buses in Cuernavaca, Mexico. In the country of Mexico, there is a town called Cuernavaca. For reasons unknown to me, there is no organized public transport system in there. If you own a bus, you can participate in public transportation by picking up passengers along routes, who buy bus tickets directly from you. The system is basically governed by a set of “rules”:

• the buses are owned by the drivers,
• whom have to maximize their profit by picking up as many passengers as possible,
• and there are so-called spotters, who informs the drivers about when did the previous bus left the next bus stop.

If the previous bus picked up the passengers recently, the driver must slow down to maximize the number of passengers. On the other hand, if the previous bus left long ago, he should speed up, avoiding that somebody overtakes him.

Can you guess what will we find if we study the time elapsed between two subsequent buses at a bus stop? You are probably right: the waiting time between buses is neither too long nor too small. The physicists Milan Krbalek and Petr Seba noticed this behavior when attending a conference at the city. In their paper they did a detailed analysis of this and modelled it with random matrices. They calculated the distribution of time between subsequent buses and obtained the following. Source: Original: M. Krbalek and P. Seba, The statistical properties of the city transport in Cuernavaca (Mexico) and Random matrix ensembles.

The histogram is the empirical data and the continuous curve is the prediction given by their model. Can you guess what kind of model did they propose? Yes, you are right: random matrices.

A common ground: random matrix theory

1. A crash-course in random matrix theory. In order to get a good grip on universality limits (or at least, on their significance), we have to talk about random matrix theory very-very briefly. For a deeper discussion, I recommend the book [Deift-Gioev]. There are many excellent books out there, but from my point of view (to be more precise, from the orthogonal polynomials point of view) it is the most useful.

Let $H_n \subseteq{C}^{n \times n}$ denote the $n \times n$ Hermitian matrices. Select a matrix $\mathcal{M}$ from $H_n$ randomly such that the probability of $\mathcal{M}$ being chosen from $E \subseteq H_n$ is

$$P(\mathcal{M} \in E) = \int_E \lambda e^{-\text{tr} V(M)} dM,$$

where
$dM$ is the product measure on the algebraic independent entries of $\mathcal{M}$,
$\lambda$ is a normalizing constant (so that $P(\mathcal{M} \in H_n) = 1)$,
$V: \mathbb{R} \to \mathbb{R}$ is a function which tends to $0$ at $\pm \infty$ sufficiently fast.

The expression $V(\mathcal{M})$ is interpreted in the sense of functional calculus. (If this bothers you, feel free to assume that $V(x)$ is a polynomial $\sum_{k=0}^{m} a_k x^k$. This way $V(\mathcal{M})$ can be interpreted as $V(\mathcal{M}) = \sum_{k=0}^{m} a_k \mathcal{M}^k$.)

Example. Let $V(x) = x^2$, then the probability measure

$$P(\mathcal{M} \in E) = \int_E \lambda e^{-\text{tr}(M^2)} dM$$

is called a Gaussian unitary ensemble.

We only care about the eigenvalues of a random matrix. Why? Because they can be used to model various phenomena. For one, all three examples which we’ve seen can be modelled with the eigenvalues of a random matrix. (Do not forget that since we are looking at Hermitian matrices, all the eigenvalues are real.) We denote the eigenvalues of $\mathcal{M}$ with $\lambda_1(\mathcal{M}) \leq \dots \leq \lambda_n(\mathcal{M})$. (If not necessary, the dependence on $\mathcal{M}$ is omitted.) For our purposes it is enough to look at the function $p(x_1, \dots, x_n)$ for which

$$P((\lambda_1(\mathcal{M}), \dots, \lambda_n(\mathcal{M})) \in E) = \int_E p(x_1, \dots, x_n) dx_1 \dots dx_n.$$

This is the joint probability density of the ordered $n$-tuple of eigenvalues. With it, we can define the k-point correlation function as

$$R(x_1, \dots, x_k) = \frac{n!}{(n-k)!} \int \dots \int p(x_1, \dots, x_n) dx_{k+1} \dots dx_n.$$

Note that $R_k$ depends on $n$. As it can be found in an article of Craig A. Tracy and Harold Widom,

“It is, loosely speaking, the probability density that $k$ of the eigenvalues, irrespective of order, lie in the infinitesimal neighborhoods of $x_1, \dots, x_k.$ (It is not a probability density in the strict sense since its total integral equals $n!/(n-k)!$ rather then $1$.)”

The main question for us is how does $R_k$ behave when $n \to \infty$? To provide an answer, we will deploy our favourite (or at least, my favourite) tools: orthogonal polynomials.

2. Orthogonal polynomials $\heartsuit$ random matrices. Recall that the orthogonal polynomials with respect to a Borel measure $\mu$ are defined as the unique orthonormal system of polynomials $\{p_n(x,\mu) = p_n(x)\}_{n=0}^{\infty}$ for which $p_n(x) = \gamma_n x^n + \dots, \gamma_n > 0$. (For a brief introduction, see the wikipedia page or the first post.) Every polynomial $\Pi_n$ of degree $n$ can be written as a linear combination of orthogonal polynomials

$$\Pi_n(x) = \sum_{k=0}^{n} a_k p_k(x).$$

If we introduce the so-called reproducing kernel (or kernel in short)

$$K_n(x,y,\mu) = K_n(x,y) = \sum_{k=0}^{n} p_k(x)p_k(y),$$

the above linear combination can be written in the form

$$\Pi_n(x) = \int K_n(x,y) \Pi_n (y) d\mu(y).$$

Now let $\mu$ be a Borel measure on the real line, and for simplicity suppose that $d\mu(x) = w(x) dx$ with $w(x) = e^{-V(x)}$, as used in the previous section about random matrices. It turns out that the k-point correlation function $R_k(x_1, \dots, x_k)$ can be expressed in terms of the reproducing kernel as

$$R_k(x_1, \dots, x_k) = \det (\sqrt{w(x) w(y)}K_n(x_i,x_j))_{i,j=1}^{k}.$$

If we introduce the normalized reproducing kernel $\widetilde{K_n}(x,y) = w(x)^{1/2}w(y)^{1/2} K_n(x,y)$, the above formula can be written as

$$R_k(x_1, \dots, x_k) = \det (\widetilde{K_n}(x_i,x_j))_{i,j=1}^{k}.$$

Thus, using orthogonal polynomials, this connection can be exploited heavily. In the following section we shall see a way to do this.

Universality limits

1. A first look at universality limits. If $\mu$ is a Borel measure on $\mathbb{R}$ with finite moments, we say that universality holds at $x_0$ if

$$\lim_{n \to \infty} \frac{K_n(x_0 + a/n, x_0 + b/n)}{K_n(x_0, x_0)} = \frac{\sin \pi (b-a)}{\pi (b-a)}$$

for $a, b \in \mathbb{R}$ uniformly in compact sets. These kind of limits will be called universality limits from now on. The title “universal” seems justified, since the right hand side does not depend on the original measure. This limit condition may seem a bit random, but if we look more closely, we can find similar phenomena in other places. For example, the central limit theorem says that if $X_1, X_2, \dots$ are independent identically distributed random variables, then (under some additional conditions) we have

$$\frac{\sum_{i=1}^{n} (X_i - E(X_1))}{\sqrt{n}} \to \mathcal{N}(0,\sigma^2)$$

in distribution, where $\mathcal{N}(0,\sigma^2)$ denotes the normal distribution with mean $0$ and variance $\sigma^2$, moreover $E(X_1)$ denotes the expected value of $X_1$. Another similar statement is the law of large numbers. If $X_1, X_2, \dots$ are again independent identically distributed random variables, then

$$\frac{\sum_{i=1}^{n} X_i}{n} \to E(X_1)$$

almost everywhere. One common thing about the central limit theorem and the law of large numbers is that a scaling appears in both limits. Although it is not clear immediately, it is also the case for universality limits.

2. What if universality holds? If universality holds at $x_0$, then we have

$$\lim_{n \to \infty} \frac{1}{\widetilde{K_n}(x_0,x_0,\mu)^k} R_k(x_0 + \xi_1/n, \dots, x_0 + \xi_k/n)$$

$$= \lim_{n \to \infty} \det \Big( \frac{\widetilde{K_n}(x_0 + \xi_i/n, x_0 + \xi_j/n)}{\widetilde{K_n}(x_0,x_0)} \Big)_{i,j=1}^{n}$$

$$= \det \Big( \frac{\sin \pi(\xi_i - \xi_j)}{\pi (\xi_i - \xi_j)} \Big)_{i,j=1}^{n}.$$

The following theorem says that what we did above is basically a scaling.

Theorem 1. If $\mu$ is a Borel measure on $\mathbb{R}$ and
(i) $\text{supp}(\mu) = E$,
(ii) $\mu$ is absolutely continuous with $d\mu(x) = w(x)dx$, where $w$ is continuous and $w(x) > 0$ almost everywhere on $E$,
then for all $x_0 \in \text{int}(E)$ we have

$$\lim_{n \to \infty} \frac{n}{\widetilde{K_n}(x_0,x_0)} = \frac{1}{\omega_E(x_0)}$$,

where the function $\omega_E$ depends only on the set $E$ and not on the measure $\mu$. $\Box$

For example, if we take $d\mu(x) = \frac{1}{\sqrt{1-x^2}}$, we see the following. $$n/\widetilde{K_n}(x,x,\mu)$$, where $$n \in \{1,2,\dots,100\}.$$

This limit distribution is known by physicists as Wigner semicircle distribution, and it appears frequently in random matrix theory.

Using Theorem 1, we have

$$\lim_{n \to \infty} \frac{1}{(n\omega_E(x_0))^k} R_k(x_0 + \xi_1/n, \dots, x_0 + \xi_n/n)$$

$$= \lim_{n \to \infty} \frac{\widetilde{K_n}(x_0,x_0)^k}{n^k \omega_E(x_0)^k} \frac{1}{\widetilde{K_n}(x_0,x_0,\mu)^k} R_k(x_0 + \xi_1/n, \dots, x_0 + \xi_k/n)$$

$$= \det \Big( \frac{\sin \pi (\xi_i - \xi_j)}{\pi (\xi_i - \xi_j)} \Big)_{i,j=1}^{n}.$$

It can be clearly seen that this limit is indeed a scaling.

3. How do we establish universality limits? I’ve talked about consequences of universality but I haven’t discussed how to actually prove universality limits. First let’s assume that the following theorem holds.

Theorem 2. If $\mu$ is the Lebesgue mesure in $[-1,1]$ (i.e. $d\mu(x) = \chi_{[-1,1]}(x)dx$), then universality holds for all $x_0 \in (-1,1)$, that is

$$\lim_{n \to \infty} \frac{\widetilde{K_n}(x_0 + a/n, x_0 + b/n, \mu)}{\widetilde{K_n}(x_0, x_0, \mu)} = \frac{\sin \pi (b-a)}{\pi (b-a)}$$

for $a,b$ uniformly in compact subsets. $\Box$

This theorem can be generalized in more than one ways, for example
• we can study more general measures supported on $[-1,1]$,
• or study measures supported on more complicated sets.

Comparison method of Lubinsky. Let’s see the first one. Suppose that
$\mu$ and $\nu$ are two Borel measures on $[-1,1]$,
$x_0 \in (-1,1)$,
• and suppose that $\mu$ and $\nu$ is equal in some neighbourhood of $x_0$.

The following theorem was proven by Doron S. Lubinsky, and it was a huge breakthrough in the study of universality limits.

Theorem 3. (D. S. Lubinsky) If universality holds for $\mu$ at $x_0$ and both measures are “nice” in some sense (but can be much less nicer then the Lebesgue measure), then universality holds for $\nu$ at $x_0$. $\Box$

The theorem and the proof can be found in its full beauty in the article [Lubinsky]. I tried to avoid technical details here to compress the theorem into a fully digestible form, but if you are interested in those details (which will be discussed in a future post), I strongly recommend this article.

Theorem 3 can be used to establish universality limits for a bunch of measures. For example, let

$\mu$ be the Lebesgue measure on $[-1,1]$,
$\nu$ be an absolutely continuous measure with $d\mu(x) = w(x) dx$,
$w > 0$ almost everywhere on $[-1,1]$,
• and $w(x) = 1$ in some neighbourhood of $x_0$.

Then, since universality holds for the Lebesgue measure, it also holds for our much general measure $\nu$. The big thing about the comparison method is that before this article, universality was known only for analytic weights. As you can see in this previous example, universality can be established for continuous weights, which is a huge step.

The polynomial inverse image method of Totik. This time let’s take a look at measures with compact support. We will extend Theorem 2 in two steps. First let $T_N(x)$ be a polynomial of degree $N$ and suppose that if $t_0$ is a local extrema for $T_N$, then $\|T_N(t_0)\| \geq 1$. A polynomial like this will be called admissible. Now let $\mu$ be a Borel measure supported on $[-1,1]$ and suppose that universality holds for $\mu$ at $0$. With this setup, the following theorem holds.

Theorem 4. (V. Totik) If $\nu$ is the pullback measure for $\mu$ on $E = T_{N}^{-1}([-1,1])$ and $T_N(x_0) = 0$, then universality holds for $\nu$ at $x_0$. $\Box$

This takes care of measures with support $T_{N}^{-1}([-1,1])$, where $T_N$ is an admissible polynomial. Now let $\mu$ is a measure with support $K \subseteq \mathbb{R}$ where $K$ is an arbitrary compact set. It turns out that in some sense $K$ can be approximated with sets of the form $T_{N}^{-1}([-1,1])$ with arbitrary precision! Using this, we can compare our measure $\mu$ to other measures supported on the approximating sets. Overall, the following theorem holds.

Theorem 5. If $\mu$ is a “nice” measure with compact support, $x_0 \in \text{supp}(\mu)$ and $d\mu(x) = w(x) dx$ in a neighbourhood of $x_0$ with positive and continuous $w$, then universality holds at $x_0$. $\Box$

The method of polynomial inverse images was perfected in [Totik1], where the process is described in detail. Originally it was used to prove polynomial inequalities on compact sets, but the method found many applications, for example universality limits, which can be found in [Totik2].

Epilogue: an anecdote from Princeton

I finish this long post with an anecdote. It can be found in numerous articles, for example [Deift].

Some years ago Hugh Montgomery, when his paper about the nontrivial zeros of the Riemann zeta function was published (see the begining of the post), visited the Institute for Advanced Study in Princeton. There he met Freeman Dyson, the legendary mathematical physicist. They had a tea together, and before Montgomery could describe his result to him, Dyson asked

And did you get this?

Then he picked up a pen and wrote down the formula

$$\int_{a}^{b} 1 - \big( \frac{\sin \pi x}{\pi x} \big)^2 dx.$$

Montgomery was shocked, because it was exactly his result. Then Dyson replied

If the zeros of the zeta function behaved like the eigenvalues of a random Gaussian unitary matrix, then it would be exactly the formula for the 2-point correlation function!

I think this justifies the label “universality”.

References. [Deift] Percy Deift, Universality for mathematical and physical systems, available at arXiv
[Deift-Gioev] Percy Deift and Dimitri Gioev, Random matrix theory: invariant ensembles and universality, Courant Lecture Notes, 2009
[Lubinsky] Doron S. Lubinsky, A New Approach to Universality Limits Involving Orthogonal Polynomials, Annals of Mathematics, 170 (2009), 915-939.
[Tao] Terence Tao, A second draft of a non-technical article on universality, blog post
[Totik1] Vilmos Totik, Polynomial inverse images and polynomial inequalities, Acta Mathematica, 187 (2001), 139-160
[Totik2] Vilmos Totik, Universality and fine zero spacing on general sets, Arkiv för Mathematik, 47 (2009), 361-391

Updated: