Notes on Gaussians

Posted on March 11, 2015

The Gaussian or normal distribution has been quite a mystery to me for quite some time. Sure, the pretty pictures with the bell curves sort of help, but they certainly do not explain why this monster gets a right to roam the earth:

\[ \frac{1}{\sqrt{(2 \pi)^n \det \Sigma}} \int_X \exp(-\frac{1}{2}(x - \mu)^\top\Sigma^{-1}(x - \mu)) dx \]

So I thought it would be a useful exercise to try and derive that distribution. I’ve heard that you don’t really understand something unless you are able to explain it, so here we go. I don’t know if this article is going to be helpful, but writing it down and publishing it certainly helps me to stay focused and work on the topic for a while.

I will go straight to the “multivariate” Gaussian distribution, i.e. the \(n\)-dimensional one; the \(1\)-dimensional “univariate” distribution is then a special case. This way some choices become natural that would seem arbitrary in one dimension.

In some places I will use integrals which are tricky to calculate. Since I haven’t found a calculation for them on the internet and since I found the resulting values quite surprising, I have included the full calculations in the appendix, i.e. the last section of this article. Despite their rather inconspicuous position, they are really the meat of this post and a great exercise, so feel free to scroll down when things seem strange.

Preliminary Thoughts

We want to model \(n\) inter-dependent quantities that have a “typical” value but, due to the nature of our world, are randomly perturbed. Strong deviations from the typical value should be very unlikely, while small deviations can happen all the time.

Formally: Given a vector \(\mu \in \mathbb{R}^n\) and a symmetric positive definite matrix \(\Sigma \in \mathbb{R}^{n \times n}\), we are looking for a probability distribution \(P : \mathcal{B}(\mathbb{R}^n) \to [0, 1]\) on \(\mathbb{R}^n\) with the following qualities: its expected value should be \(\mu\), and for \(1 \leq i, j \leq n\) the covariance \(\text{Cov}_P(x_i, x_j)\) of the canonical projections \(x_i, x_j : \mathbb{R}^n \to \mathbb{R}\) should be \(\Sigma_{i, j}\).

Deviations from the Expectation

In order to formalize the notion of how much we deviate from the expected value, we can use the Mahalanobis distance: Since \(\Sigma\) is positive definite, it is also invertible, and we can define the Mahalanobis distance on \(\mathbb{R}^n\) parameterized by \(\mu\) and \(\Sigma\) by

\[ D_M : \mathbb{R}^n \to \mathbb{R}_+\\ x \mapsto \sqrt {(x - \mu)^\top \Sigma^{-1} (x - \mu)}. \]

Intuitively, this is the Euclidean norm of \(x - \mu\), i.e. the distance from \(x\) to \(\mu\), scaled by \(\Sigma^{-1}\). In the case of \(n = 1\), we have \(\Sigma = \sigma^2 \in \mathbb{R}\) and \(D_M\) becomes

\[x \mapsto \frac{|x - \mu|}{\sigma}.\]

This matches the \((x - \mu)^\top \Sigma^{-1} (x - \mu)\) term from the finished Gaussian distribution quite closely, so we know we are on the right track.
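To make this concrete, here is a minimal numerical sketch of \(D_M\) in numpy; the particular \(\mu\) and \(\Sigma\) are made up purely for illustration:

```python
import numpy as np

# Hypothetical parameters, n = 2, chosen only for illustration.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])  # symmetric positive definite

def mahalanobis(x, mu, Sigma):
    """Mahalanobis distance of x from mu with respect to Sigma."""
    d = x - mu
    # solve(Sigma, d) computes Sigma^{-1} d without forming the inverse explicitly.
    return np.sqrt(d @ np.linalg.solve(Sigma, d))

print(mahalanobis(np.array([2.0, -1.0]), mu, Sigma))
```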

Exponential Decline

We now need a function \(\mathbb{R}^n \to \mathbb{R}^+\) that is symmetric around \(\mu\), has a unique maximum there and declines exponentially away from it. \(x \mapsto \exp(-x)\) is the canonical choice for such a decline. A first idea would be to use \(x \mapsto \exp(-D_M(x))\), but that function has a kink at \(\mu\) and lacks the smoothness we would like. Squaring the \(D_M(x)\) term and adding a factor of \(\frac{1}{2}\), which will make the covariance come out as exactly \(\Sigma\) later, we get

\[x \mapsto \exp(-\frac{1}{2}D_M(x)^2).\]

Normalizing

In order to use that function as the density of a probability distribution, it needs to integrate to \(1\) on \(\mathbb{R}^n\). So we need to calculate its integral and normalize the function appropriately; that is, we need to calculate

\[ \int_{\mathbb{R}^n} \exp(-\frac{1}{2}D_M(x)^2) dx = \int_{\mathbb{R}^n} \exp(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)) dx \]

Remember that we required \(\Sigma\) to be symmetric positive definite; then so is its inverse. Thus \(\Sigma^{-1}\) admits a Cholesky decomposition \(\Sigma^{-1} = B^\top B\) for some matrix \(B \in \mathbb{R}^{n \times n}\). Then we can write

\[ (x - \mu)^\top \Sigma^{-1} (x - \mu) = (x - \mu)^\top B^\top B (x - \mu) = (B (x - \mu))^\top (B (x - \mu)) \]

and use a transformation

\[ T_n : \mathbb{R}^n \to \mathbb{R}^n \\ x \mapsto \frac{1}{\sqrt{2}} B(x - \mu). \]
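As a quick numerical sanity check of the factorization step (a sketch only; the concrete \(\mu\) and \(\Sigma\) are made up, and note that numpy’s Cholesky routine returns a lower-triangular \(L\) with \(L L^\top = \Sigma^{-1}\), so \(B = L^\top\) gives the form used above):

```python
import numpy as np

# Illustrative parameters only.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# np.linalg.cholesky(A) returns a lower-triangular L with A = L @ L.T,
# so B = L.T satisfies Sigma^{-1} = B.T @ B.
L = np.linalg.cholesky(np.linalg.inv(Sigma))
B = L.T

x = np.array([0.3, 0.7])
lhs = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
rhs = np.sum((B @ (x - mu)) ** 2)
print(np.isclose(lhs, rhs))  # True
```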

Differentiating \(T_n\) with respect to \(x\) yields

\[ \frac{\partial T_n}{\partial x}(x) = \frac{1}{\sqrt{2}} B \Rightarrow \det(\frac{\partial T_n}{\partial x}(x)) = (2^n \det(\Sigma))^{-\frac{1}{2}}. \]
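To see why the determinant comes out this way, note that the Cholesky factor \(B\) has positive diagonal entries and hence a positive determinant, so

\[ \det\left(\frac{1}{\sqrt{2}} B\right) = 2^{-\frac{n}{2}} \det(B) = 2^{-\frac{n}{2}} \sqrt{\det(B^\top B)} = 2^{-\frac{n}{2}} \sqrt{\det(\Sigma^{-1})} = (2^n \det(\Sigma))^{-\frac{1}{2}}. \]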

Applying \(T_n\) to the integral leaves us with

\[ \int_{\mathbb{R}^n} \exp(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)) dx = \int_{\mathbb{R}^n} \exp(-T_n(x)^\top T_n(x)) dx =\\ \sqrt {2^n \det(\Sigma)} \int_{\mathbb{R}^n} \exp(-||s||^2) ds = \sqrt {(2\pi)^n \det(\Sigma)} \]

You can find the derivation of that last integral in the appendix. Now we can normalize the function to arrive at the Gaussian distribution:

\[ \frac{1}{\sqrt {(2\pi)^n \det(\Sigma)}} \int_X \exp(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)) dx \]
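As a rough numerical sanity check (not part of the derivation, and with a made-up \(2\)-dimensional \(\Sigma\)), we can sum the density over a grid and confirm that it comes out close to \(1\):

```python
import numpy as np

# Illustrative parameters only.
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
norm_const = 1.0 / np.sqrt((2 * np.pi) ** 2 * np.linalg.det(Sigma))

# Crude Riemann sum over a grid that covers practically all of the mass.
grid = np.linspace(-8.0, 8.0, 400)
h = grid[1] - grid[0]
X, Y = np.meshgrid(grid, grid)
diff = np.stack([X - mu[0], Y - mu[1]], axis=-1)               # shape (400, 400, 2)
quad = np.einsum('...i,ij,...j->...', diff, Sigma_inv, diff)   # (x-mu)^T Sigma^{-1} (x-mu)
density = norm_const * np.exp(-0.5 * quad)
print(density.sum() * h * h)  # close to 1
```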

Expected Value

Let us now check that \(\mu\) really is the expected value of that distribution. This should be quite clear intuitively, since the density is symmetric around \(\mu\) and has its maximum there, but it doesn’t hurt to check it formally.

In order to do that, we can calculate the expected value of \(x - \mu\) under the distribution and check whether it is \(0\). In other words, the following integral should vanish:

\[ \frac{1}{\sqrt{(2\pi)^n \det(\Sigma)}} \int_{\mathbb{R}^n} (x - \mu) \exp(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)) dx \]

First, recall \(T_n\) from the section on normalizing. Solving \(T_n(x)\) for \(x\) yields

\[ x = \sqrt{2} B^{-1} T_n(x) + \mu\]

We can then apply that transformation to the integral and get

\[ \frac{1}{\sqrt{(2\pi)^n \det(\Sigma)}} \int_{\mathbb{R}^n} (x - \mu) \exp(-T_n(x)^\top T_n(x)) dx =\\ \pi^{-\frac{n}{2}} \sqrt{2}B^{-1} \int_{\mathbb{R}^n} s \exp(-s^\top s) ds = 0. \]

Again, you can find the calculation for the last integral in the appendix.
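For an empirical check, we can lean on numpy’s built-in multivariate normal sampler, which draws from exactly this density; the sample mean should then approach \(\mu\) (the concrete parameters below are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
mu = np.array([1.0, -2.0])      # illustrative parameters only
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

samples = rng.multivariate_normal(mu, Sigma, size=200_000)
print(samples.mean(axis=0))     # close to [1.0, -2.0]
```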

Covariance

One last thing is left to check: the (co)variance of the distribution. We supplied the covariance matrix \(\Sigma\) as a parameter and expect the distribution to actually have these covariances. In other words, we check that

\[ \Sigma = \frac{1}{\sqrt{(2\pi)^n \det(\Sigma)}} \int_{\mathbb{R}^n} (x - \mu)(x - \mu)^\top \exp(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)) dx. \]

As in the previous section, we apply the transformation \(T_n\) and get

\[ \frac{1}{\sqrt{(2\pi)^n \det(\Sigma)}} \int_{\mathbb{R}^n} (\sqrt{2}B^{-1}T_n(x))(\sqrt{2}B^{-1}T_n(x))^\top \exp(-T_n(x)^\top T_n(x)) dx. \]

When we pull out the multiplications, we get

\[ \frac{2}{\sqrt{(2\pi)^n \det(\Sigma)}}B^{-1} \int_{\mathbb{R}^n} T_n(x)T_n(x)^\top \exp(-T_n(x)^\top T_n(x)) dx (B^{-1})^\top. \]

Substituting \(s\) for \(T_n(x)\) lets us get rid of some of the constants, and with a calculation you can find in the appendix we arrive at

\[ 2\pi^{-\frac{n}{2}}B^{-1} \int_{\mathbb{R}^n} ss^\top \exp(-s^\top s) ds (B^{-1})^\top = 2\pi^{-\frac{n}{2}}B^{-1} \frac{1}{2}\pi^{\frac{n}{2}} I (B^{-1})^\top = B^{-1}(B^{-1})^\top = \Sigma \]
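The same sampling trick as before gives an empirical check of the covariance; again the concrete \(\mu\) and \(\Sigma\) are just illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
mu = np.array([1.0, -2.0])      # illustrative parameters only
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

samples = rng.multivariate_normal(mu, Sigma, size=500_000)
print(np.cov(samples, rowvar=False))  # close to Sigma
```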

Conclusion

Preparing this article, I discovered that doing analysis is actually quite fun if you don’t do it for an exam or a homework exercise. Especially with formulas this absurd there is a certain satisfaction when everything turns out just as it should. I hope reading this article proves as helpful to you, dear reader, as writing it was for me.

Appendix: Calculating the Integrals

Lemma 1: For all \(k > 0\) \[ \int_{\mathbb{R}^k} \exp(-||x||^2) dx = \pi^{\frac{k}{2}}. \]

Proof: By induction over \(k\). Starting with \(k = 1\), we can calculate the square of the desired integral with polar coordinates

\[ (\int_{\mathbb{R}} \exp(-s^2) ds)^2 = \int_{\mathbb{R}} \int_{\mathbb{R}} \exp(-s^2) \exp(-t^2) ds dt =\\ \int_{\mathbb{R}^2} \exp(-(s^2 + t^2)) d(s, t) = \int_0^{2 \pi} \int_{0}^{\infty} r \exp(-r^2) dr d\phi = 2 \pi \int_{0}^{\infty} r \exp(-r^2) dr \]

Transforming with \(r \mapsto r^2\), this becomes \(\pi\), so the one-dimensional integral itself is \(\sqrt{\pi}\).
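Spelled out, with \(u = r^2\) and \(du = 2r \, dr\), the last integral gives

\[ 2 \pi \int_{0}^{\infty} r \exp(-r^2) dr = 2 \pi \cdot \frac{1}{2} \int_{0}^{\infty} \exp(-u) du = \pi. \]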

For dimension \(k + 1\), by induction

\[ \int_{\mathbb{R}^{k+1}} e^{-||x||^2} dx = \int_\mathbb{R} \int_{\mathbb{R}^{k}} \exp(-(||x||^2 + y^2)) dx dy =\\ \int_\mathbb{R} \exp(-y^2) \int_{\mathbb{R}^{k}} \exp(-||x||^2) dx dy = \pi^{\frac{k}{2}} \int_\mathbb{R} \exp(-y^2) dy = \pi^{\frac{k}{2}} \sqrt{\pi} = \pi^{\frac{k+1}{2}}. \]
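If you want to convince yourself numerically, a crude Riemann sum over a grid (just a sketch, for \(k = 2\)) reproduces the value \(\pi\):

```python
import numpy as np

# Crude grid check of Lemma 1 for k = 2: the value should be pi.
grid = np.linspace(-6.0, 6.0, 600)
h = grid[1] - grid[0]
X, Y = np.meshgrid(grid, grid)
print(np.sum(np.exp(-(X**2 + Y**2))) * h * h, np.pi)
```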

Lemma 2: For all \(k > 0\)

\[\int_{\mathbb{R}^k} s \exp(-s^\top s) ds = 0\]

Proof: By induction over \(k\). For the base case, splitting the integral at zero and transforming each half with \(s \mapsto s^2\) shows that the two halves cancel:

\[ \int_{\mathbb{R}} s \exp(-s^2) ds = 0 \]

For dimension \(k+1\), we first calculate the \(i\)th component of the integral for \(1 \leq i \leq k\) by induction:

\[ \int_{\mathbb{R}^{k + 1}} s_i \exp(-s^\top s) ds = \int_{\mathbb{R}} \int_{\mathbb{R}^k} s_i \exp(-s^\top s - t^2) ds dt =\\ \int_{\mathbb{R}} \exp(-t^2) dt \int_{\mathbb{R}^k} s_i \exp(-s^\top s) ds = 0 \]

And for the \((k+1)\)th component:

\[ \int_{\mathbb{R}^{k + 1}} s_{k+1} \exp(-s^\top s) ds = \int_{\mathbb{R}^k} \int_{\mathbb{R}} t \exp(-s^\top s - t^2) dt ds =\\ \int_{\mathbb{R}^k} \exp(-s^\top s) ds \int_{\mathbb{R}} t \exp(-t^2) dt = 0. \]

Lemma 3: For all \(k > 0\)

\[ \int_{\mathbb{R}^k} ss^\top \exp(-s^\top s) ds = \frac{1}{2}\pi^{\frac{k}{2}} I \]

Proof: Observe that \((ss^\top)_{i, j} = s_i s_j\). We first show the claim on the diagonal by induction. For \(k = 1\), integration by parts yields

\[ \int_{\mathbb{R}} s^2 \exp(-s^2) ds = -\frac{1}{2} \int_{\mathbb{R}} s (-2s\exp(-s^2)) ds = 0 + \frac{1}{2} \int_{\mathbb{R}} \exp(-s^2) ds = \frac{\sqrt{\pi}}{2}. \]

Suppose \(1 \leq i \leq k\) for some dimension \(k\). Then by induction

\[ \int_{\mathbb{R}^{k + 1}} s_i^2 \exp(-s^\top s) ds = \int_{\mathbb{R}^k} \int_{\mathbb{R}} s_i^2 \exp(-s^\top s - t^2) dt ds =\\ \int_{\mathbb{R}^k} s_i^2 \exp(-s^\top s) \int_{\mathbb{R}} \exp(-t^2) dt ds =\\ \sqrt{\pi} \int_{\mathbb{R}^k} s_i^2 \exp(-s^\top s) ds = \frac{1}{2} \pi^{\frac{k + 1}{2}}. \]

Now consider the new diagonal element:

\[ \int_{\mathbb{R}^{k + 1}} s_{k+1}^2 \exp(-s^\top s) ds = \int_{\mathbb{R}} \int_{\mathbb{R}^k} t^2 \exp(-s^\top s - t^2) ds dt =\\ \int_{\mathbb{R}} t^2 \exp(-t^2) \int_{\mathbb{R}^k} \exp(-s^\top s) ds dt =\\ \pi^{\frac{k}{2}} \int_{\mathbb{R}} t^2 \exp(-t^2) dt = \frac{1}{2} \pi^{\frac{k + 1}{2}}. \]

Now we wish to show that the integral vanishes outside of the diagonal. For dimension \(k = 1\) this is trivial, since there are no off-diagonal entries; so suppose \(k > 1\). Let \(1 \leq i < j \leq k\), then

\[ \int_{\mathbb{R}^k} s_i s_j \exp(-s^\top s) ds = \int_{\mathbb{R}^2} \int_{\mathbb{R}^{k - 2}} xy \exp(-s^\top s - x^2 - y^2) ds d(x, y) =\\ \int_{\mathbb{R}^2} xy \exp(-x^2 - y^2) \int_{\mathbb{R}^{k - 2}} \exp(-s^\top s) ds d(x, y) =\\ \pi^{\frac{k-2}{2}}\int_{\mathbb{R}^2} xy \exp(-x^2 - y^2) d(x, y) \]

Using polar coordinates this becomes

\[ \int_0^{2\pi} \int_0^\infty r^3 \cos(\phi)\sin(\phi) \exp(-r^2) dr d\phi = \int_0^{2\pi} \cos(\phi)\sin(\phi) d\phi \int_0^\infty r^3 \exp(-r^2) dr = 0, \]

since \(\int_0^{2\pi} \cos(\phi)\sin(\phi) d\phi = \frac{1}{2}\int_0^{2\pi} \sin(2\phi) d\phi = 0\). This completes the proof.
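Finally, the same kind of crude grid check as for Lemma 1 (again just a numerical sketch, for \(k = 2\)) reproduces the value \(\frac{1}{2}\pi^{\frac{k}{2}} I = \frac{\pi}{2} I\) from Lemma 3:

```python
import numpy as np

# Crude grid check of Lemma 3 for k = 2: the matrix should be (pi/2) * I.
grid = np.linspace(-6.0, 6.0, 600)
h = grid[1] - grid[0]
X, Y = np.meshgrid(grid, grid)
w = np.exp(-(X**2 + Y**2)) * h * h
M = np.array([[np.sum(X * X * w), np.sum(X * Y * w)],
              [np.sum(X * Y * w), np.sum(Y * Y * w)]])
print(M)           # close to [[pi/2, 0], [0, pi/2]]
print(np.pi / 2)
```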