Any student of mathematics learns that the determinant of a matrix can be interpreted as the amount that the matrix distorts signed volume. The trace is another linear algebraic invariant but its interpretations are not common knowledge. This is somewhat troubling since the trace, being such a basic invariant, occurs in many proofs across mathematics. If we view a proof as an explanation of a phenomenon then some part of the explanation consequently remains mysterious.
Interpretations of the trace do however exist! There are a number of different but related explanations which can be useful in different contexts. The purpose of this post is to provide some interpretations which I have found to be particularly useful. More precisely, the following interpretations will be justified:
- The trace is the aggregated amount of self-mapping.
- The normalized trace is the expected spectral measurement. (For self-adjoint operators.)
- The trace is a smooth dimension-counting function. (For positive semidefinite operators.)
Each interpretation is first based on mathematical arguments and thereafter applied to some nontrivial examples. These examples may be viewed as the main content of the post since they show that one can really de-mystify statements using the interpretations.
Throughout this post let $V$ denote a finite-dimensional vector space over the complex numbers and let $L:V\to V$ denote a linear operator.
The trace is the aggregated amount of self-mapping.
I here argue that the trace measures to what extent the operator maps vectors back onto themselves. Three different arguments are provided after which I discuss some examples of this interpretation.
Basis-dependent argument
A particularly simple but non-canonical argument to arrive at this conclusion may be found by picking a basis $e_1,\ldots,e_n$ for $V$. Given such a basis we may identify $L$ with a matrix in $\mathbb{C}^{n\times n}$ and it holds that $$\operatorname{Tr}L = \sum_{i=1}^n L_{ii}.$$ Here $L_{ii}$ is nothing more than the coordinate at $e_i$ of $Le_i$ which could indeed be viewed as some type of self-mapping.
This perspective is clarified further if we additionally equip $V$ with an inner product $\langle \cdot, \cdot \rangle$ and assume that the basis $e_1,\ldots,e_n$ is orthonormal so that $$\operatorname{Tr}L = \sum_{i=1}^n \langle e_i, Le_i \rangle.$$ Then it is clear that $\langle e_i, L e_i \rangle$ measures to what extent $e_i$ is mapped back onto itself. Consequently, the trace measures the amount of self-mapping when aggregated over all basis vectors $e_i$.
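As a quick numerical sanity check, here is a minimal sketch in Python with NumPy (my own illustration, not part of the mathematical argument) verifying that the aggregated self-mapping $\sum_i \langle e_i, Le_i \rangle$ does not depend on the choice of orthonormal basis:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
L = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

# Standard orthonormal basis: the sum of the diagonal entries.
trace_standard = sum(L[i, i] for i in range(n))

# A second orthonormal basis from the QR decomposition of a random
# matrix; the columns q[:, i] are orthonormal.
q, _ = np.linalg.qr(rng.standard_normal((n, n)))
trace_rotated = sum(np.vdot(q[:, i], L @ q[:, i]) for i in range(n))

print(np.allclose(trace_standard, trace_rotated))  # True
```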
Example. (Dilation, rotation, reflection)
Let $\operatorname{dim}V = 2$ and consider the following matrices which represent a dilation by a factor $3$, a reflection about the $x$-axis and a rotation by $90$ degrees respectively: $$L_{\text{Dilate}} = \begin{pmatrix} 3 & 0 \\ 0 & 3 \end{pmatrix},\ L_{\text{Reflect}} = \begin{pmatrix} 1 & 0\\ 0 & -1 \end{pmatrix},\ L_{\text{Rotate}} = \begin{pmatrix} 0 & -1\\ 1 & 0 \end{pmatrix}.$$
Then $L_{\text{Dilate}}$ maps any vector onto itself, dilated by a factor $3$. Consequently, the aggregated amount of self-mapping is $$\operatorname{Tr} L_{\text{Dilate}} = 3 \operatorname{dim}V = 6.$$ The reflection $L_{\text{Reflect}}$ maps the vector $(1,0)$ onto itself in a positive direction but maps the vector $(0,1)$ onto itself in a negative direction. The aggregated amount of self-mapping is hence $$\operatorname{Tr}L_\text{Reflect} = 1 - 1 = 0.$$ Finally, $L_\text{Rotate}$ rotates every real vector to a vector which is orthogonal to it. Consequently, there is no self-mapping and $$\operatorname{Tr}L_\text{Rotate} =0.$$
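For the skeptical reader, a three-line NumPy check (again my own illustration) confirms these traces:

```python
import numpy as np

L_dilate = np.array([[3, 0], [0, 3]])
L_reflect = np.array([[1, 0], [0, -1]])
L_rotate = np.array([[0, -1], [1, 0]])

# Aggregated self-mapping: 6, 0 and 0 respectively.
print(np.trace(L_dilate), np.trace(L_reflect), np.trace(L_rotate))
```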
Eigenspace arguments
The foregoing argument is simple but somewhat unsatisfactory since a basis and an inner product ought not to be necessary to understand the trace. The trace is namely a canonical property of linear operators on vector spaces, meaning that it is basis-independent. To get more insight we could rather rely on the canonical definition of the trace involving eigenvalues.
For simplicity, let us assume that $L$ is diagonalizable with real eigenvalues. Let $\lambda_1,\ldots,\lambda_n\in \mathbb{R}$ denote the eigenvalues of $L$ and let $v_1,\ldots,v_n$ be an associated eigenbasis. Then, by definition of eigenvalues, $Lv_i = \lambda_i v_i$ which we may interpret as stating that the eigenvalue $\lambda_i$ measures how much $v_i$ is mapped back onto itself by $L$. If $\lambda_i \gg 0$ then $v_i$ is resolutely mapped onto itself. On the other hand, $\lambda_i<0$ can be viewed as negative self-mapping. The aggregated amount of self-mapping is then given by $$\operatorname{Tr}L = \sum_{i=1}^n \lambda_i.$$ Complex eigenvalues are harder to visualize but it is possible with a bit of imagination: an eigenvalue $\lambda_i = r e^{i\theta}$ can namely be viewed as a self-mapping of $v_i$ up to a dilation by a factor $r$ and a shift of its phase by an angle $\theta$. One can further get rid of the assumption that $L$ is diagonalizable by consideration of the Jordan normal form.
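The identity $\operatorname{Tr}L = \sum_i \lambda_i$ is also easy to test numerically; the following sketch of mine does so for a random real matrix, which is generically diagonalizable:

```python
import numpy as np

rng = np.random.default_rng(1)
L = rng.standard_normal((5, 5))  # generically diagonalizable

# The eigenvalues come in complex conjugate pairs; their sum is real
# and equals the trace.
print(np.allclose(np.trace(L), np.linalg.eigvals(L).sum()))  # True
```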
The identification $\operatorname{Hom}_{\mathbb{C}}(V,V) \cong V\otimes V^*$
This final argument is also insightful in more abstract settings such as category theory. Note that there is a canonical identification $V\otimes V^* \cong \operatorname{Hom}_{\mathbb{C}}(V,V)$ given by $v\otimes \varphi \mapsto L_{v,\varphi}$ where $L_{v,\varphi}:V\to V$ is the map given by $L_{v,\varphi}(w) = \varphi(w)v$. Using this identification the trace may be defined by linearly extending $$\operatorname{Tr}L_{v,\varphi} = \varphi(v).$$
This has the following interpretation: “The trace is what you get when an operator eats itself.” Such a self-eating interpretation is particularly explicit in the diagrams which occur in the categorical treatment of the trace as may be found in this expository note by Kate Ponto and Michael Shulman.
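In coordinates, $L_{v,\varphi}$ is the rank-one matrix $v\varphi^T$ and the identity $\operatorname{Tr}L_{v,\varphi} = \varphi(v)$ can be checked directly; a minimal NumPy sketch of my own:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
v = rng.standard_normal(n)    # a vector in V
phi = rng.standard_normal(n)  # coordinates of a functional in V*

L = np.outer(v, phi)  # the rank-one operator w -> phi(w) v
print(np.allclose(np.trace(L), phi @ v))  # True: Tr L_{v,phi} = phi(v)
```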
Examples
Example. (Closed paths in a graph)
Let $G$ be a graph on $n$ vertices which may be directed and can possibly have loops. Let $A\in \{0,1 \}^{n\times n}$ denote the adjacency matrix of this graph. Then clearly $\operatorname{Tr}A$ equals the number of loops in the graph. More generally, one can check with a direct computation that for any $k\in \mathbb{Z}_{\geq 1}$ it holds that $\operatorname{Tr}A^k$ counts the number of closed walks of length $k$, that is, walks which return to their starting vertex. Loops and closed walks can both be viewed as a type of self-mapping consistent with the current interpretation.
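The following sketch (with a small hypothetical directed graph of my choosing) compares $\operatorname{Tr}A^k$ against a brute-force enumeration of closed walks:

```python
import numpy as np
from itertools import product

# A small hypothetical directed graph on three vertices.
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 1, 0]])
n, k = A.shape[0], 3

# Brute-force count of closed walks (v_0, ..., v_{k-1}, v_0).
count = sum(
    all(A[walk[i], walk[(i + 1) % k]] for i in range(k))
    for walk in product(range(n), repeat=k)
)
print(count, np.trace(np.linalg.matrix_power(A, k)))  # 3 3
```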
Example. ($y' = Ly$)
Equip $V$ with an inner product and consider the differential equation $y' = Ly$. Take a random initial condition $y(0)$ which is uniformly distributed on the unit sphere. Then, $$\mathbb{E}\Bigl[\frac{d}{dt}\bigl\langle y(0), y(t) \bigr\rangle\Big|_{t=0}\Bigr] = \mathbb{E}[\langle y(0), Ly(0) \rangle] = \frac{1}{n}\operatorname{Tr}L.$$ Here the final step used the isotropic structure of random vectors on the sphere: $\mathbb{E}[y(0)y(0)^T] = \frac{1}{n}\operatorname{Id}$. Thus, the trace can be interpreted as measuring the rate at which a random initial condition evolves in its own direction.
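A Monte Carlo sketch of this identity, with uniform sphere vectors obtained by normalizing Gaussian vectors (my own illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, samples = 5, 200_000
L = rng.standard_normal((n, n))

# Uniform points on the unit sphere: normalized Gaussian vectors.
y0 = rng.standard_normal((samples, n))
y0 /= np.linalg.norm(y0, axis=1, keepdims=True)

# <y(0), L y(0)> per sample; the mean should approach Tr(L)/n.
self_rates = np.einsum('si,ij,sj->s', y0, L, y0)
print(self_rates.mean(), np.trace(L) / n)  # approximately equal
```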
Note that the solutions to $y' = Ly$ are given by $y(t) = \exp(tL)y(0)$. The above interpretation consequently also helps to understand the following relation between the determinant and the trace: $$\frac{d}{dt}\det\exp(tL)\Big|_{t=0} = \operatorname{Tr}L .$$ This identity can namely be understood as stating that the first-order change in volume of a parallelepiped comes from the scaling of the edges.
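One can test this identity with a finite difference; the sketch below (mine, assuming SciPy is available for the matrix exponential) does exactly that:

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential

rng = np.random.default_rng(4)
L = rng.standard_normal((4, 4))

# Central finite difference of t -> det(exp(tL)) at t = 0.
h = 1e-6
derivative = (np.linalg.det(expm(h * L)) - np.linalg.det(expm(-h * L))) / (2 * h)
print(derivative, np.trace(L))  # approximately equal
```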
Example. (Lefschetz fixed point theorem)
Let $f:X\to X$ be a continuous map on a topological space. Assume moreover that the topological space is sufficiently nice; a compact smooth manifold will do the job for our purposes. Then, $f$ induces linear operators on the singular homology vector spaces $F_k:H_k(X,\mathbb{Q})\to H_k(X,\mathbb{Q})$ and the Lefschetz number of $f$ is defined by $$\Lambda_f := \sum_{k\geq 0}(-1)^k \operatorname{Tr}F_k.$$ This trace-based index relates to the number of fixed points. The Lefschetz fixed point theorem namely states that if $\Lambda_f \neq 0$ then $f$ has at least one fixed point $f(x) = x$. This theorem can be further strengthened by the Lefschetz-Hopf theorem which states that $\Lambda_f = \sum_{x\in \operatorname{Fix}(f)} i(f,x)$ where $i$ is the index of the fixed point. Note that the interpretation of the trace somewhat de-mystifies the occurrence of traces in these theorems. The trace namely has to do with self-mapping which is precisely what fixed points are about.
The normalized trace is the expected spectral measurement. (For self-adjoint operators.)
Often, the vector space $V$ comes equipped with an inner product $\langle \cdot, \cdot \rangle$. One is then typically only interested in those operators $A:V\to V$ which are self-adjoint. For this special case one can modify the foregoing interpretation and extract more intuition.
I will refer to $n^{-1}\operatorname{Tr}(\cdot)$ as the normalized trace. The key point is that if $\lambda_1,\ldots,\lambda_n$ are the eigenvalues of $L$ and we consider a random index $I\sim \operatorname{Unif}\{1,\ldots,n \}$ then $$\frac{1}{n}\operatorname{Tr}L = \mathbb{E}[\lambda_I].$$ This identity is just a restatement of what we already knew. As such, the identity does not necessarily give us anything new. I will however subsequently argue that it can be sensible to think of an eigenvalue $\lambda_i$ as a measurement of some system. In such a case we may view the normalized trace $n^{-1}\operatorname{Tr}L$ as the expected measurement.
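A short sketch of this identity, where sampling a uniform random index stands in for the expectation (my own illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6
A = rng.standard_normal((n, n))
A = (A + A.T) / 2  # self-adjoint, so the eigenvalues are real

eigvals = np.linalg.eigvalsh(A)
sampled = rng.choice(eigvals, size=100_000)  # lambda_I with I uniform
print(sampled.mean(), np.trace(A) / n)  # approximately equal
```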
Warning. This interpretation will typically not make sense for operators which are not self-adjoint. For a sensible notion of physical measurement we namely require that $\lambda_i$ is a real number.
The expected measurement in quantum mechanics
This argument unfortunately requires some background in quantum mechanics which may be confusing to the unfamiliar reader. I will now attempt a quick crash course with the basics which are required for our purposes.
Let $\mathcal{H}$ be a Hilbert space and denote by $\operatorname{Tr}$ the associated trace. For simplicity, I will moreover assume that the Hilbert space is finite-dimensional: $\operatorname{dim}\mathcal{H} = n$. This finite-dimensionality assumption would ordinarily be rather dubious since most physical systems require an infinite-dimensional Hilbert space. There are however also systems which can be described in the finite-dimensional setting and this will be sufficient for our purposes. A positive semidefinite linear operator $\rho:\mathcal{H}\to \mathcal{H}$ with $\operatorname{Tr}\rho =1$ will be called a state. A self-adjoint linear operator $A:\mathcal{H} \to \mathcal{H}$ will be called an observable.
As the name suggests the state $\rho$ describes the state of the system whereas the observable $A$ encodes information about measurements which can be made. One could for instance think of $A$ as encoding the behavior of a measuring device which has been attached to some physical system. More precisely, the possible measurements are given by the spectrum $\operatorname{spec}A \subseteq \mathbb{R}$. Suppose that the system finds itself in some state $\rho$ at the time of measurement. Then, the probability of reading out the value $\lambda_i$ on our measuring device is defined to be $$\mathbb{P}(\text{Measurement equals }\lambda_i \mid \rho) := \operatorname{Tr}(\rho P_{\lambda_i})$$ where $P_{\lambda_i}$ denotes the projection onto the eigenspace associated to the eigenvalue $\lambda_i$. Note as a sanity check that the probabilities add up to one since $$\sum_{\lambda_i\in \operatorname{spec}A} \operatorname{Tr}(\rho P_{\lambda_i}) = \operatorname{Tr}\rho = 1.$$ It was here used that $\operatorname{Tr}$ is linear and that the spectral projections $P_{\lambda_i}$ satisfy $\sum P_{\lambda_i} = \operatorname{Id}$.
Having completed the crash course we can proceed to our desired interpretation of the trace. Using the definition of the probability of measurements it namely follows that the expected measurement, given that the system is in state $\rho$, is given by $$\begin{align}\mathbb{E}[\text{Measurement}\mid \rho] &= \sum_{\lambda_i\in \operatorname{spec}A} \lambda_i \operatorname{Tr}(\rho P_{\lambda_i})\\ &= \operatorname{Tr}(\rho A).\end{align}$$ This gives us an interpretation of $\operatorname{Tr}(\rho A)$ as the expected measurement!
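The crash course fits in a few lines of NumPy. The sketch below (my own, with a randomly generated state and observable) checks that the measurement probabilities sum to one and that the expected measurement equals $\operatorname{Tr}(\rho A)$:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4

# A random state rho: positive semidefinite with unit trace.
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
rho = M @ M.conj().T
rho /= np.trace(rho).real

# A random observable A: self-adjoint.
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (B + B.conj().T) / 2

# Spectral decomposition; generically the eigenvalues are distinct so
# the spectral projections are rank-one.
eigvals, eigvecs = np.linalg.eigh(A)
probs = [np.trace(rho @ np.outer(eigvecs[:, i], eigvecs[:, i].conj())).real
         for i in range(n)]

print(np.isclose(sum(probs), 1.0))                                 # True
print(np.isclose(np.dot(eigvals, probs), np.trace(rho @ A).real))  # True
```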
We can also get an interpretation for $\operatorname{Tr}A$ itself. Let us namely assume that the system is at a high temperature so that its state $\rho$ is entirely random. This can be interpreted quite literally by imagining that the measurement concerns the magnetic moment of an atom. For a cold atom this magnetic moment could be pointing in some particular direction and would be encoded in $\rho$. If however the atom has been heated to a high temperature then its magnetic moment will point in an essentially random direction. Mathematically, the statement that the state is pointing in equal amounts in all directions corresponds to assuming that $\rho := n^{-1}\operatorname{Id}$. The probability of measuring $\lambda_i$ in a high-temperature system is therefore $$\mathbb{P}(\text{Measure }\lambda_i\mid \text{high temperature}) = \frac{1}{n}\operatorname{Tr}(P_{\lambda_i}).$$ It follows that the expected measurement satisfies $$\begin{align}\mathbb{E}[\text{Measurement}\mid \text{ high temperature}]&= \sum_{\lambda_i\in \operatorname{spec}A} \lambda_i \mathbb{P}(\text{Measure }\lambda_i \mid \text{high temperature})\\ &= \frac{1}{n}\operatorname{Tr}\Bigl(\sum_{\lambda_i \in \operatorname{spec}A} \lambda_i P_{\lambda_i}\Bigr)\\ &= \frac{1}{n}\operatorname{Tr}A.\end{align}$$ This identity precisely states that we may interpret the normalized trace as the expected measurement in a high-temperature system!
Examples
Example. (Tensor product)
The identity $\operatorname{Tr}(A\otimes B) = (\operatorname{Tr}A)(\operatorname{Tr}B)$ has a natural interpretation in the expectation-based viewpoint. It namely follows from the fact that $\mathbb{E}[XY] = \mathbb{E}[X] \mathbb{E}[Y]$ for any two independent random variables $X,Y$.
Let $\lambda_1(A),\ldots,\lambda_n(A)$ denote the eigenvalues of $A$ and similarly let $\lambda_1(B),\ldots,\lambda_n(B)$ denote the eigenvalues of $B$. Then the eigenvalues of $A\otimes B$ are given by $\lambda_{i}(A\otimes B) = \lambda_j(A)\lambda_k(B)$ with $i=j + (k-1)n$ and $j,k = 1,\ldots,n$. Let $I\sim \operatorname{Unif}\{1,\ldots,n^2 \}$ be a uniform random index and remark that we can decompose $I = J + (K-1)n$ with $J,K \sim \operatorname{Unif}\{1,\ldots,n \}$ independent random indices. Thus, $$\begin{align*}\frac{1}{n^2}\operatorname{Tr}(A\otimes B) &= \mathbb{E}[\lambda_I(A\otimes B)]\\ &= \mathbb{E}[\lambda_J(A)]\mathbb{E}[\lambda_K(B)]\\&=\Bigl(\frac{1}{n}\operatorname{Tr}A\Bigr)\Bigl(\frac{1}{n}\operatorname{Tr}B\Bigr). \end{align*}$$ This shows that the fact that the trace factorizes over tensor products can be interpreted as a consequence of the fact that the expected value factorizes over products of independent random variables.
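Numerically, the tensor product is realized by the Kronecker product `np.kron` in the standard basis, and the factorization is immediate to check (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# np.kron realizes the tensor product in the standard basis.
print(np.allclose(np.trace(np.kron(A, B)), np.trace(A) * np.trace(B)))  # True
```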
Example. (Trace norms)
The most commonly used norm on operators on an inner product space is the operator norm $\Vert L \Vert_{\operatorname{op}}:= \sup_{\Vert v \Vert = 1} \Vert Lv \Vert$. One can however also use the trace to define norms, known as trace norms. These trace norms are useful in proofs of statements which a priori have nothing to do with the trace.
One can define a trace norm for any real $p\geq 1$ and for all operators, which need not even be self-adjoint. For simplicity, let us however restrict ourselves to the case where $A:V\to V$ is a positive semidefinite operator and $p\in \mathbb{Z}_{\geq 1}$ is an integer. Then, the $p$th trace norm is given by $$\Vert A \Vert_p := \Bigl(\frac{1}{n} \operatorname{Tr}A^p\Bigr)^{1/p}.$$ Trace norms behave analogously to $\mathcal{L}^p$ norms in many respects. This should not be surprising given our interpretation of the trace as an expected measurement. Indeed, letting $I\sim \operatorname{Unif}\{1,\ldots,n\}$ we have that $$\Vert A \Vert_p = \Vert \lambda_I \Vert_{\mathcal{L}^p}.$$ Here, $\Vert \lambda_I \Vert_{\mathcal{L}^p} = \mathbb{E}[\lambda_I^p]^{1/p}$ is the $\mathcal{L}^p$ norm from classical probability.
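A quick NumPy check of the identity $\Vert A \Vert_p = \Vert \lambda_I \Vert_{\mathcal{L}^p}$ for a random positive semidefinite matrix (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 5, 3
M = rng.standard_normal((n, n))
A = M @ M.T  # positive semidefinite

trace_norm = (np.trace(np.linalg.matrix_power(A, p)) / n) ** (1 / p)
lp_norm = np.mean(np.linalg.eigvalsh(A) ** p) ** (1 / p)  # E[lambda_I^p]^(1/p)
print(np.allclose(trace_norm, lp_norm))  # True
```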
Example. (Free probability)
Free probability theory is a field invented by Dan Voiculescu around 1985, motivated by certain problems in operator algebras. A beautiful application of the theory is that it can provide limiting objects for large random matrices. One namely has theorems such as the free central limit theorem which help to explain why the Wigner semicircle law plays the same universal role for random matrices that the Gaussian distribution plays for classical random variables.
We have seen in the foregoing examples that the normalized trace behaves analogously to the expected value in classical probability. This viewpoint also occurs in the definition of a non-commutative probability space in free probability theory. The cyclic property $\operatorname{Tr}AB = \operatorname{Tr}BA$ here plays an important role. Noncommutativity namely makes some computations more difficult. Correspondingly, it is convenient to use the cyclic property as a substitute for commutativity.
Free probability theory is a beautiful subject but it would lead us too far to go into detail here. See this blog post by Terence Tao or the list of literature on Roland Speicher’s blog if you wish to learn more.
Remark. The trace is characterized up to rescaling as the unique linear map $\tau:\operatorname{Hom}_{\mathbb{C}}(V,V)\to \mathbb{C}$ with the property that $\tau(AB) = \tau(BA)$. One could correspondingly also adopt the following interpretation: “The trace is a linear map with the cyclic property.” In other words, the trace is useful because it has good algebraic properties.
The trace is a smooth dimension-counting function. (For positive semidefinite operators.)
If the vector space $V$ is equipped with an inner product and the operator under consideration is positive semidefinite then one can view $\operatorname{Tr}$ as a linearized version of the rank.
Firstly, note that this interpretation is exact if we assume that the operator $P:V\to V$ is a projection onto some subspace $S\subseteq V$. Then it is namely the case that $$\operatorname{Tr}P = \operatorname{dim}S = \operatorname{rank}P.$$ The interpretation as a linearized version of the rank is still quite sensible for positive semidefinite operators $\Sigma:V\to V$. Namely, observe that the rank simply counts the number of nonzero eigenvalues: $$\operatorname{rank}\Sigma = \#\{i\in \{1,\ldots,n \}: \lambda_i \neq 0 \}.$$ This quantity is however ill-behaved under small perturbations of the matrix $\Sigma$. Namely, imagine that there is some $k\ll n$ such that $\lambda_{i} = 1$ for $i \leq k$ and $\lambda_i = \varepsilon/(n-k)$ for some small $\varepsilon >0$ when $i>k$. Then, the matrix will be of full rank, $\operatorname{rank} \Sigma = n$, but nonetheless $$\operatorname{Tr} \Sigma = \sum_{i=1}^n \lambda_i = k + \varepsilon.$$ As such, $\operatorname{Tr}\Sigma$ better captures the fact that the matrix really only has $k$ eigenvalues which are far away from zero. This is to say that the trace can capture that the matrix is approximately living in a $k$-dimensional space.
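The following sketch of mine reproduces this phenomenon numerically with $n = 100$, $k = 5$ and $\varepsilon = 10^{-3}$:

```python
import numpy as np

n, k, eps = 100, 5, 1e-3
eigvals = np.concatenate([np.ones(k), np.full(n - k, eps / (n - k))])
Sigma = np.diag(eigvals)  # any rotation of this behaves the same

print(np.linalg.matrix_rank(Sigma))  # 100: the rank sees every tiny eigenvalue
print(np.trace(Sigma))               # 5.001 = k + eps
```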
Warning. This interpretation is less sensible for operators which are not positive. Then it could for instance occur that there are negative eigenvalues or even $\operatorname{Tr}L < 0$.
Example. (The effective rank)
The idea that the trace can be viewed as a smoother version of the rank has applications in statistics. It can namely occur that the performance of some estimation procedure degrades drastically when the dimension of the ambient space is large. The effective rank of a positive semidefinite operator $\Sigma:V\to V$ is defined by $$r(\Sigma) := \frac{\operatorname{Tr}\Sigma }{\Vert \Sigma \Vert_{\operatorname{op}}}.$$ This notion allows one to capture that the data effectively lives in a low-dimensional subspace of the high-dimensional ambient space, which can improve the performance of the estimation procedure.
For a concrete example where this notion is useful consider the problem of estimating the covariance matrix $\Sigma_X$ of some mean-zero Gaussian random vector $X$. If this vector lives in a high-dimensional space then this estimation problem may be difficult and consequently require many samples. On the other hand, in some cases it may be true that $X$ lives in some subspace $S\subseteq V$ with $\dim S \ll \dim V$ and the problem is less difficult. Consider for instance the case where only the first $3$ entries of a $10^{1000}$-dimensional vector are nonzero; although the ambient space is high-dimensional the problem of estimating its covariance matrix is not particularly difficult. The dimension of the subspace in which $X$ lives may be measured by $\operatorname{rank}\Sigma_X$. Nature is however not always so generous to ensure that $X$ lives inside of some low-dimensional subspace. A bit of noise is sufficient to destroy such nice structure. If for instance $X = Y + \mathcal{E}$ with $Y$ a Gaussian vector on some low-dimensional subspace $S$ and $\mathcal{E}$ a small isotropic noise term then $\operatorname{rank}\Sigma_X = n$. Since the trace is continuous it will however still hold that $$r(\Sigma_X) \leq \dim S + \varepsilon$$ for some small $\varepsilon>0$. Such a vector $X$ can consequently be thought of as being only $(\dim S + \varepsilon)$-dimensional instead of $n$-dimensional as the rank would suggest. This shows that the effective rank is a more robust way to capture how high-dimensional the random vector truly is.
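A sketch of this robustness (my own illustration, with a rank-$3$ signal plus small isotropic noise):

```python
import numpy as np

rng = np.random.default_rng(9)
n, s, noise = 1000, 3, 1e-4

# Covariance of X = Y + E: a rank-s signal plus small isotropic noise.
U = np.linalg.qr(rng.standard_normal((n, s)))[0]  # orthonormal basis of S
Sigma_X = U @ U.T + noise * np.eye(n)

effective_rank = np.trace(Sigma_X) / np.linalg.norm(Sigma_X, 2)
print(np.linalg.matrix_rank(Sigma_X), effective_rank)  # 1000 vs ~3.1
```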
Acknowledgements
Some but not all interpretations in this post may be found in the answers to this question on MathOverflow. The presentation has benefited from interactions with PhD students at the Saint-Flour summer school of 2022 and at the Bielefeld ZiF summer school of 2022. Special thanks to Oskar Prośniak, Haixiao Wang and Gianluca Kosmella.
At the time of writing this post I am supported by the grant OCENW.KLEIN.324 of the research programme Open Competition Domain Science – M which is partly financed by the Dutch Research Council (NWO).