A Modern Analysis of Hutchinson’s Trace Estimator.
Abstract
The paper establishes the new state-of-art in the accuracy analysis of Hutchinson’s trace estimator. Leveraging tools that have not been previously used in this context, particularly hypercontractive inequalities and concentration properties of sub-gamma distributions, we offer an elegant and modular analysis, as well as numerically superior bounds. Besides these improvements, this work aims to better popularize the aforementioned techniques within the CS community.
Index Terms:
Randomized algorithms, Trace estimation, Hutchinson’s estimator, Monte Carlo methods, Hypercontractive inequalities, Sub-gamma distributionsI Introduction
I-A Background
Estimating the matrix trace is a problem of fundamental interest [1, 2, 3] and arises in many problems such as approximating spectral properties of matrices [4, 5, 6], solving partial differential equations [7, 8, 9, 10, 11], error evaluation in machine-learning [12], and combinatorial counting [13] to mention a few. For readers particularly interested in data science or optimization, it is of critical interest as the hessian trace stores valuable information about curvature. To give a meaningful example, consider fitting a linear classifier on the MNIST dataset [14] of hand-written digits, widely used as a benchmark and a toy problem in machine learning. Since images are in resolution and grouped in classes, the number of parameters is , and the size of the hessian matrix is . The diagonal average, proportional to the trace, equals the average eigenvalue (valuable information), whereas individual rows or columns are nearly zero (up to random noise). This is illustrated in Footnote 2.
As seen from the above example, already toy examples lead to huge matrices; for larger problems storing and inspecting such matrices is impossible. Fortunately, it is possible to estimate the trace (equal to the diagonal sum, or equivalently to the sum of eingenvalues) without knowing the full matrix. The requirement is that one can compute efficiently matrix-vector products; in the case of the hessian such products can be computed by by auto-differentiation [15] in all popular machine-learning frameworks such as Tensorflow [16], PyTorch [17] and JAX [18]. Under the assumption that an matrix is symmetric positive semi-definite (which applies to all hessians), the popular estimator due to Hutchinson [1] is
(1) |
where the components of the vector are Rademacher random variables, equal to or with equal probability. The above estimator uses only one sampled and thus is noisy (of high-variance), so it is usually boosted by averaging independent trials (for suitably large ). Formally:
(2) |
The estimators are unbiased, that is correct on average:
(3) |
and this is fairly easy to prove. The focus of this work is on a more challenging problem of understanding their concentration properties. Here we ask the following question:
What sample size guarantees the desired relative error at the confidence level of ?
Formally, for the error , and fixed , we want possibly small such that
(4) |
I-B Related Work
The estimator is already more than 30 years old [1], and although alternatives exist (such as methods based on Lanczos quadrature [19]), it is provably best in terms of variance [20] and arguably wins in simplicity, being preferred by developers of statistical/learning software [21]. Quite surprisingly, it had been used for a while without a rigorous assessment of accuracy, until the work of Avron and Toledo [3], who established the finite sample size guarantees. Their result was later improved by Roosta-Khorasani and Ascher [20], who essentially got rid of the lossy proof step involving a crude union bound. Their approach is based on the Chernoff-like direct calculations, and offers the bound .
II Contribution
II-A Summary
In this work we offer a modular, modern and more accurate analysis of Hutchinson’s estimator. The improvements upon prior works can be summarized as follows:
-
•
Modularity and Novelty of Techniques. In the first step, we explicitly link the estimator accuracy to the dispersion of the Rademacher Quadratic Chaos with a unitary matrix kernel. We then obtain a bound on such chaoses by means of Hypercontractive Inequality, an important result in Boolean Fourier Analysis [22], originally due to Aline Bonami [23]. As the final step we express this bound as the sub-gamma property (see the works of Boucheron [24]), which makes it convenient (and accurate) to conclude the final concentration result. Moreover, we state our results for the multiple-sample and one-shot estimators; this is of interest, since the base building block may be boosted differently than by averaging (see for example the median algorithm [25]).
Thus, in contrast to ad hoc calculations in prior works, we are able to use established tools from the field of high-dimensional probability and Fourier analysis. Our transparent approach opens a way for further refinements.
-
•
Superior Accuracy. With the dedicated tools we obtain bounds that are numerically better up to an order of magnitude. They are stated as elementary formulas, convenient to use in practical applications of trace estimation.
II-B Preliminaries
Below we explain the notation and definitions used when formulating our results presented in the next section.
The -th norm of a r.v. is ; it is a valid norm (e.g. it obeys the triangle inequality) when [25]. A useful tool for studying concentration is the moment generating function, defined as (a function of real parameter ). In modern high-dimensional probability one classifies distributions according to the behaviour of ; in particular we call sub-gamma with variance factor and scale when , and denote by [24]; the same formula holds for the centered gamma distribution, hence the name.
II-C Results
In what follows we assume that is non-zero symmetric positive semi-definite matrix. Then, necessarily, .
II-C1 One-Shot Estimator
Below we present the following bound on the accuracy of the estimator in (1):
Theorem 1 (Hutchison’s Estimator 1-Sample Bound).
For any integer , the relative error of Hutchison’s Estimator (1) obeys the following bound
(5) |
In particular, this implies the sub-gamma behaviour
(6) |
which in turns gives the following probability tail bound
(7) |
The proof goes along the following lines: a) we express the error in form of Rademacher chaos b) we use hypercontractive inequalities to bound its moment; this establishes the part (5) c) we plug the moment bound into Taylor’s expansion of the moment generating function and then estimate the series, arriving at the sub-gamma condition (6). By the tail properties of sub-gamma distributions, we conclude the tail bound (7).
II-C2 Multiple-Sample Estimator
Next, we move to the multiple-sample estimator in (2). We establish the following
Theorem 2 (Hutchinson’s Estimator -Sample Bound).
The relative error of -sample Hutchinson’s Estimator has the following sub-gamma behaviour
(8) |
In particular, this implies the following bound on the error tail probability, valid for any :
(9) |
This result is obtained from Theorem 1, by using extra facts on the concentration of sums of sub-gamma random variables: essentially the variance factor decreases form to for sums of length (just as the variance of iid terms).
Corollary 1 (Sample Size for Hutchinson’s Estimator).
The relative error is absolutely bounded by with probability , provided that the sample size is bigger or equal to
(10) |
This holds for any and .
II-D Numerical Comparison with Related Work
Below we demonstrate numerical improvements with respect to the prior works [3, 20]. The bounds, with exact constants, are summarized in Table I below.
Author | Sample Size |
---|---|
this work | |
[20] | |
[3] |
Furthermore, in Figure 2, we present a detailed evaluation of how the sample size depends on the error , when the confidence parameter is fixed. The setup for this experiment is the discussed hessian of a toy classifier on MNIST.
III Auxiliary Results
III-1 Matrix Theory
We will need the following fact on the decomposition of symmetric matrices [26]. Recall that a matrix is orthonormal when ( denotes transposition); in particular, the columns and rows of are of unit length.
Lemma 1 (Symmetric Matrix Factorization).
Any symmetric real matrix can be written as where is an orthonormal real matrix and is diagonal, consisting of (necessarily real) eigenvalues of .
III-2 Quadratic Chaos
We will need the following special case of the celebrated Hypercontractivity Inequality [23, 22]:
Lemma 2 ((-Hypercontractivity).
Let be a polynomial of degree in Rademacher variables (e.g. for some weights ). Then
(11) |
III-3 Sub-Gamma Random Variables
The tail probability of sub-gamma distributions can be estimated from the MGF bound by the Crammer-Chernoff method, as follows:
Lemma 3 (Sub-Gamma Tails).
Let , then
(12) |
Moreover, for sub-gamma distributions the following composition property holds for sums of independent terms:
Lemma 4 (Sub-Gamma Sums).
Let random variables be independent, then
(13) |
The proofs of these lemmas can be found in [24].
IV Proofs
IV-A Proof of Theorem 1
Define the following random variable
(14) |
then our task is to bound the concentration of .
IV-A1 Reduction to Off-Diagonal Quadratic Chaos
Since is symmetric, by Lemma 1 we have where is orthonormal and is diagonal made of the eigenvalues of . Since, in addition, the matrix is positive semi-definite, its eigenvalues are non-negative. Thus, we can write
(15) |
This can be decomposed as
(16) |
To further simplify, denote by all the eigenvalues of ; these are precisely the diagonal entries of . We note that , and (because ). Therefore we obtain
(17) |
Thus, every is a quadratic form in with zero-diagonal.
IV-A2 Bounding Quadratic Chaos
For every fixed we apply Lemma 2 to , which is a polynomial of degree in Rademacher random variables (note that the off-diagonal property guarantees the degree is exactly two).
Since the random variables indexed by tuples for are uncorrelated, we have ; but is orthonormal, so and . Thus, we obtain
(18) |
By the triangle inequality
(19) |
Finally by the definition of and the standard identity from linear algebra [26]
(20) |
Dividing the both sides of this inequality by , we complete the proof of the first part of Theorem 1.
IV-A3 Bounding Moment Generating Function
Let be the relative error of the estimator. Then by the previous result, the Taylor expansion , and the fact that we have
(21) |
Let . Then we have
(22) |
This ratio is decreasing when , and thus
(23) |
Therefore, it follows that
(24) |
which, by the standard inequality , implies
(25) |
so that, by definition, . This completes the proof of the second part of Theorem 1.
IV-A4 Concluding Concentration Properties
IV-B Proof of Theorem 2
IV-C Proof of Corollary 1
The corollary follows by equating the right-hand side of the tail bound from Theorem 2, and equating with . Solving with respect to gives the claimed formula on the sample size.
V Conclusion
This work establishes a new state-of-art bound on the classical trace estimator due to Hutchinson. The core idea is the clever usage of bounds on Rademacher Chaos and the theory of sub-gamma distributions. The results are numerically superior and of immediate interest to any problems involving trace estimation. Besides, the author hopes to contribute to better popularization of the discussed techniques, novel in the problem context, within the computer science community.
References
- [1] M. F. Hutchinson, “A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines,” Communications in Statistics-Simulation and Computation, vol. 18, no. 3, pp. 1059–1076, 1989.
- [2] Z. Bai, G. Fahey, and G. Golub, “Some large-scale matrix computation problems,” Journal of Computational and Applied Mathematics, vol. 74, no. 1-2, pp. 71–89, 1996.
- [3] H. Avron and S. Toledo, “Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix,” J. ACM, vol. 58, no. 2, Apr. 2011. [Online]. Available: https://doi.org/10.1145/1944345.1944349
- [4] L. Lin, Y. Saad, and C. Yang, “Approximating spectral densities of large matrices,” SIAM review, vol. 58, no. 1, pp. 34–65, 2016.
- [5] I. Han, D. Malioutov, H. Avron, and J. Shin, “Approximating spectral sums of large-scale matrices using stochastic chebyshev approximations,” SIAM Journal on Scientific Computing, vol. 39, no. 4, pp. A1558–A1585, 2017.
- [6] E. Di Napoli, E. Polizzi, and Y. Saad, “Efficient estimation of eigenvalue counts in an interval,” Numerical Linear Algebra with Applications, vol. 23, no. 4, pp. 674–692, 2016.
- [7] E. Haber, M. Chung, and F. Herrmann, “An effective method for parameter estimation with pde constraints with multiple right-hand sides,” SIAM Journal on Optimization, vol. 22, no. 3, pp. 739–757, 2012.
- [8] K. van den Doel and U. Ascher, “Adaptive and stochastic algorithms for eit and dc resistivity problems with piecewise constant solutions and many measurements,” SIAM J. Scient. Comput, vol. 34, p. 29, 2012.
- [9] J. Young and D. Ridzal, “An application of random projection to parameter estimation in partial differential equations,” SIAM Journal on Scientific Computing, vol. 34, no. 4, pp. A2344–A2365, 2012.
- [10] F. Roosta-Khorasani, K. Van Den Doel, and U. Ascher, “Stochastic algorithms for inverse problems involving pdes and many measurements,” SIAM Journal on Scientific Computing, vol. 36, no. 5, pp. S3–S22, 2014.
- [11] T. van Leeuwen, A. Y. Aravkin, and F. J. Herrmann, “Seismic waveform inversion by stochastic optimization,” International Journal of Geophysics, vol. 2011, 2011.
- [12] G. H. Golub and U. Von Matt, “Generalized cross-validation for large-scale problems,” Journal of Computational and Graphical Statistics, vol. 6, no. 1, pp. 1–34, 1997.
- [13] H. Avron, “Counting triangles in large graphs using randomized matrix trace estimation,” in Workshop on Large-scale Data Mining: Theory and Applications, vol. 10, 2010, pp. 10–9.
- [14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
- [15] B. Christianson, “Automatic hessians by reverse accumulation,” IMA Journal of Numerical Analysis, vol. 12, no. 2, pp. 135–150, 1992.
- [16] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine learning,” in 12th USENIX symposium on operating systems design and implementation (OSDI 16), 2016, pp. 265–283.
- [17] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” in Advances in neural information processing systems, 2019, pp. 8026–8037.
- [18] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang, “JAX: composable transformations of Python+NumPy programs,” 2018. [Online]. Available: http://github.com/google/jax
- [19] S. Ubaru, J. Chen, and Y. Saad, “Fast estimation of tr(f(a)) via stochastic lanczos quadrature,” SIAM Journal on Matrix Analysis and Applications, vol. 38, no. 4, pp. 1075–1099, 2017.
- [20] F. Roosta-Khorasani and U. Ascher, “Improved bounds on sample size for implicit matrix trace estimators,” Foundations of Computational Mathematics, vol. 15, no. 5, pp. 1187–1212, 2015.
- [21] Z. Yao, A. Gholami, K. Keutzer, and M. Mahoney, “Pyhessian: Neural networks through the lens of the hessian,” arXiv preprint arXiv:1912.07145, 2019.
- [22] R. O’Donnell, Analysis of boolean functions. Cambridge University Press, 2014.
- [23] A. Bonami, “Ensembles dans le dual de ,” Annales de l’institut Fourier, vol. 18, no. 2, pp. 193–204, 1968. [Online]. Available: http://eudml.org/doc/73956
- [24] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford, 2013. [Online]. Available: https://books.google.at/books?id=koNqWRluhP0C
- [25] R. Vershynin, High-dimensional probability: An introduction with applications in data science. Cambridge university press, 2018, vol. 47.
- [26] D. Mello, Invitation to Linear Algebra, ser. Invitation to Linear Algebra. CRC Press, 2017, no. v. 978, nos. 1-7952. [Online]. Available: https://books.google.at/books?id=wEI-MQAACAAJ