# Solved – Robust estimates of the covariance matrix in the big data space

I am trying to compute robust estimates of the covariance matrix (and also the mean) in the big data setting. I am aware of the FastMCD and FastMVE (Minimum Covariance Determinant and Minimum Volume Ellipsoid) algorithms, but I do not know if any implementation is already available in Apache Spark or elsewhere. Would you kindly help?

Thanks and regards,


First of all, it is important to point out that, most likely, you will be using the FastMCD [0] or FastMVE [1, p. 199] algorithms, which are random approximations to the actual MCD and MVE estimators. The quality (and, specifically, the actual robustness to outliers) of these approximations, as well as the computational effort needed to obtain them, chiefly depends on a parameter that determines the number of random starts used by both algorithms.

Typically this number is set to 500 by default in most implementations, but it can be changed by the user. To guarantee maximal robustness of the algorithms, it should grow as \$O(2^p)\$ (see [2] for the exact formula), where \$p\$ is the number of variables in your data set. So using the prescribed number of subsets for values of \$p\$ beyond 30 is impractical, and one has to accept the resulting loss in robustness (the loss can be quantified using the formula in [2]).
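To get an intuition for why the required number of starts blows up with \$p\$, consider the classical elemental-subset argument: each random start is a \$(p+1)\$-subset, which is outlier-free with probability \$(1-\epsilon)^{p+1}\$ when a fraction \$\epsilon\$ of the data is contaminated. The exact formula in [2] is more refined; the sketch below is only the standard back-of-the-envelope version of this calculation, assuming 50% contamination and a 99% chance of drawing at least one clean subset:

```python
import math

def n_starts(p, eps=0.5, prob=0.99):
    """Number of random (p+1)-subsets needed so that, with probability
    `prob`, at least one subset is outlier-free when a fraction `eps`
    of the data is contaminated (back-of-the-envelope calculation)."""
    p_clean = (1.0 - eps) ** (p + 1)   # P(a single subset is outlier-free)
    return math.ceil(math.log(1.0 - prob) / math.log(1.0 - p_clean))

# the required number of starts grows exponentially in p
for p in (5, 10, 20, 30):
    print(p, n_starts(p))
```

For \$p = 5\$ a few hundred starts suffice, while for \$p = 30\$ the count runs into the billions, which is why the default of 500 starts implies a loss of robustness for larger \$p\$.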

On the other hand, for fixed \$p\$, the computational cost of obtaining the FastMCD and FastMVE fits grows sub-linearly in \$n\$ (the number of observations) thanks to a trick called random sub-sampling (see [0], section 3.3, for an explanation). Consequently, the computational cost of obtaining the FastMCD/FastMVE fits is essentially determined by \$p\$ (through the number of random starts). Now, despite being very similar in principle, the FastMVE algorithm is somewhat simpler than FastMCD and takes about a fourth of the time to compute for an equal number of starting subsets when \$n\$ is large.
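The workhorse of FastMCD is the concentration step (C-step) of [0]: refit the mean and covariance on the current subset, then keep the \$h\$ points with smallest Mahalanobis distance; each such step can only decrease the covariance determinant. Below is a minimal numpy sketch of the main loop (my own illustration; it omits the sub-sampling trick and the reweighting step of the full algorithm, so it is not a production implementation):

```python
import numpy as np

def c_step(X, idx, h):
    """One concentration step: refit mean/cov on the current subset,
    then keep the h points with smallest Mahalanobis distance."""
    mu = X[idx].mean(axis=0)
    S = np.cov(X[idx], rowvar=False)
    diff = X - mu
    d = np.einsum('ij,ji->i', diff @ np.linalg.inv(S), diff.T)
    return np.argsort(d)[:h]

def fast_mcd_sketch(X, h, n_starts=50, seed=None):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_det, best_idx = np.inf, None
    for _ in range(n_starts):
        idx = rng.choice(n, size=p + 1, replace=False)  # elemental start
        for _ in range(20):        # the determinant can only decrease
            new = c_step(X, idx, h)
            if set(new) == set(idx):
                break
            idx = new
        det = np.linalg.det(np.cov(X[idx], rowvar=False))
        if det < best_det:
            best_det, best_idx = det, idx
    return X[best_idx].mean(axis=0), np.cov(X[best_idx], rowvar=False)

# demo: 200 clean points and 40 gross outliers in 2 dimensions;
# the robust mean stays near the origin while the classical mean is dragged
rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((200, 2)),
               rng.standard_normal((40, 2)) + 10.0])
mu, S = fast_mcd_sketch(X, h=120, n_starts=50, seed=1)
```

The nested "random sub-sampling" extension in [0] runs this loop inside several subsets of the data and only carries the most promising candidates to the full sample, which is what makes the cost grow sub-linearly in \$n\$.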

(As an aside, the exact MCD and MVE estimators are not random, but they need on the order of \${n \choose p+1}\$ starting points, which, except in cases where both \$n\$ and \$p\$ are small, makes them of little practical use. Most implementations nevertheless allow you to ask for them; see the linked answer for an example of how to do this in `rrcov`.)
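To see the combinatorial explosion concretely, one can simply count the \${n \choose p+1}\$ elemental subsets the exact estimators must visit for a few modest problem sizes:

```python
from math import comb

# number of (p+1)-subsets the exact MCD/MVE must enumerate
for n, p in ((50, 2), (100, 5), (1000, 10)):
    print(f"n={n}, p={p}: {comb(n, p + 1)} subsets")
```

Already at \$n = 100\$, \$p = 5\$ the count exceeds a billion, so the exact estimators are only feasible for toy-sized problems.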

You will find good, open-source `C` implementations of FastMCD and FastMVE in the R package `rrcov`. Older implementations of FastMCD also exist in the Matlab library `Libra`.

Now, for problems with a number of variables comparable to yours, you might want to look at the OGK estimator [3]. The OGK is a robust estimator of scatter that shuns a key (and computationally expensive!) property of FastMCD and FastMVE: affine equivariance. In return, the OGK fit is much cheaper to compute (by between one and two orders of magnitude, depending on \$p\$ and \$n\$) and it is deterministic. An open-source `C` implementation of OGK is also available in the `R` package `rrcov`. See [2] for an empirical comparison of these methods (and several others).
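The reason OGK is so much cheaper is that it only ever works one coordinate (or pair of coordinates) at a time: robust pairwise covariances via the Gnanadesikan-Kettenring identity \$\mathrm{cov}(u,v) = \big(s(u+v)^2 - s(u-v)^2\big)/4\$ with a robust scale \$s\$, followed by an orthogonalization step. Here is my simplified numpy sketch of one such step, using the MAD as the robust scale (the full OGK of [3] also produces a location estimate and typically adds a reweighting step):

```python
import numpy as np

def mad(x):
    # median absolute deviation, scaled to be consistent at the normal
    return 1.4826 * np.median(np.abs(x - np.median(x)))

def ogk_scatter(X):
    """One orthogonalization step of a simplified OGK-style estimator:
    robust pairwise covariances via the Gnanadesikan-Kettenring identity,
    then an eigendecomposition and re-estimation of the scales in the
    rotated (roughly uncorrelated) coordinates."""
    n, p = X.shape
    sig = np.array([mad(X[:, j]) for j in range(p)])
    Y = X / sig                                   # robustly standardized
    U = np.eye(p)
    for j in range(p):
        for k in range(j + 1, p):
            U[j, k] = U[k, j] = 0.25 * (mad(Y[:, j] + Y[:, k]) ** 2
                                        - mad(Y[:, j] - Y[:, k]) ** 2)
    _, E = np.linalg.eigh(U)                      # orthogonalize
    Z = Y @ E
    gamma = np.array([mad(Z[:, j]) ** 2 for j in range(p)])
    A = sig[:, None] * E                          # back to original scale
    return A @ np.diag(gamma) @ A.T

# demo: correlated Gaussian data plus an outlier cluster that flips the
# sign of the classical covariance; the robust estimate keeps the sign
rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=300)
X = np.vstack([X, np.tile([6.0, -6.0], (30, 1))])
S = ogk_scatter(X)
```

Because every ingredient is a univariate robust scale, the cost is \$O(n p^2)\$ with small constants, no random starts, and the result is deterministic, at the price of losing affine equivariance.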

I would also note that, if you are willing to use a rank \$k\$ (\$k < p\$) approximation to the covariance matrix for a known value of \$k\$, a possibly appealing alternative would be to use a robust PCA method.
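As one simple illustration of this idea (my own sketch, using spherical/spatial-sign PCA, one of the simpler robust PCA variants, rather than any particular method from the references): projecting the centered points onto the unit sphere removes the leverage of outliers, the eigenvectors of the resulting sign covariance estimate the principal directions, and robust scales of the scores give the spread along them:

```python
import numpy as np

def mad(x):
    # median absolute deviation, scaled to be consistent at the normal
    return 1.4826 * np.median(np.abs(x - np.median(x)))

def spherical_pca_cov(X, k):
    """Rank-k robust covariance approximation via spherical (spatial-sign)
    PCA: directions from the sign covariance, spreads from robust scales
    of the projected scores."""
    med = np.median(X, axis=0)                  # coordinatewise median center
    Z = X - med
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    S = Z / np.where(norms > 0, norms, 1.0)     # spatial signs on the sphere
    _, vecs = np.linalg.eigh(S.T @ S)           # eigh sorts ascending
    V = vecs[:, ::-1][:, :k]                    # top-k robust directions
    scales = np.array([mad(Z @ V[:, j]) for j in range(k)])
    return med, V @ np.diag(scales ** 2) @ V.T  # rank-k scatter estimate

# demo: the true variance is concentrated along the first axis, but the
# planted outliers along the third axis would dominate classical PCA
rng = np.random.default_rng(2)
X = rng.standard_normal((300, 3)) * np.array([3.0, 0.3, 0.3])
X = np.vstack([X, np.tile([0.0, 0.0, 15.0], (20, 1))])
med, C = spherical_pca_cov(X, k=1)
```

This is only a sketch of the general idea; for serious use, a dedicated robust PCA implementation (e.g. those in `rrcov`) would be preferable.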

• [0] Rousseeuw, P. J. and Van Driessen, K. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41, 212–223.
• [1] Maronna, R. A., Martin, D. and Yohai, V. (2006). Robust Statistics: Theory and Methods. Wiley, New York.
• [2] Hubert, M., Rousseeuw, P. J. and Vakili, K. (2014). Shape bias of robust covariance estimators: an empirical study. Statistical Papers, 55, 15–28.
• [3] Maronna, R. A. and Zamar, R. H. (2002). Robust estimates of location and dispersion for high-dimensional datasets. Technometrics, 44(4), 307–317.
