Why does the encoder from a variational autoencoder map to a vector of means and a vector of standard deviations? Why does it not instead map to a vector of means and a **covariance matrix**?

Is it because we want our latent vectors to have zero covariance between components?


#### Best Answer

The diagonal covariance matrix is an explicit statement about the kind of latent representation the researcher wants the model to learn: a representation that can be modeled as independent Gaussians.

Additionally, @Firebug points out in the comments that a symmetric, PD matrix can be diagonalized without any loss of information. In other words, for a symmetric, PD matrix $A$, we can write $A = PDP^\top$, where $D$ is a diagonal matrix and $P$ can be chosen to be orthogonal. This retains the same information in the sense that $A$ is merely rotated into orthogonal coordinates.
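As a quick numerical check of that claim (the matrix below is an arbitrary example, not anything from the model):

```python
import numpy as np

# An arbitrary symmetric, positive-definite matrix.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])

# For symmetric input, eigh returns real eigenvalues and orthonormal eigenvectors.
eigvals, P = np.linalg.eigh(A)
D = np.diag(eigvals)

# A = P D P^T up to floating-point error: the same information,
# expressed in a rotated, orthogonal coordinate system.
assert np.allclose(A, P @ D @ P.T)
```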

Purely from the perspective of abstraction, there is no reason you must be limited to learning a latent representation that is composed of independent Gaussians. However, the computational side seems challenging.

The standard VAE encoder maps each input sample to latent parameters $(\mu, \sigma)$, and then uses the re-parameterization trick to draw random samples from that distribution. There are $d$ elements in each of $\mu$ and $\sigma$, so the total number of latent parameters is $2d$.
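As a minimal PyTorch sketch of that encoder head (the class and attribute names here are illustrative assumptions, not a reference implementation):

```python
import torch
import torch.nn as nn

class DiagonalGaussianEncoder(nn.Module):
    """Maps x to (mu, log sigma^2): 2d latent parameters per sample."""
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.mu = nn.Linear(input_dim, latent_dim)
        self.logvar = nn.Linear(input_dim, latent_dim)

    def forward(self, x):
        mu, logvar = self.mu(x), self.logvar(x)
        # Re-parameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
        # so the randomness sits in eps and gradients flow through mu and sigma.
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        return z, mu, logvar
```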

An alternative model that includes a covariance matrix would need some method to produce one, so the output of the encoder would be $(\mu, \Sigma)$.

If your latent space has dimension $d$, you're doing inference on each of the $d$ elements of $\mu$ and each of the $\frac{d(d+1)}{2}$ unique elements of $\Sigma$ (because $\Sigma$ is symmetric by definition), for a total of $\frac{d(d+3)}{2}$ elements. Any time you have more than 1 latent dimension, the covariance-matrix model has more latent parameters to learn than the diagonal model.
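To make the comparison concrete (a throwaway helper, with the counts shown in the comment):

```python
def latent_param_counts(d):
    """Latent parameters per sample: diagonal vs. full-covariance Gaussian."""
    diagonal = 2 * d             # d means + d standard deviations
    full = d + d * (d + 1) // 2  # d means + unique elements of a symmetric Sigma
    return diagonal, full

for d in (2, 16, 128):
    print(d, latent_param_counts(d))  # (4, 5), (32, 152), (256, 8384)
```

The gap grows quadratically: at $d = 128$ the full-covariance encoder must produce 8384 latent parameters per sample versus 256 for the diagonal model.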

Furthermore, the multivariate normal distribution requires that $\Sigma$ be positive definite, so we must somehow guarantee that, for each sample, we generate a PD matrix. (Using an alternative strategy, such as factorizing into standard deviations and a correlation matrix $\Omega$, i.e. $\Sigma = \operatorname{diag}(\sigma)\,\Omega\,\operatorname{diag}(\sigma)$, increases the number of effective parameters without solving the PD problem, since now we must guarantee that $\Omega$ is PD.)

Additionally, we must be able to backpropagate through that whole procedure so that the encoder weights can be updated. This may or may not be possible, depending on the strategy used to generate $\Sigma$ and to draw a deviate from the multivariate Gaussian.
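One well-known way to address both points (my sketch, not something from the original answer) is to have the encoder emit a lower-triangular Cholesky factor $L$ with a positive diagonal, so that $\Sigma = LL^\top$ is PD by construction and the draw $z = \mu + L\epsilon$ remains a differentiable function of the encoder outputs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FullCovarianceEncoder(nn.Module):
    """Hypothetical encoder head producing (mu, Sigma) via a Cholesky factor L."""
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.d = latent_dim
        self.mu = nn.Linear(input_dim, latent_dim)
        # Strictly-lower-triangular entries of L are unconstrained...
        self.off_diag = nn.Linear(input_dim, latent_dim * (latent_dim - 1) // 2)
        # ...while the diagonal gets a positivity constraint below.
        self.raw_diag = nn.Linear(input_dim, latent_dim)

    def forward(self, x):
        mu = self.mu(x)
        # Assemble L; softplus keeps its diagonal positive, so Sigma = L L^T is PD.
        L = x.new_zeros(x.shape[0], self.d, self.d)
        rows, cols = torch.tril_indices(self.d, self.d, offset=-1)
        L[:, rows, cols] = self.off_diag(x)
        diag = torch.arange(self.d)
        L[:, diag, diag] = F.softplus(self.raw_diag(x)) + 1e-6
        # Re-parameterized draw from N(mu, L L^T): z = mu + L eps, eps ~ N(0, I).
        # Every step is differentiable, so backprop reaches the encoder weights.
        eps = torch.randn_like(mu)
        z = mu + (L @ eps.unsqueeze(-1)).squeeze(-1)
        return z, mu, L
```

Even with PD and differentiability handled this way, the $\frac{d(d+3)}{2}$ outputs per sample remain, so the parameter-count objection still stands.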

These three issues (more latent parameters, guaranteeing positive definiteness, and preserving differentiability) are what make the full-covariance model challenging.

If you're contemplating undertaking research to overcome these challenges, that's great! But one must ask, why is this a good model? What problems does it solve which are not solved by the diagonal Gaussian VAE model, or an alternative non-Gaussian VAE model (e.g. a Dirichlet VAE)?
