# Solved – Variational Autoencoder and Covariance Matrix

Why does the encoder from a variational autoencoder map to a vector of means and a vector of standard deviations? Why does it not instead map to a vector of means and a covariance matrix?

Is it because we want our latent vectors to have zero covariance between components?


The diagonal covariance matrix is an explicit statement about the kind of latent representation the researcher wants the model to learn: a representation that can be modeled as independent Gaussians.

Additionally, @Firebug points out in the comments that a symmetric, positive definite (PD) matrix can be diagonalized without any loss of information. In other words, for a symmetric, PD matrix $$A$$, we can write $$A = PDP^\top$$ for some diagonal matrix $$D$$ and an orthogonal matrix $$P$$. This retains the same information in the sense that $$A$$ is merely rotated into a coordinate system whose components are uncorrelated.

Purely from the perspective of abstraction, there is no reason you must be limited to learning a latent representation that is composed of independent Gaussians. However, the computational side seems challenging.

The standard VAE encoder for a single sample produces latent parameters $$(\mu, \sigma)$$ from its input, then uses the re-parameterization trick to draw random samples from that distribution. There are $$d$$ elements in each of $$\mu$$ and $$\sigma$$, so the total number of latent parameters is $$2d$$.
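The diagonal re-parameterization can be sketched in a few lines of NumPy (the function name `reparameterize` is illustrative, not a library API):

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Draw z ~ N(mu, diag(sigma^2)) via the re-parameterization trick.

    Sampling eps from a fixed N(0, I) and transforming it deterministically
    keeps the randomness outside the learned parameters, so gradients can
    flow through mu and sigma in a real framework.
    """
    eps = rng.standard_normal(mu.shape)  # eps ~ N(0, I), parameter-free noise
    return mu + sigma * eps              # elementwise: z_i = mu_i + sigma_i * eps_i

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0, -2.0])    # d = 3, so 2d = 6 latent parameters total
sigma = np.array([1.0, 0.5, 0.1])
z = reparameterize(mu, sigma, rng)  # one latent sample of dimension d
```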

An alternative model that includes a covariance matrix would need some method to produce one, so the output of the encoder becomes $$(\mu, \Sigma)$$.

If your latent space has dimension $$d$$, you're doing inference on each of the $$d$$ elements of $$\mu$$ and each of the $$\frac{d(d+1)}{2}$$ free elements of $$\Sigma$$ (because $$\Sigma$$ is symmetric by definition), for a total of $$\frac{d(d+3)}{2}$$ elements. Whenever you have more than one latent dimension, the covariance model therefore has more latent parameters to learn than the diagonal model.
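A quick sanity check of these counts (the helper names are illustrative):

```python
def diagonal_params(d):
    # mu has d entries and sigma has d entries: 2d total
    return 2 * d

def full_cov_params(d):
    # mu has d entries; a symmetric Sigma has d*(d+1)/2 free entries,
    # giving d + d*(d+1)/2 = d*(d+3)/2 in total
    return d + d * (d + 1) // 2

# The gap grows quadratically with the latent dimension
for d in (1, 2, 8, 64):
    print(d, diagonal_params(d), full_cov_params(d))
```

Note that at $$d = 1$$ the two models coincide (2 parameters each), consistent with the claim that the full-covariance model only has more parameters once $$d > 1$$.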

Furthermore, the multivariate normal distribution requires that $$\Sigma$$ be positive definite, so we must somehow guarantee that, for each sample, we generate a PD matrix. (Using an alternative strategy, such as factorizing into standard deviations and a correlation matrix $$\Omega$$, i.e. $$\Sigma = \operatorname{diag}(\sigma)\,\Omega\,\operatorname{diag}(\sigma)$$, will increase the number of effective parameters without solving the PD problem, since now we must guarantee that $$\Omega$$ is PD.)
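One common way to sidestep the PD constraint (a sketch, not the only strategy) is to have the encoder emit the unconstrained entries of a Cholesky factor $$L$$ and set $$\Sigma = LL^\top$$, forcing the diagonal of $$L$$ positive; the helper name `build_covariance` is illustrative:

```python
import numpy as np

def build_covariance(raw, d):
    """Map an unconstrained vector of length d*(d+1)/2 to a PD covariance.

    Fill a lower-triangular L, force its diagonal positive with exp,
    and return Sigma = L @ L.T, which is PD by construction.
    """
    L = np.zeros((d, d))
    L[np.tril_indices(d)] = raw                  # fill lower triangle row by row
    L[np.diag_indices(d)] = np.exp(np.diag(L))   # strictly positive diagonal
    return L @ L.T

d = 3
raw = np.random.default_rng(1).standard_normal(d * (d + 1) // 2)
Sigma = build_covariance(raw, d)
assert np.all(np.linalg.eigvalsh(Sigma) > 0)  # PD: all eigenvalues positive
```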

Additionally, we must be able to backpropagate through that procedure so that the encoder weights can be updated. This may or may not be possible, depending on the strategy used to generate $$\Sigma$$ and to draw a deviate from the multivariate Gaussian.
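For concreteness, the full-covariance analogue of the re-parameterization trick is the deterministic transform $$z = \mu + L\epsilon$$ with $$\Sigma = LL^\top$$, which is differentiable in $$\mu$$ and $$L$$; a NumPy sketch (the particular $$L$$ below is an arbitrary illustrative Cholesky factor):

```python
import numpy as np

# Full-covariance re-parameterization: z = mu + L @ eps, where Sigma = L @ L.T.
# The transform is a deterministic, differentiable function of mu and L, so in
# a real framework gradients would flow through it as in the diagonal case.
rng = np.random.default_rng(2)
d = 3
mu = np.zeros(d)
L = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.2, 0.3, 1.0]])    # lower-triangular with positive diagonal

eps = rng.standard_normal((10_000, d))  # batch of N(0, I) draws
z = mu + eps @ L.T                      # batch of samples from N(mu, L @ L.T)

emp_cov = np.cov(z, rowvar=False)       # should approach Sigma = L @ L.T
```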

These three issues (more parameters, guaranteeing positive definiteness, and ensuring differentiability) are all challenging.

If you're contemplating undertaking research to overcome these challenges, that's great! But one must ask, why is this a good model? What problems does it solve which are not solved by the diagonal Gaussian VAE model, or an alternative non-Gaussian VAE model (e.g. a Dirichlet VAE)?
