**Prelims**

In the article *Sequential updating of conditional probabilities on directed graphical structures*, Spiegelhalter and Lauritzen give an approximation to the expectation of a logistic-transformed Gaussian random variable $\theta \sim N(\mu, \sigma^2)$. This uses the Gaussian cdf $\Phi$ in the approximation

$$ \frac{\exp(\theta)}{1 + \exp(\theta)} \approx \Phi(\epsilon \theta) $$

for an appropriately chosen $\epsilon$ (in their case they chose $\epsilon = 0.607$). Hence

$$ \mathbb{E}\left[\frac{\exp(\theta)}{1 + \exp(\theta)}\right] \approx \int_{-\infty}^{\infty} \Phi(\epsilon \theta)\, \phi(\theta \mid \mu, \sigma^2)\, d\theta $$

where $\phi$ is the Gaussian pdf. The integral can be written as

$$ \int_{-\infty}^{\infty} \Pr(U < 0 \mid \theta)\, \phi(\theta \mid \mu, \sigma^2)\, d\theta $$

where $U \mid \theta \sim N(-\theta, \epsilon^{-2})$, and the integral is then simply the marginal probability $\Pr(U < 0)$. Note that as $\theta \sim N(\mu, \sigma^2)$, marginally $U \sim N(-\mu, \sigma^2 + \epsilon^{-2})$. Hence

$$ \mathbb{E}\left[\frac{\exp(\theta)}{1 + \exp(\theta)}\right] \approx \Pr(U < 0) = \Phi\left(\frac{\mu}{\sqrt{\sigma^2 + \epsilon^{-2}}}\right) $$

We can then use the initial approximation in the reverse direction to get

$$ \mathbb{E}\left[\frac{\exp(\theta)}{1 + \exp(\theta)}\right] \approx \frac{\exp(c\mu)}{1 + \exp(c\mu)} $$

where $c = (1 + \epsilon^2 \sigma^2)^{-1/2}$. (Note that $\mu / \sqrt{\sigma^2 + \epsilon^{-2}} = \epsilon c \mu$, so the argument of $\Phi$ has exactly the form needed to apply the first approximation backwards.)
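As a quick numerical sanity check (a sketch of my own, not from the article; the values of $\mu$ and $\sigma$ below are arbitrary), the probit and logistic forms of the approximation can be compared against a Monte Carlo estimate:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma = 0.5, 1.2
eps = 0.607  # constant from the question

# Monte Carlo estimate of E[logistic(theta)] for theta ~ N(mu, sigma^2)
theta = rng.normal(mu, sigma, 1_000_000)
mc = np.mean(1.0 / (1.0 + np.exp(-theta)))

# Probit form: Phi(mu / sqrt(sigma^2 + eps^-2))
probit = norm.cdf(mu / np.sqrt(sigma**2 + eps**-2))

# Logistic form with shrunken mean: logistic(c * mu)
c = (1.0 + eps**2 * sigma**2) ** -0.5
logit = 1.0 / (1.0 + np.exp(-c * mu))

print(mc, probit, logit)  # the three values should be close
```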

**Question**

My question is: are there any approximations to the expectation of a softmax transformation of multivariate Gaussian variables? In particular, let

$$ \boldsymbol{Z} \sim MVN(\boldsymbol{\mu}, \Sigma), \qquad \boldsymbol{Z} \in \mathbb{R}^{n} $$

Define the $k$ activations, one per discrete outcome, as

$$ f_i(\boldsymbol{Z}, \boldsymbol{w}_i) = \boldsymbol{w}_i^T \boldsymbol{Z} $$

Finally, define our softmax-transformed activations as

$$ P_i(\boldsymbol{Z}) = \frac{\exp(f_i(\boldsymbol{Z}, \boldsymbol{w}_i))}{\sum_{j=1}^k \exp(f_j(\boldsymbol{Z}, \boldsymbol{w}_j))} $$

What I want is an estimate of the expectation

$$ \mathbb{E}[P_i(\boldsymbol{Z})] $$

Note that in the case $k=2$, we have

$$ P_1(\boldsymbol{Z}) = \frac{\exp(f_1(\boldsymbol{Z}, \boldsymbol{w}_1))}{\exp(f_1(\boldsymbol{Z}, \boldsymbol{w}_1)) + \exp(f_2(\boldsymbol{Z}, \boldsymbol{w}_2))} $$

Therefore

$$ P_1(\boldsymbol{Z}) = \frac{\exp(f_1(\boldsymbol{Z}, \boldsymbol{w}_1) - f_2(\boldsymbol{Z}, \boldsymbol{w}_2))}{\exp(f_1(\boldsymbol{Z}, \boldsymbol{w}_1) - f_2(\boldsymbol{Z}, \boldsymbol{w}_2)) + 1} $$

and as $f_1(\boldsymbol{Z}, \boldsymbol{w}_1) - f_2(\boldsymbol{Z}, \boldsymbol{w}_2) = (\boldsymbol{w}_1 - \boldsymbol{w}_2)^T \boldsymbol{Z}$ is a linear combination of jointly Gaussian random variables, it is itself Gaussian distributed. Hence we can use the initial approximation.
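This $k = 2$ reduction is easy to check numerically. The sketch below (with made-up values for $\boldsymbol{\mu}$, $\Sigma$, $\boldsymbol{w}_1$, $\boldsymbol{w}_2$) forms the Gaussian difference $(\boldsymbol{w}_1 - \boldsymbol{w}_2)^T \boldsymbol{Z}$ and applies the univariate approximation from the prelims:

```python
import numpy as np

rng = np.random.default_rng(1)
eps = 0.607

# Hypothetical problem setup: n = 3, k = 2
mu = np.array([0.2, -0.1, 0.4])
A = rng.normal(size=(3, 3))
Sigma = A @ A.T                      # random positive semi-definite covariance
w1 = np.array([1.0, 0.5, -0.3])
w2 = np.array([-0.2, 0.8, 0.1])

# The difference f1 - f2 = (w1 - w2)^T Z is Gaussian:
d = w1 - w2
m = d @ mu                           # its mean
s2 = d @ Sigma @ d                   # its variance

# Approximation: E[P_1(Z)] ~ logistic(c * m), c = (1 + eps^2 * s2)^(-1/2)
c = (1.0 + eps**2 * s2) ** -0.5
approx = 1.0 / (1.0 + np.exp(-c * m))

# Monte Carlo check
Z = rng.multivariate_normal(mu, Sigma, 500_000)
mc = np.mean(1.0 / (1.0 + np.exp(-(Z @ d))))
print(approx, mc)
```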

Can we generalise for $k > 2$?

#### Best Answer

I am sorry to resurrect a fairly old question, but I was facing a very similar problem recently and stumbled upon a paper that might offer some help: "*Semi-analytical approximations to statistical moments of sigmoid and softmax mappings of normal variables*", https://arxiv.org/pdf/1703.00091.pdf

## Expectation of Softmax approximation

For computing the average value of a softmax mapping $\pi(\mathbf{x})$ of multivariate normally distributed variables $\mathbf{x} \sim \mathcal{N}_D(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the author provides the following approximation:

$$ \mathbb{E}\left[\pi^k(\mathbf{x})\right] \simeq \frac{1}{2 - D + \sum_{k' \neq k} \frac{1}{\mathbb{E}\left[\sigma\left(x^k - x^{k'}\right)\right]}} $$

where $x^k$ denotes the $k$-th component of the $D$-dimensional vector $\mathbf{x}$ and $\sigma(x)$ denotes the one-dimensional sigmoid function. To evaluate this formula one needs the average value $\mathbb{E}[\sigma(x)]$, for which you could use your own approximation (a very similar one is again provided in the aforementioned article).

This formula is based on rewriting the softmax in terms of sigmoids. It starts from the $D = 2$ case you mentioned, where the result is "*exact*" (as exact as the underlying sigmoid approximation), and postulates that the expression remains valid for $D > 2$; the author supports this with numerical experiments.
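For what it's worth, here is a sketch of that formula in code (my own implementation, not the author's; the helper names and the example $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$ are made up). Each pairwise $\mathbb{E}[\sigma(x^k - x^{k'})]$ is computed with the logistic approximation from the question, and the result is checked against Monte Carlo. For the question's setup, one would set $\mathbf{x} = W\boldsymbol{Z}$, which is Gaussian with mean $W\boldsymbol{\mu}$ and covariance $W \Sigma W^T$.

```python
import numpy as np

rng = np.random.default_rng(2)

def expect_sigmoid(m, s2, eps=0.607):
    """E[sigmoid(g)] for g ~ N(m, s2), via the probit/logistic trick."""
    c = (1.0 + eps**2 * s2) ** -0.5
    return 1.0 / (1.0 + np.exp(-c * m))

def expect_softmax(mu, Sigma, eps=0.607):
    """Approximate E[softmax(x)] for x ~ N(mu, Sigma), following the
    pairwise-sigmoid formula from arXiv:1703.00091."""
    D = len(mu)
    out = np.empty(D)
    for k in range(D):
        total = 0.0
        for kp in range(D):
            if kp == k:
                continue
            m = mu[k] - mu[kp]                                # mean of x^k - x^k'
            s2 = Sigma[k, k] + Sigma[kp, kp] - 2 * Sigma[k, kp]  # its variance
            total += 1.0 / expect_sigmoid(m, s2, eps)
        out[k] = 1.0 / (2.0 - D + total)
    return out

# Hypothetical example with D = 4
mu = np.array([0.5, -0.2, 0.1, 0.0])
A = rng.normal(size=(4, 4)) * 0.5
Sigma = A @ A.T
approx = expect_softmax(mu, Sigma)

# Monte Carlo check
Z = rng.multivariate_normal(mu, Sigma, 500_000)
E = np.exp(Z - Z.max(axis=1, keepdims=True))      # numerically stable softmax
mc = (E / E.sum(axis=1, keepdims=True)).mean(axis=0)
print(approx, mc)
```

Note that for $D = 2$ the function reduces exactly to $\mathbb{E}[\sigma(x^1 - x^2)]$, matching the case in the question.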
