Prelims
In the article "Sequential updating of conditional probabilities on directed graphical structures", Spiegelhalter and Lauritzen give an approximation to the expectation of a logistic-transformed Gaussian random variable $\theta \sim N(\mu, \sigma^2)$. It approximates the logistic function by the Gaussian cdf $\Phi$:
$$ \exp(\theta)/(1 + \exp(\theta)) \approx \Phi(\theta \epsilon) $$
for an appropriately chosen $\epsilon$ (they chose $\epsilon = 0.607$). Hence
$$ \mathbb{E} \left[ \exp(\theta)/(1 + \exp(\theta)) \right] \approx \int_{-\infty}^{\infty} \Phi(\theta \epsilon) \, \phi(\theta | \mu, \sigma^2) \, d\theta $$
where $\phi$ is the Gaussian pdf. The integral can be written as
$$ \int_{-\infty}^{\infty} \Pr(U < 0 | \theta) \, \phi(\theta | \mu, \sigma^2) \, d\theta $$
where $U | \theta \sim N(-\theta, \epsilon^{-2})$, so the integral is simply the marginal probability $\Pr(U < 0)$. Since $\theta \sim N(\mu, \sigma^2)$, marginally we have $U \sim N(-\mu, \sigma^2 + \epsilon^{-2})$. Hence
$$ \mathbb{E} \left[ \exp(\theta)/(1 + \exp(\theta)) \right] \approx \Pr(U < 0) = \Phi\left( \frac{\mu}{\sqrt{\sigma^2 + \epsilon^{-2}}} \right) $$
We can then use the initial approximation in the reverse direction to get
$$ \mathbb{E} \left[ \exp(\theta)/(1 + \exp(\theta)) \right] \approx \exp(c \mu)/(1 + \exp(c \mu)) $$
where $c = (1 + \epsilon^2 \sigma^2)^{-1/2}$.
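The two closed-form approximations above can be checked numerically. The following is a minimal Python sketch using only the standard library; the function names and the trapezoidal reference integral are illustrative choices, not taken from the Spiegelhalter and Lauritzen article, and the test values `mu = 1.0`, `sigma2 = 4.0` are arbitrary.

```python
import math

EPS = 0.607  # scaling constant quoted in the text


def sigmoid(x):
    """Logistic function exp(x) / (1 + exp(x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def norm_cdf(x):
    """Standard Gaussian cdf expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def expected_sigmoid_probit(mu, sigma2, eps=EPS):
    """Phi(mu / sqrt(sigma^2 + eps^-2)), the intermediate probit form."""
    return norm_cdf(mu / math.sqrt(sigma2 + eps ** -2))


def expected_sigmoid_logistic(mu, sigma2, eps=EPS):
    """sigmoid(c * mu) with c = (1 + eps^2 sigma^2)^(-1/2)."""
    c = 1.0 / math.sqrt(1.0 + eps ** 2 * sigma2)
    return sigmoid(c * mu)


def expected_sigmoid_quadrature(mu, sigma2, n=4001, width=10.0):
    """Reference value: trapezoidal rule over mu +/- width * sigma (sigma2 > 0)."""
    sigma = math.sqrt(sigma2)
    lo = mu - width * sigma
    h = 2.0 * width * sigma / (n - 1)
    total = 0.0
    for i in range(n):
        t = lo + i * h
        pdf = math.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * sigmoid(t) * pdf
    return total * h


mu, sigma2 = 1.0, 4.0  # arbitrary illustrative values
print("quadrature:", expected_sigmoid_quadrature(mu, sigma2))
print("probit    :", expected_sigmoid_probit(mu, sigma2))
print("logistic  :", expected_sigmoid_logistic(mu, sigma2))
```

All three printed values should be close to each other; how close depends on how well $\Phi(\theta \epsilon)$ tracks the logistic function over the range where $\theta$ has most of its mass.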
Question
Are there any approximations to the expectation of a softmax transformation of multivariate Gaussian variables? In particular, let
$$ \boldsymbol{Z} \sim MVN(\boldsymbol{\mu}, \Sigma) \in \mathbb{R}^{n} $$
Define the activations for the $k$ discrete outcomes, $i = 1, \dots, k$, as
$$ f_i(\boldsymbol{Z}, \boldsymbol{w}_i) = \boldsymbol{w}_i^T \boldsymbol{Z} $$
Finally, define the softmax-transformed activations as
$$ P_i(\boldsymbol{Z}) = \frac{\exp(f_i(\boldsymbol{Z}, \boldsymbol{w}_i))}{\sum_{j=1}^k \exp(f_j(\boldsymbol{Z}, \boldsymbol{w}_j))} $$
What I want is an estimate of the expectation
$$ \mathbb{E}[P_i(\boldsymbol{Z})] $$
Note that in the case $k=2$, we have
$$ P_1(\boldsymbol{Z}) = \frac{\exp(f_1(\boldsymbol{Z}, \boldsymbol{w}_1))}{\exp(f_1(\boldsymbol{Z}, \boldsymbol{w}_1)) + \exp(f_2(\boldsymbol{Z}, \boldsymbol{w}_2))} $$
Therefore
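$$ P_1(\boldsymbol{Z}) = \frac{\exp(f_1(\boldsymbol{Z}, \boldsymbol{w}_1) - f_2(\boldsymbol{Z}, \boldsymbol{w}_2))}{\exp(f_1(\boldsymbol{Z}, \boldsymbol{w}_1) - f_2(\boldsymbol{Z}, \boldsymbol{w}_2)) + 1} $$
and as $f_1(\boldsymbol{Z}, \boldsymbol{w}_1) - f_2(\boldsymbol{Z}, \boldsymbol{w}_2) = (\boldsymbol{w}_1 - \boldsymbol{w}_2)^T \boldsymbol{Z}$ is a linear combination of jointly Gaussian random variables, it is also Gaussian distributed, with mean $(\boldsymbol{w}_1 - \boldsymbol{w}_2)^T \boldsymbol{\mu}$ and variance $(\boldsymbol{w}_1 - \boldsymbol{w}_2)^T \Sigma (\boldsymbol{w}_1 - \boldsymbol{w}_2)$. Hence we can use the initial approximation.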
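As a minimal sketch of this two-class reduction in Python (standard library only): the difference of activations is Gaussian with mean $(\boldsymbol{w}_1 - \boldsymbol{w}_2)^T \boldsymbol{\mu}$ and variance $(\boldsymbol{w}_1 - \boldsymbol{w}_2)^T \Sigma (\boldsymbol{w}_1 - \boldsymbol{w}_2)$, which is then plugged into the logistic approximation with $\epsilon = 0.607$. The function name `two_class_expectation` and the numerical inputs are hypothetical, chosen only for illustration.

```python
import math

EPS = 0.607  # scaling constant quoted in the text


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def two_class_expectation(mu, Sigma, w1, w2, eps=EPS):
    """Approximate E[P_1(Z)] for k = 2 via the scalar Gaussian (w1 - w2)^T Z."""
    d = [a - b for a, b in zip(w1, w2)]
    n = len(d)
    # Mean and variance of the linear combination d^T Z.
    mean = sum(d[i] * mu[i] for i in range(n))
    var = sum(d[i] * Sigma[i][j] * d[j] for i in range(n) for j in range(n))
    c = 1.0 / math.sqrt(1.0 + eps ** 2 * var)
    return sigmoid(c * mean)


# Illustrative inputs (assumed, not from the question).
mu = [0.5, -1.0]
Sigma = [[1.0, 0.3], [0.3, 2.0]]
w1 = [1.0, 0.0]
w2 = [0.0, 1.0]

p1 = two_class_expectation(mu, Sigma, w1, w2)
p2 = two_class_expectation(mu, Sigma, w2, w1)
print("E[P_1] approx:", p1)
print("E[P_2] approx:", p2)
```

Swapping `w1` and `w2` flips the sign of the mean and leaves the variance unchanged, so the two approximate expectations sum to one.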
Can we generalise for $k > 2$?
Best Answer
Apologies for reviving a fairly old question, but I faced a very similar problem recently and stumbled upon a paper that might help: "Semi-analytical approximations to statistical moments of sigmoid and softmax mappings of normal variables", available at https://arxiv.org/pdf/1703.00091.pdf
Expectation of Softmax approximation
To compute the expected value of a softmax mapping $\pi \left( \mathbf{\mathsf{x}} \right)$ of multivariate normal variables $\mathbf{\mathsf{x}} \sim \mathcal{N}_D \left( \boldsymbol{\mu}, \boldsymbol{\Sigma} \right)$, the author provides the following approximation:
$$ \mathbb{E} \left[ \pi^k (\mathbf{\mathsf{x}}) \right] \simeq \frac{1}{2 - D + \sum_{k' \neq k} \frac{1}{\mathbb{E} \left[ \sigma \left( x^k - x^{k'} \right) \right]}} $$
where $x^k$ denotes the $k$-th component of the $D$-dimensional vector $\mathbf{\mathsf{x}}$ and $\sigma \left( x \right)$ is the one-dimensional sigmoid function. To evaluate this formula one needs to compute the expected value $\mathbb{E} \left[ \sigma (x) \right]$, for which you could use your own approximation (a very similar approximation is also provided in the aforementioned article).
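A minimal Python sketch of how this formula could be evaluated, under stated assumptions: each pairwise difference $x^k - x^{k'}$ is Gaussian with mean $\mu_k - \mu_{k'}$ and variance $\Sigma_{kk} + \Sigma_{k'k'} - 2\Sigma_{kk'}$, and $\mathbb{E}[\sigma(\cdot)]$ is computed with the logistic approximation from the question ($\epsilon = 0.607$) rather than the paper's own variant. The function names, the example `mu` and `Sigma`, and the Monte Carlo comparison are all illustrative and not taken from the paper.

```python
import math
import random

EPS = 0.607  # constant from the question's sigmoid approximation


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def expected_sigmoid(mean, var, eps=EPS):
    """E[sigmoid(y)] for y ~ N(mean, var), logistic approximation."""
    return sigmoid(mean / math.sqrt(1.0 + eps ** 2 * var))


def expected_softmax(mu, Sigma, eps=EPS):
    """Approximate E[pi^k(x)] for every k using the pairwise-sigmoid formula."""
    D = len(mu)
    out = []
    for k in range(D):
        inv_sum = 0.0
        for j in range(D):
            if j == k:
                continue
            mean = mu[k] - mu[j]
            var = Sigma[k][k] + Sigma[j][j] - 2.0 * Sigma[k][j]
            inv_sum += 1.0 / expected_sigmoid(mean, var, eps)
        out.append(1.0 / (2.0 - D + inv_sum))
    return out


def cholesky(A):
    """Lower-triangular Cholesky factor of a positive definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][m] * L[j][m] for m in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def monte_carlo_softmax(mu, Sigma, n_samples=20000, seed=0):
    """Monte Carlo estimate of E[softmax(x)] for x ~ N(mu, Sigma)."""
    rng = random.Random(seed)
    L = cholesky(Sigma)
    D = len(mu)
    totals = [0.0] * D
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(D)]
        x = [mu[i] + sum(L[i][m] * z[m] for m in range(i + 1)) for i in range(D)]
        top = max(x)  # subtract the max for numerical stability
        e = [math.exp(v - top) for v in x]
        s = sum(e)
        for i in range(D):
            totals[i] += e[i] / s
    return [t / n_samples for t in totals]


# Illustrative inputs (assumed).
mu = [1.0, 0.0, -0.5]
Sigma = [[1.0, 0.2, 0.0], [0.2, 0.5, 0.1], [0.0, 0.1, 0.8]]
print("approximation:", expected_softmax(mu, Sigma))
print("monte carlo  :", monte_carlo_softmax(mu, Sigma))
```

Note that the approximate expectations are not constrained to sum exactly to one for $D > 2$, so one may want to renormalise them if they are used as probabilities.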
This formula is based on rewriting the softmax in terms of sigmoids. It starts from the $D=2$ case you mentioned, where the result is "exact" (as much as an approximation can be), and the author postulates that the expression remains valid for $D>2$. The proposal is then validated numerically.
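The rewriting in terms of sigmoids can be reconstructed directly from the definitions; the following is a sketch of that step, not a quotation from the paper. Dividing numerator and denominator of the softmax by $e^{x^k}$ gives
$$ \pi^k(\mathbf{\mathsf{x}}) = \frac{e^{x^k}}{\sum_{k'} e^{x^{k'}}} = \frac{1}{1 + \sum_{k' \neq k} e^{-(x^k - x^{k'})}} $$
Since $1/\sigma(y) = 1 + e^{-y}$, summing over the $D - 1$ indices $k' \neq k$ gives
$$ \sum_{k' \neq k} \frac{1}{\sigma \left( x^k - x^{k'} \right)} = (D - 1) + \sum_{k' \neq k} e^{-(x^k - x^{k'})} $$
and therefore
$$ \pi^k(\mathbf{\mathsf{x}}) = \frac{1}{2 - D + \sum_{k' \neq k} \frac{1}{\sigma \left( x^k - x^{k'} \right)}} $$
This identity holds exactly for every $\mathbf{\mathsf{x}}$. The approximation enters when each $\sigma \left( x^k - x^{k'} \right)$ is replaced by its expectation, which is exact for $D = 2$ because the right-hand side then reduces to $\sigma \left( x^k - x^{k'} \right)$ itself.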