# Solved – Expectation of the softmax transform for Gaussian multivariate variables

Prelims

In the article "Sequential updating of conditional probabilities on directed graphical structures", Spiegelhalter and Lauritzen give an approximation to the expectation of a logistic-transformed Gaussian random variable $\theta \sim N(\mu, \sigma^2)$. It uses the Gaussian cdf $\Phi$ in the approximation

$$ \exp(\theta)/(1 + \exp(\theta)) \approx \Phi(\theta \epsilon) $$

for an appropriately chosen $\epsilon$ (in their case they chose $\epsilon = 0.607$). Hence

$$ \mathbb{E} \left[ \exp(\theta)/(1 + \exp(\theta)) \right] \approx \int_{-\infty}^{\infty} \Phi(\theta \epsilon) \, \phi(\theta \mid \mu, \sigma^2) \, d\theta $$

where $\phi$ is the Gaussian pdf. The integral can be written as

$$ \int_{-\infty}^{\infty} \Pr(U < 0 \mid \theta) \, \phi(\theta \mid \mu, \sigma^2) \, d\theta $$

where $U \mid \theta \sim N(-\theta, \epsilon^{-2})$, so the integral is simply the marginal probability $\Pr(U < 0)$. Since $\theta \sim N(\mu, \sigma^2)$, marginally we have $U \sim N(-\mu, \sigma^2 + \epsilon^{-2})$. Hence

$$ \mathbb{E} \left[ \exp(\theta)/(1 + \exp(\theta)) \right] \approx \Pr(U < 0) = \Phi\left( \frac{\mu}{\sqrt{\sigma^2 + \epsilon^{-2}}} \right) $$

We can then use the initial approximation in the reverse direction to get

$$ \mathbb{E} \left[ \exp(\theta)/(1 + \exp(\theta)) \right] \approx \exp(c \mu)/(1 + \exp(c \mu)) $$

where $c = (1 + \epsilon^2 \sigma^2)^{-1/2}$, since $\mu / \sqrt{\sigma^2 + \epsilon^{-2}} = \epsilon c \mu$.
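As a quick sanity check of the closed form above, here is a minimal sketch that compares $\exp(c\mu)/(1+\exp(c\mu))$ with a brute-force numerical integral of the logistic function against the Gaussian density. The function names, the trapezoidal grid, and the test values of $\mu$ and $\sigma^2$ are illustrative assumptions, not taken from the article; only $\epsilon = 0.607$ comes from the text.

```python
import math

EPSILON = 0.607  # scaling constant quoted in the question


def sigmoid(x):
    """Numerically stable logistic function exp(x) / (1 + exp(x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def expected_sigmoid_approx(mu, sigma2, eps=EPSILON):
    """Closed form sigmoid(c * mu) with c = (1 + eps^2 * sigma^2)^(-1/2)."""
    c = 1.0 / math.sqrt(1.0 + eps * eps * sigma2)
    return sigmoid(c * mu)


def expected_sigmoid_quadrature(mu, sigma2, n=20001, width=10.0):
    """Reference value: trapezoidal rule for the integral of
    sigmoid(theta) * N(theta | mu, sigma2) over mu +/- width * sigma."""
    sigma = math.sqrt(sigma2)
    lo = mu - width * sigma
    h = 2.0 * width * sigma / (n - 1)
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for i in range(n):
        theta = lo + i * h
        pdf = norm * math.exp(-0.5 * ((theta - mu) / sigma) ** 2)
        weight = 0.5 if i in (0, n - 1) else 1.0  # trapezoid end points
        total += weight * sigmoid(theta) * pdf
    return total * h


if __name__ == "__main__":
    # Arbitrary illustrative parameter choices.
    for mu, sigma2 in [(0.0, 1.0), (1.0, 2.0), (-2.0, 4.0)]:
        approx = expected_sigmoid_approx(mu, sigma2)
        exact = expected_sigmoid_quadrature(mu, sigma2)
        print(f"mu={mu}, sigma2={sigma2}: approx={approx:.4f}, quadrature={exact:.4f}")
```

The two columns should agree to roughly two decimal places; the gap reflects the error of the probit-for-logit substitution, which is applied twice.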

Question

My question is: are there any approximations to the expectation of a softmax transformation of multivariate Gaussian variables? In particular, let

$$ \boldsymbol{Z} \sim \mathrm{MVN}(\boldsymbol{\mu}, \Sigma), \qquad \boldsymbol{Z} \in \mathbb{R}^{n} $$

Define the $k$ activations, one for each discrete outcome, as

$$ f_i(\boldsymbol{Z}, \boldsymbol{w}_i) = \boldsymbol{w}_i^T \boldsymbol{Z}, \qquad i = 1, \dots, k $$

Finally, define our softmax-transformed activations as

$$ P_i(\boldsymbol{Z}) = \frac{\exp(f_i(\boldsymbol{Z}, \boldsymbol{w}_i))}{\sum_{j=1}^k \exp(f_j(\boldsymbol{Z}, \boldsymbol{w}_j))} $$

What I want is an estimate of the expectation

$$ \mathbb{E}[P_i(\boldsymbol{Z})] $$

Note that in the case $k=2$, we have

$$ P_1(\boldsymbol{Z}) = \frac{\exp(f_1(\boldsymbol{Z}, \boldsymbol{w}_1))}{\exp(f_1(\boldsymbol{Z}, \boldsymbol{w}_1)) + \exp(f_2(\boldsymbol{Z}, \boldsymbol{w}_2))} $$

Therefore, dividing numerator and denominator by $\exp(f_2(\boldsymbol{Z}, \boldsymbol{w}_2))$,

$$ P_1(\boldsymbol{Z}) = \frac{\exp(f_1(\boldsymbol{Z}, \boldsymbol{w}_1) - f_2(\boldsymbol{Z}, \boldsymbol{w}_2))}{\exp(f_1(\boldsymbol{Z}, \boldsymbol{w}_1) - f_2(\boldsymbol{Z}, \boldsymbol{w}_2)) + 1} $$

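and as $f_1(\boldsymbol{Z}, \boldsymbol{w}_1) - f_2(\boldsymbol{Z}, \boldsymbol{w}_2) = (\boldsymbol{w}_1 - \boldsymbol{w}_2)^T \boldsymbol{Z}$ is a linear combination of jointly Gaussian random variables, it is also Gaussian distributed, with mean $(\boldsymbol{w}_1 - \boldsymbol{w}_2)^T \boldsymbol{\mu}$ and variance $(\boldsymbol{w}_1 - \boldsymbol{w}_2)^T \Sigma (\boldsymbol{w}_1 - \boldsymbol{w}_2)$. Hence we can use the initial approximation.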
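A minimal sketch of this $k=2$ reduction: compute the mean and variance of the Gaussian difference of the two activations, then plug them into the logistic approximation from the Prelims. The helper names and the example weights, mean, and covariance are made up for illustration; plain nested lists stand in for vectors and matrices.

```python
import math

EPSILON = 0.607  # scaling constant quoted in the question


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def difference_moments(w1, w2, mu, cov):
    """Mean and variance of (w1 - w2)^T Z for Z ~ MVN(mu, cov)."""
    d = [a - b for a, b in zip(w1, w2)]
    n = len(d)
    mean = sum(d[i] * mu[i] for i in range(n))
    var = sum(d[i] * cov[i][j] * d[j] for i in range(n) for j in range(n))
    return mean, var


def expected_p1_two_class(w1, w2, mu, cov, eps=EPSILON):
    """Approximate E[P_1(Z)] for k = 2 via sigmoid(c * mean),
    with c = (1 + eps^2 * var)^(-1/2)."""
    mean, var = difference_moments(w1, w2, mu, cov)
    c = 1.0 / math.sqrt(1.0 + eps * eps * var)
    return sigmoid(c * mean)


if __name__ == "__main__":
    # Hypothetical example values.
    w1, w2 = [1.0, 0.0], [0.0, 1.0]
    mu = [1.0, 0.0]
    cov = [[1.0, 0.5], [0.5, 2.0]]
    print(difference_moments(w1, w2, mu, cov))
    print(expected_p1_two_class(w1, w2, mu, cov))
```

Swapping `w1` and `w2` gives the estimate for $P_2$, and the two estimates sum to one because the logistic function satisfies $\sigma(x) + \sigma(-x) = 1$.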

Can we generalise this to $k > 2$?

Contents

I am sorry to revive a fairly old question, but I faced a very similar problem recently and stumbled upon a paper that might offer some help. The article is "Semi-analytical approximations to statistical moments of sigmoid and softmax mappings of normal variables", available at https://arxiv.org/pdf/1703.00091.pdf

## Expectation of Softmax approximation

To compute the average value of a softmax mapping $\pi \left( \mathbf{\mathsf{x}} \right)$ of multivariate normal variables $\mathbf{\mathsf{x}} \sim \mathcal{N}_D \left( \mathbf{\mu}, \mathbf{\Sigma} \right)$, the author provides the following approximation:

$$ \mathbb{E} \left[ \pi^k (\mathbf{\mathsf{x}}) \right] \simeq \frac{1}{2 - D + \sum_{k' \neq k} \frac{1}{\mathbb{E} \left[ \sigma \left( x^k - x^{k'} \right) \right]}} $$

Here $x^k$ denotes the $k$-th component of the $D$-dimensional vector $\mathbf{\mathsf{x}}$ and $\sigma \left( x \right)$ is the one-dimensional sigmoid function. To evaluate this formula you need the average value $\mathbb{E} \left[ \sigma (x) \right]$ for each pairwise difference $x = x^k - x^{k'}$, which is itself Gaussian. For this you could use your own approximation (a very similar approximation is also provided in the aforementioned article).
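A minimal sketch of how the formula could be evaluated, assuming the logistic approximation from the question (with $\epsilon = 0.607$) is used for each pairwise $\mathbb{E}[\sigma(x^k - x^{k'})]$; the paper's own sigmoid approximation could be swapped in instead. The function names are hypothetical. In the question's notation the inputs would be the mean and covariance of the activations, i.e. entries $\boldsymbol{w}_i^T \boldsymbol{\mu}$ and $\boldsymbol{w}_i^T \Sigma \boldsymbol{w}_j$, with $D = k$.

```python
import math

EPSILON = 0.607  # scaling constant quoted in the question


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def expected_sigmoid(mean, var, eps=EPSILON):
    """E[sigmoid(x)] for x ~ N(mean, var), using sigmoid(c * mean)."""
    c = 1.0 / math.sqrt(1.0 + eps * eps * var)
    return sigmoid(c * mean)


def expected_softmax(mu, cov, eps=EPSILON):
    """Approximate E[pi^k(x)] for every k, with x ~ N_D(mu, cov).

    Implements 1 / (2 - D + sum_{k' != k} 1 / E[sigmoid(x^k - x^k')]).
    """
    D = len(mu)
    result = []
    for k in range(D):
        denom = 2.0 - D
        for j in range(D):
            if j == k:
                continue
            # x^k - x^j is Gaussian with these moments.
            mean = mu[k] - mu[j]
            var = cov[k][k] + cov[j][j] - 2.0 * cov[k][j]
            p = expected_sigmoid(mean, var, eps)
            if p == 0.0:  # sigmoid underflowed: class k is negligible
                denom = math.inf
                break
            denom += 1.0 / p
        result.append(1.0 / denom)
    return result


if __name__ == "__main__":
    # Hypothetical three-class example.
    mu = [0.2, -1.0, 0.5]
    cov = [[1.0, 0.3, 0.0], [0.3, 2.0, 0.1], [0.0, 0.1, 0.5]]
    estimates = expected_softmax(mu, cov)
    print(estimates, sum(estimates))
```

Nothing in the formula forces the $D$ estimates to sum to exactly one, so the sketch prints the sum as well; whether to renormalise is a separate design choice that the formula itself does not settle.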

This formula is based on rewriting the softmax in terms of sigmoids. It starts from the $D=2$ case you mentioned, where the result is "exact" (as much as an approximation can be), and postulates the validity of the expression for $D>2$. The author supports the proposal with numerical experiments.
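The rewriting can be made explicit as follows (a short derivation using only the definitions of the softmax and the sigmoid, not copied from the paper). Since $1/\sigma(z) = 1 + e^{-z}$,

$$ \pi^k(\mathbf{\mathsf{x}}) = \frac{e^{x^k}}{\sum_{k'} e^{x^{k'}}} = \frac{1}{1 + \sum_{k' \neq k} e^{-(x^k - x^{k'})}} = \frac{1}{1 + \sum_{k' \neq k} \left( \frac{1}{\sigma(x^k - x^{k'})} - 1 \right)} = \frac{1}{2 - D + \sum_{k' \neq k} \frac{1}{\sigma \left( x^k - x^{k'} \right)}} $$

This identity is exact for any fixed $\mathbf{\mathsf{x}}$. The approximation consists of replacing each $\sigma(x^k - x^{k'})$ on the right-hand side by its expectation. For $D=2$ the right-hand side reduces to $\sigma(x^1 - x^2)$, so that replacement gives exactly $\mathbb{E}[\sigma(x^1 - x^2)]$ and the only remaining error is in approximating that one-dimensional expectation.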
