Differential entropy of a Gaussian R.V. is $\log_2(\sigma \sqrt{2\pi e})$. This depends on $\sigma$, the standard deviation.
If we normalize the random variable so that it has unit variance, its differential entropy drops (assuming $\sigma > 1$). To me this is counter-intuitive, because the Kolmogorov complexity of the normalizing constant should be very small compared to the reduction in entropy. One could simply devise an encoder/decoder that divides/multiplies by the normalizing constant to recover any dataset generated by this random variable.
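For concreteness, a quick numerical illustration of what I mean (the value $\sigma = 3$ is just an arbitrary example):

```python
import numpy as np

# Differential entropy of a Gaussian in bits: log2(sigma * sqrt(2*pi*e))
def gaussian_diff_entropy_bits(sigma):
    return np.log2(sigma * np.sqrt(2 * np.pi * np.e))

sigma = 3.0  # arbitrary example with sigma > 1
print(gaussian_diff_entropy_bits(sigma))          # ~3.63 bits
print(gaussian_diff_entropy_bits(sigma / sigma))  # unit variance: ~2.05 bits, i.e. lower
```

The drop is exactly $\log_2 \sigma$ bits, even though the description length of the normalizing constant itself is tiny.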
My understanding is probably off somewhere. Could you please point out the flaw in my reasoning?
Best Answer
I'll have a go at this, though it's a bit above my head, so treat with a sprinkle of salt…
You're not exactly wrong. I think that where your thought experiment falls down is that differential entropy isn't the limiting case of entropy. I'm guessing that because of this, the parallels between it and Kolmogorov complexity are lost.
Let's say we have a discrete random variable $X$. We can calculate its Shannon entropy by summing over all its possible values $x_i$, $$ H(X) = -\sum_i P(X=x_i) \log \big( P(X=x_i) \big). $$
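As a throwaway illustration (my own, using the natural log as in the formulas here), a fair six-sided die:

```python
import numpy as np

# Shannon entropy of a discrete distribution, in nats
def shannon_entropy(probs):
    probs = np.asarray(probs)
    return -np.sum(probs * np.log(probs))

print(shannon_entropy([1/6] * 6))  # fair die: log(6) ~ 1.79 nats
```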
So far so boring. Now let's say that $X$ is a quantised version of a continuous random variable – say, we have a density function $p(\cdot)$ which generates samples from the set of real numbers, and we turn this into a histogram. We'll have a fine enough histogram that the density function is essentially linear within each bin. In that case we're going to have an entropy something like this, $$ H(X) \approx -\sum_{i} p(X=x_i)\,\delta x \log \big( p(X=x_i)\,\delta x \big), $$ where $\delta x$ is the width of our histogram bins and $x_i$ is the midpoint of each. We have a product inside that logarithm – let's separate that out, and use the fact that the bin probabilities $p(X=x_i)\,\delta x$ sum to 1 to move it outside the summation, giving us $$ H(X) \approx -\log \big( \delta x \big) -\sum_{i} p(X=x_i)\,\delta x \log \big( p(X=x_i) \big). $$
If we take the limit, letting $\delta x \rightarrow dx$ and turning the summation into an integration, our approximation becomes exact and we get the following, $$ H(X) = -\log \big( dx \big) -\int_x p(X=x) \log \big( p(X=x) \big)\,dx. $$
The term on the right-hand side is the differential entropy. But look at that horrid $\log \big( dx \big)$ term: it diverges as the bin width goes to zero, so we have to throw it away to get a finite answer. I'm afraid that means differential entropy is not the limiting case of Shannon entropy.
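You can see this numerically. Here's a rough sketch (my own, using a standard Gaussian, whose differential entropy is $\tfrac{1}{2}\log(2\pi e)$ nats): the entropy of the binned variable tracks the differential entropy plus $-\log(\delta x)$, and blows up as the bins shrink.

```python
import numpy as np
from scipy.stats import norm

# Differential entropy of N(0, 1) in nats: 0.5 * log(2*pi*e)
h_diff = 0.5 * np.log(2 * np.pi * np.e)

for delta_x in [0.5, 0.1, 0.01]:
    edges = np.arange(-10.0, 10.0 + delta_x, delta_x)
    mids = (edges[:-1] + edges[1:]) / 2
    probs = norm.pdf(mids) * delta_x   # P(X = x_i) for the quantised variable
    probs /= probs.sum()               # mop up the negligible tail mass
    H_discrete = -np.sum(probs * np.log(probs))
    print(delta_x, H_discrete, h_diff - np.log(delta_x))  # the two agree, and grow without bound
```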
So, we lose some properties. Yes, rescaling your data changes the differential entropy – differential entropy is sort of a measure of how 'closely packed' the pdf is. If you rescale it, then this changes. Another fun property is that it can go negative, unlike Shannon entropy – try setting $\sigma$ really really small and see what happens. Losing the link to Kolmogorov complexity I think is just another casualty.
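For reference (standard results, not specific to your example), the scaling behaviour and the point where the Gaussian's differential entropy goes negative are $$ h(aX) = h(X) + \log|a|, \qquad h\big(\mathcal{N}(0,\sigma^2)\big) = \tfrac{1}{2}\log(2\pi e \sigma^2) < 0 \quad \text{whenever} \quad \sigma < \tfrac{1}{\sqrt{2\pi e}} \approx 0.242. $$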
Fortunately we're not entirely lost. Kullback–Leibler divergences, and by extension mutual information, are fairly well behaved because the $\delta x$'s cancel out. For example, you could calculate $$ \int_x p(X=x) \log \Bigg( \frac{p(X=x)}{q(X=x)} \Bigg) dx $$ where $q(X)$ is some reference distribution – say, a uniform one. This is always non-negative, and when you rescale the variable $X$ the same rescaling factor appears in both $p(X)$ and $q(X)$ and cancels in the ratio, so the effect is far less severe.
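As a sanity check of that last point (my own sketch; I've used a Gaussian reference rather than a uniform one so the integral stays finite), rescaling both distributions leaves the KL divergence untouched:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Numerical KL divergence D(p || q) = integral of p(x) * log(p(x) / q(x)) dx
def kl_divergence(p, q):
    lo, hi = p.ppf(1e-12), p.ppf(1 - 1e-12)  # p has essentially all its mass here
    integrand = lambda x: p.pdf(x) * (p.logpdf(x) - q.logpdf(x))
    value, _ = quad(integrand, lo, hi)
    return value

a = 5.0  # arbitrary rescaling factor
p,  q  = norm(0, 1), norm(1, 2)          # original pair of distributions
pa, qa = norm(0, a), norm(a, 2 * a)      # both rescaled by a

print(kl_divergence(p,  q))   # ~0.443 nats
print(kl_divergence(pa, qa))  # same value: the rescaling cancels in the ratio
```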