Following a prescription given on Wikipedia, I tried generating Student's t-distributed random numbers with three degrees of freedom. I also generated such numbers using NumPy's built-in random number generator for the t-distribution. However, the numbers generated via the Wikipedia prescription do not seem to align well with the analytical form of the distribution. The figure below shows five sample histograms for each case: manual (Wikipedia) generation (upper row) and Python's built-in generation (lower row). Around $x = 0$ there appears to be some issue. Am I right in my observation? If yes, what is wrong with the way I am doing it? My Python code is given below.
Each histogram contains $10000$ numbers, and the green curve represents the analytical distribution.
```python
import numpy as np
import matplotlib.pyplot as plt
plt.style.use("seaborn")
import seaborn as sns
from scipy.stats import t

"""Generate t-distributed values"""
def f(x, mu):
    n = len(x)
    return np.sqrt(n) * (x.mean() - mu) / x.std()

mu = 0
df = 3

# Upper row: manual (Wikipedia) generation
for i in range(5):
    plt.subplot(2, 5, i + 1)
    t_vals = [f(np.random.normal(loc=mu, size=df + 1), mu) for _ in range(10000)]
    sns.distplot(t_vals, kde=False, norm_hist=True)
    x = np.linspace(-5, 5, 100)
    plt.plot(x, t.pdf(x, df))
    plt.xlim([-5, 5])
    plt.xlabel(r"$x$")
    if i == 0:
        plt.ylabel(r"$p(x)$")
    if i == 2:
        plt.title("Manually generated")

# Lower row: NumPy's built-in generator
for i in range(5):
    plt.subplot(2, 5, i + 6)
    t_vals = np.random.standard_t(df, size=10000)
    sns.distplot(t_vals, kde=False, norm_hist=True)
    x = np.linspace(-5, 5, 100)
    plt.plot(x, t.pdf(x, df))
    plt.xlim([-5, 5])
    plt.xlabel(r"$x$")
    if i == 0:
        plt.ylabel(r"$p(x)$")
    if i == 2:
        plt.title("Generated using python")

plt.tight_layout()
plt.savefig("t_dists.pdf", bbox_inches="tight")
```
Best Answer
Short version: the problem lies with NumPy's

```python
x.std()
```

which does not divide by the right number of degrees of freedom (its default denominator is $n$ rather than $n-1$).
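This is easy to check directly, since NumPy exposes the denominator through the `ddof` ("delta degrees of freedom") argument, whose default is zero:

```python
import numpy as np

x = np.random.normal(size=4)
print(x.std())        # NumPy default, ddof=0: denominator n
print(x.std(ddof=1))  # denominator n - 1, as in Gosset's t statistic
```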
Repeating the experiment in R shows no discrepancy, whether one compares the histogram with the theoretical Student's $t$ density with three degrees of freedom, checks the uniformity of the sample transformed by the theoretical Student's $t$ cdf with three degrees of freedom, or inspects the corresponding QQ-plot:
The sample of size $10^5$ was produced as follows in R:
```r
X=matrix(rnorm(4*1e5),ncol=4)
Z=sqrt(4)*apply(X,1,mean)/apply(X,1,sd)
```
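For comparison, a rough Python equivalent of this simulation is sketched below; note the explicit `ddof=1`, which makes NumPy's `std` agree with R's `sd`:

```python
import numpy as np

rng = np.random.default_rng()
X = rng.normal(size=(100_000, 4))  # 10^5 samples of size n = 4
# t statistic with 3 degrees of freedom (mu = 0, unbiased sd)
Z = np.sqrt(4) * X.mean(axis=1) / X.std(axis=1, ddof=1)
```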
A Kolmogorov-Smirnov test also fails to reject the null:
```r
> ks.test(Z,"pt",df=3)

        One-sample Kolmogorov-Smirnov test

data:  Z
D = 0.0039382, p-value = 0.08992
alternative hypothesis: two-sided
```
for one sample and
```r
> ks.test(Z,"pt",df=3)

        One-sample Kolmogorov-Smirnov test

data:  Z
D = 0.0019529, p-value = 0.8402
alternative hypothesis: two-sided
```
for the next.
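The analogous test can be run in Python through `scipy.stats.kstest`, applied for instance to the `Z` array from the NumPy sketch above:

```python
from scipy import stats

# Kolmogorov-Smirnov test of Z against the t cdf with 3 degrees of freedom
print(stats.kstest(Z, "t", args=(3,)))
```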
However, the reason is much more mundane: it just happens that NumPy does not define the sample standard deviation in the standard (Gosset's) way! Indeed, it uses instead the root of
$$\frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})^2$$
which leads to a $t$ statistic inflated by
$$\sqrt{\frac{n}{n-1}}$$
and hence to the observed discrepancy:
```r
> ks.test(sqrt(4/3)*Z,"pt",df=3)

        One-sample Kolmogorov-Smirnov test

data:  Z
D = 0.030732, p-value < 2.2e-16
alternative hypothesis: two-sided
```
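The inflation factor behind the $\sqrt{4/3}$ used above is easy to verify numerically, as in this small sketch:

```python
import numpy as np

x = np.random.normal(size=4)
n = len(x)
# std(ddof=1) exceeds the default std() by exactly sqrt(n / (n - 1)),
# which is sqrt(4/3) for samples of size n = 4
print(np.isclose(x.std(ddof=1), np.sqrt(n / (n - 1)) * x.std()))  # True
```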
While I have no personal objection to using $n$ instead of $n-1$ in the denominator, this definition clashes with Gosset's and hence with the definition of Student's $t$ distribution.
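In the question's Python code, the fix thus amounts to a single argument; a sketch of the corrected generator, with a Kolmogorov-Smirnov sanity check:

```python
import numpy as np
from scipy import stats

def f(x, mu):
    n = len(x)
    # ddof=1 divides by n - 1, matching Gosset's definition of s
    return np.sqrt(n) * (x.mean() - mu) / x.std(ddof=1)

t_vals = [f(np.random.normal(loc=0, size=4), 0) for _ in range(10_000)]
print(stats.kstest(t_vals, "t", args=(3,)))  # should no longer reject
```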