# Solved – Generating t-distributed random numbers: too few values near zero

Following a prescription given on Wikipedia, I tried generating Student's t-distributed random numbers with three degrees of freedom. I also generated these numbers using NumPy's built-in random number generator for the t-distribution. However, the numbers generated via the Wikipedia prescription do not seem to align well with the analytical form of the distribution. The figure below shows five sample histograms for each case: manual (Wikipedia) generation in the upper row and Python's built-in generation in the lower row. Around \$x = 0\$ there appears to be some issue. Am I right in my observation? If so, what is wrong with my approach? My Python code is given below.

Each histogram contains \$10000\$ numbers, and the green curve represents the analytical distribution.

```python
import numpy as np
import matplotlib.pyplot as plt
plt.style.use("seaborn")
import seaborn as sns
from scipy.stats import t

"""Generate t-distributed values"""
def f(x, mu):
    n = len(x)
    return np.sqrt(n) * (x.mean() - mu) / x.std()

mu = 0
df = 3

for i in range(5):
    plt.subplot(2, 5, i + 1)
    t_vals = [f(np.random.normal(loc=mu, size=df + 1), mu) for _ in range(10000)]
    sns.distplot(t_vals, kde=False, norm_hist=True)
    x = np.linspace(-5, 5, 100)
    plt.plot(x, t.pdf(x, df))
    plt.xlim([-5, 5])
    plt.xlabel(r"$x$")
    if i == 0:
        plt.ylabel(r"$p(x)$")
    if i == 2:
        plt.title("Manually generated")

for i in range(5):
    plt.subplot(2, 5, i + 6)
    t_vals = np.random.standard_t(df, size=10000)
    sns.distplot(t_vals, kde=False, norm_hist=True)
    x = np.linspace(-5, 5, 100)
    plt.plot(x, t.pdf(x, df))
    plt.xlim([-5, 5])
    plt.xlabel(r"$x$")
    if i == 0:
        plt.ylabel(r"$p(x)$")
    if i == 2:
        plt.title("Generated using python")

plt.tight_layout()
plt.savefig("t_dists.pdf", bbox_inches="tight")
```

Short version: the problem lies with NumPy's `x.std()`, which by default does not divide by the right number of degrees of freedom.
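The claim is easy to check directly. A minimal sketch (the four-point sample is just illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# NumPy's default is the population standard deviation, dividing by n:
print(x.std())          # sqrt(1.25) ≈ 1.1180

# Gosset's sample standard deviation, dividing by n - 1, needs ddof=1:
print(x.std(ddof=1))    # sqrt(5/3) ≈ 1.2910
```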

Repeating the experiment in R shows no discrepancy, either by comparing the histogram with the theoretical Student's \$t\$ density with three degrees of freedom,

or by checking the uniformity of the transform of the sample by the theoretical Student's \$t\$ cdf with three degrees of freedom,

or by examining the corresponding QQ-plot:

The sample of size 10⁵ was produced as follows in R:

```r
X = matrix(rnorm(4 * 1e5), ncol = 4)
Z = sqrt(4) * apply(X, 1, mean) / apply(X, 1, sd)
```

A Kolmogorov–Smirnov test also fails to reject the null:

```r
> ks.test(Z, "pt", df = 3)

	One-sample Kolmogorov-Smirnov test

data:  Z
D = 0.0039382, p-value = 0.08992
alternative hypothesis: two-sided
```

for one sample and

```r
> ks.test(Z, "pt", df = 3)

	One-sample Kolmogorov-Smirnov test

data:  Z
D = 0.0019529, p-value = 0.8402
alternative hypothesis: two-sided
```

for the next.
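The same experiment can be replicated in NumPy, provided the sample standard deviation is requested explicitly with `ddof=1` (a sketch; the seed is arbitrary):

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
n, df, N = 4, 3, 10**5

# N samples of size n = df + 1 from a standard normal
X = rng.standard_normal((N, n))

# ddof=1 matches R's sd(), i.e. division by n - 1
Z = np.sqrt(n) * X.mean(axis=1) / X.std(axis=1, ddof=1)

# Typically a non-small p-value: no discrepancy with the t(3) cdf
print(kstest(Z, "t", args=(df,)))
```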

However, the reason is much more mundane: it just happens that NumPy does not define the standard deviation in the standard (Gosset's) way! Indeed, it uses instead \$\$\sqrt{\frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})^2}\$\$ which leads to a \$t\$ statistic inflated by \$\$\sqrt{\frac{n}{n-1}}\$\$ and hence to the observed discrepancy:

```r
> ks.test(sqrt(4/3) * Z, "pt", df = 3)

	One-sample Kolmogorov-Smirnov test

data:  Z
D = 0.030732, p-value < 2.2e-16
alternative hypothesis: two-sided
```

While I have no personal objection to using \$n\$ instead of \$n-1\$ in the denominator, this definition clashes with Gosset's and hence with the definition of the Student's \$t\$ distribution.
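The fix in the question's Python code is thus a single argument to `std()`. A sketch of the corrected generator, with `scipy.stats.kstest` (not among the question's imports) added as an illustrative check:

```python
import numpy as np
from scipy.stats import kstest

def f(x, mu):
    n = len(x)
    # ddof=1 gives Gosset's sample standard deviation (divides by n - 1)
    return np.sqrt(n) * (x.mean() - mu) / x.std(ddof=1)

rng = np.random.default_rng(1)
df = 3
t_vals = [f(rng.normal(loc=0.0, size=df + 1), 0.0) for _ in range(10000)]

# The KS test no longer rejects the t(3) null
print(kstest(t_vals, "t", args=(df,)))
```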
