Solved – Generating t-distributed random numbers: too few numbers near zero.

Following a prescription given on Wikipedia, I tried generating Student's t-distributed random numbers with three degrees of freedom. I also generated these numbers using NumPy's built-in random number generator for the t distribution. However, the numbers generated via the Wikipedia prescription do not seem to align well with the analytical form of the distribution. The following figure shows five sample histograms for each case: manual (Wikipedia) generation (upper row) and Python's built-in generation (lower row). I feel that there is some issue around $x = 0$. Am I right in my observation? If so, what is wrong with the way I am doing it? My Python code is given below.

Each histogram contains $10000$ numbers, and the green curve represents the analytical distribution.

[Figure: five histograms per row, manually generated samples (top) and NumPy-generated samples (bottom), each overlaid with the analytical $t_3$ density.]

import numpy as np
import matplotlib.pyplot as plt
plt.style.use("seaborn")
import seaborn as sns
from scipy.stats import t


def f(x, mu):
    """Generate a t-distributed value from a normal sample x."""
    n = len(x)
    return np.sqrt(n) * (x.mean() - mu) / x.std()


mu = 0
df = 3

# Top row: manually generated t values (normal samples of size df + 1 = 4)
for i in range(5):
    plt.subplot(2, 5, i + 1)
    t_vals = [f(np.random.normal(loc=mu, size=df + 1), mu) for _ in range(10000)]
    sns.distplot(t_vals, kde=False, norm_hist=True)
    x = np.linspace(-5, 5, 100)
    plt.plot(x, t.pdf(x, df))
    plt.xlim([-5, 5])
    plt.xlabel(r"$x$")
    if i == 0:
        plt.ylabel(r"$p(x)$")
    if i == 2:
        plt.title("Manually generated")

# Bottom row: NumPy's built-in t generator
for i in range(5):
    plt.subplot(2, 5, i + 6)
    t_vals = np.random.standard_t(df, size=10000)
    sns.distplot(t_vals, kde=False, norm_hist=True)
    x = np.linspace(-5, 5, 100)
    plt.plot(x, t.pdf(x, df))
    plt.xlim([-5, 5])
    plt.xlabel(r"$x$")
    if i == 0:
        plt.ylabel(r"$p(x)$")
    if i == 2:
        plt.title("Generated using python")

plt.tight_layout()
plt.savefig("t_dists.pdf", bbox_inches="tight")

Short version: the problem lies with NumPy's x.std(), which does not divide by the right number of degrees of freedom.
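A quick check of the default (a minimal sketch; NumPy's np.std documents ddof=0 as its default, i.e. the divisor is n rather than n - 1):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
n = len(x)

# NumPy's default divides by n (ddof=0)...
print(x.std())            # sqrt(((x - x.mean())**2).sum() / n)
# ...while Gosset's sample standard deviation divides by n - 1:
print(x.std(ddof=1))      # sqrt(((x - x.mean())**2).sum() / (n - 1))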

Repeating the experiment in R shows no discrepancy: either by comparing the histogram with the theoretical Student's $t$ density with three degrees of freedom

[Figure: histogram of the R sample overlaid with the theoretical $t_3$ density.]

or by checking the uniformity of the sample transformed by the theoretical Student's $t$ cdf with three degrees of freedom

[Figure: histogram of the cdf-transformed sample compared with the uniform density.]

or by examining the corresponding QQ-plot:

[Figure: QQ-plot of the sample quantiles against the theoretical $t_3$ quantiles.]

The sample of size 10⁵ was produced as follows in R:

X=matrix(rnorm(4*1e5),ncol=4)
Z=sqrt(4)*apply(X,1,mean)/apply(X,1,sd)
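For comparison, an equivalent simulation can be run in Python (a sketch, using scipy.stats.kstest in place of R's ks.test; passing ddof=1 makes x.std match R's sd):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)           # arbitrary seed
X = rng.normal(size=(100_000, 4))        # 10^5 samples of size n = 4
# ddof=1 gives the n - 1 denominator, matching R's sd()
Z = np.sqrt(4) * X.mean(axis=1) / X.std(axis=1, ddof=1)
print(stats.kstest(Z, "t", args=(3,)))   # should not reject the t_3 null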

A Kolmogorov-Smirnov test also fails to reject the null:

> ks.test(Z,"pt",df=3)

        One-sample Kolmogorov-Smirnov test

data:  Z
D = 0.0039382, p-value = 0.08992
alternative hypothesis: two-sided

for one sample and

> ks.test(Z,"pt",df=3)

        One-sample Kolmogorov-Smirnov test

data:  Z
D = 0.0019529, p-value = 0.8402
alternative hypothesis: two-sided

for the next.

However, the reason is much more mundane: NumPy simply does not define the standard deviation in the standard (Gosset's) way! It uses instead the root of $$\frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})^2$$ which leads to a $t$ statistic inflated by the factor $$\sqrt{\frac{n}{n-1}}$$ and hence to the observed discrepancy:

> ks.test(sqrt(4/3)*Z,"pt",df=3)

        One-sample Kolmogorov-Smirnov test

data:  sqrt(4/3) * Z
D = 0.030732, p-value < 2.2e-16
alternative hypothesis: two-sided

[Figure: histogram of sqrt(4/3)*Z against the $t_3$ density, reproducing the discrepancy near zero.]
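The inflation factor can also be verified directly in Python, since NumPy's default standard deviation is Gosset's shrunk by $\sqrt{(n-1)/n}$ (a small numerical sketch):

import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed
x = rng.normal(size=4)
n = len(x)

# default (ddof=0) sd equals the sample (ddof=1) sd times sqrt((n - 1) / n),
# so dividing by it inflates the t statistic by sqrt(n / (n - 1))
print(np.isclose(x.std(), x.std(ddof=1) * np.sqrt((n - 1) / n)))  # True
print(np.sqrt(n / (n - 1)))                                       # = sqrt(4/3)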

While I have no personal objection to using $n$ instead of $n-1$ in the denominator, this definition clashes with Gosset's and hence with the definition of Student's $t$ distribution.
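The fix to the question's code is thus a one-liner: pass ddof=1 to x.std(). A corrected version of the question's f would read:

def f(x, mu):
    """t statistic computed with the n - 1 (Gosset) standard deviation."""
    n = len(x)
    return np.sqrt(n) * (x.mean() - mu) / x.std(ddof=1)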
