Solved – Why does scipy use Wald Statistic + t-test as opposed to Wald Statistic + Wald test for linear regression

When performing linear regression, why does scipy use Wald Statistic followed by a t-test, as opposed to Wald Statistic followed by a Wald test?

The following code performs linear regression on two random 5-element vectors using scipy:

import scipy.stats import random  # Draw some random X and Y sampleSize = 5 X = [random.uniform(0, 1) for i in range(sampleSize)] Y = [random.uniform(0, 1) for i in range(sampleSize)]  # named tuple for scipy's linear regression resultScipy = scipy.stats.linregress(X, Y)  print(resultScipy) 

Example output is:

LinregressResult(slope=-0.36654925096390012, intercept=0.67985896267369861, rvalue=-0.32357812225448918, pvalue=0.59531436421063166, stderr=0.61883685654551934) 

I very much wanted to know how scipy obtains the pvalue, in this case 0.59531436421063166. Note that the scipy's documentation isn't particularly enlightening:

p-value: float

two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero.

After some python heavy-lifting I've written some code, which fully reproduces scipy's behaviour:

import random import math import collections import scipy.stats import scipy.stats.distributions as distributions import numpy as np import sys  # Draw some random X and Y sampleSize = 5 X = [random.uniform(0, 1) for i in range(sampleSize)] Y = [random.uniform(0, 1) for i in range(sampleSize)]  # Compute their average avgX = sum(X)/sampleSize avgY = sum(Y)/sampleSize  # Partial steps to compute estimators of linear regression parameters. XDiff = [X_i - avgX for X_i in X] XDiffSquared = [i*i for i in XDiff] YDiff = [Y_i - avgY for Y_i in Y]  # B1 is the estimator of slope. # B0 is the estimator of intercept. # r is the estimator of Y given X. B1 = sum(x * y for x, y in zip(XDiff, YDiff)) / sum(XDiffSquared) B0 = avgY - B1*avgX r = lambda x: B0 + B1*x  # Partial steps to compute Wald Statistic. errs = [y - r(x) for x, y in zip(X, Y)] errStd = math.sqrt((1/(sampleSize-2))*(sum([err**2 for err in errs]))) XStd = math.sqrt((1/(sampleSize))*sum([diff**2 for diff in XDiff])) stdB1 = errStd / (XStd * math.sqrt(sampleSize))  # Wald Statistic. W = (B1 - 0)/stdB1  # pvalue of Wald Test of B1 = 0. pvalueWald = 2*scipy.stats.norm.cdf(-abs(W))  # pvalue of T test of B1 = 0. pvalueT = 2*distributions.t.sf(abs(W), sampleSize - 2)  # named tuple for our linear regression, with p-value coming from Wald Statistic LinregressResult = collections.namedtuple("LinregressResult", ["slope", "intercept", "rvalue", "pvalue", "stderr"]) resultWald = LinregressResult(slope=B1, intercept=B0, rvalue=None, pvalue=pvalueWald, stderr=stdB1)  LinregressResult = collections.namedtuple("LinregressResult", ["slope", "intercept", "rvalue", "pvalue", "stderr"]) resultT = LinregressResult(slope=B1, intercept=B0, rvalue=None, pvalue=pvalueT, stderr=stdB1)  # named tuple for scipy's linear regression resultScipy = scipy.stats.linregress(X, Y)  print(resultWald) print(resultT) print(resultScipy) 

The code computes Wald Statistic on the estimator of the slope (gradient), followed by a (sampleSize - 2) degree of freedom t-test.

Why is the p-value computed in such way? My go-to statistics textbook by Larry Wasserman, suggests that I compute the p-value of linear regression in a different way, by computing Wald Statistic, followed by a Wald test.

The two tests give very different results when the sample size is small and the slope is extreme, with Wald's test p-values being consistently lower than t-test's (even by orders of magnitude).

plot of Wald test vs t-test
enter image description here

Which approach is better, more correct, and which one should I use?

This issue was discussed at scipy's issue tracker:

It turns out that using Wald Statistic increases the type 1 error rate, as seen in the plots above.

Similar Posts:

Rate this post

Leave a Comment