How to determine what degree of polynomial to fit to data

Say you have to fit a polynomial to data that was generated by another polynomial whose degree you do not know. What is the process for determining which degree of polynomial to fit to that data?

I propose this be done via cross-validation. In short, the data is split into K "folds". Each fold takes a turn acting as the test set, while the remaining K-1 folds are used to train a model. The trained model is used to predict the held-out fold, and the error is recorded. The cross-validated error is the average error over the K test sets. This process is repeated for each model you want to evaluate, and the model with the lowest cross-validated error is selected.
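To make the procedure concrete, here is a minimal sketch of K-fold cross-validation written out by hand using only NumPy. The function name `kfold_cv_mse`, the noisy cubic used as test data, and the use of `np.polyfit` for the fit are all assumptions made for illustration; mean squared error is just one reasonable choice of error measure.

```python
import numpy as np

def kfold_cv_mse(x, y, degree, k=10, seed=0):
    """Average held-out MSE of a polynomial fit of the given degree (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(x.size)       # shuffle the indices once
    folds = np.array_split(idx, k)      # K roughly equal folds
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coefs = np.polyfit(x[train], y[train], degree)  # train on the other K-1 folds
        pred = np.polyval(coefs, x[test])               # predict the held-out fold
        errors.append(np.mean((y[test] - pred) ** 2))   # record the test error
    return float(np.mean(errors))       # cross-validated error

# Illustrative data: a noisy cubic (an assumption, not from the original post)
rng = np.random.default_rng(42)
x = rng.uniform(-2, 2, size=200)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(0, 0.5, size=x.size)

# One cross-validated error per candidate degree; pick the smallest
cv_errors = {d: kfold_cv_mse(x, y, d) for d in range(1, 7)}
best_degree = min(cv_errors, key=cv_errors.get)
print(cv_errors)
print("selected degree:", best_degree)
```

Using the same `seed` for every degree means all candidates are scored on identical folds, so differences in error come from the models rather than from the split.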

Each of your polynomial degrees is a separate model. Here is some code to run an example:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline


def make_poly_features(x, degree):
    # Design matrix with columns 1, x, x^2, ..., x^degree
    X = np.zeros(shape=(x.size, degree + 1))
    X[:, 0] = 1
    for i in range(degree):
        X[:, i + 1] = np.power(x, i + 1)

    # Random true coefficients, then noisy responses
    betas = np.random.normal(0, 2, size=X.shape[1])
    y = X @ betas + np.random.normal(0, 4, size=x.size)

    return y, betas


# True degree is drawn uniformly from {2, 3, 4, 5}
degree = np.random.randint(low=2, high=6)
x = np.random.normal(size=100)
y, coef = make_poly_features(x, degree)

plt.scatter(x, y)

model = make_pipeline(StandardScaler(), PolynomialFeatures(), LinearRegression())

# Each candidate degree is a separate model in the grid
parms = {'polynomialfeatures__degree': np.arange(2, 6)}

# 10-fold cross-validation, scored by (negative) mean squared error
gscv = GridSearchCV(model, parms, cv=10, scoring='neg_mean_squared_error')
gscv.fit(x.reshape(-1, 1), y)

space = np.linspace(-3, 3, 101).reshape(-1, 1)

est_deg = gscv.best_params_['polynomialfeatures__degree']

plt.plot(space, gscv.predict(space), color='red')
plt.title(f'True Degree: {degree}  Estimated Degree: {est_deg}')
plt.show()
```

I randomly generate a polynomial degree between 2 and 5 and then generate noisy data from a polynomial of that degree. I then use scikit-learn's `GridSearchCV` to run 10-fold cross-validation over the candidate degrees and select the one with the lowest mean squared error. If you need background on any of these processes, I suggest you read An Introduction to Statistical Learning, particularly chapter 5 on resampling methods. The sklearn documentation is also quite useful and has some background theory.
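If you want to see how close the candidate degrees were, rather than only the winner, the fitted search object keeps the per-candidate scores in its `cv_results_` attribute. The following is a rough sketch on a made-up noisy quadratic (the data, the seed, and the 1 to 5 degree grid are assumptions for illustration); because the scoring is `neg_mean_squared_error`, the scores are negated to recover the MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Illustrative data: a noisy quadratic (an assumption, not from the original post)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0, 1.0, size=x.size)

model = make_pipeline(StandardScaler(), PolynomialFeatures(), LinearRegression())
grid = {'polynomialfeatures__degree': np.arange(1, 6)}
gscv = GridSearchCV(model, grid, cv=10, scoring='neg_mean_squared_error')
gscv.fit(x.reshape(-1, 1), y)

# cv_results_['params'] lists the candidates in the order they were scored
degrees = [p['polynomialfeatures__degree'] for p in gscv.cv_results_['params']]
cv_mse = -gscv.cv_results_['mean_test_score']  # flip the sign back to MSE

for d, mse in zip(degrees, cv_mse):
    print(f'degree {d}: CV MSE = {mse:.3f}')
print('selected degree:', gscv.best_params_['polynomialfeatures__degree'])
```

When two degrees have nearly identical cross-validated error, it is common to prefer the lower degree, since the difference may be within the noise of the fold-to-fold variation.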
