Say you have to fit a polynomial to data that is generated by another polynomial, for example. What is the process of determining what degree polynomial to use to fit that data?

#### Best Answer

I propose this be done via cross-validation. In short, the data is split into K "folds". Each fold takes a turn acting as the test set, while the remaining K-1 folds are used to train a model. The trained model predicts the held-out fold and the error is recorded. The cross-validated (CV) error is the average error over the K test sets. This process is repeated for each model you want to evaluate, and the model with the lowest CV error is selected.
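To make the procedure concrete, here is a minimal sketch of K-fold cross-validation written by hand with NumPy. The helper name `cv_mse_for_degree`, the toy cubic data, and the choice of 5 folds are assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np


def cv_mse_for_degree(x, y, degree, k=5, seed=0):
    """Average held-out MSE of a degree-`degree` polynomial fit over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(x.size)      # shuffle before splitting
    folds = np.array_split(idx, k)     # k roughly equal index sets
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coefs = np.polyfit(x[train], y[train], degree)   # train on k-1 folds
        pred = np.polyval(coefs, x[test])                # predict held-out fold
        errors.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errors))


# Toy data (assumed for illustration): a noisy cubic
rng = np.random.default_rng(42)
x = rng.uniform(-2, 2, size=200)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(0, 0.5, size=x.size)

# One CV error per candidate degree; keep the degree with the lowest error
cv_errors = {d: cv_mse_for_degree(x, y, d) for d in range(1, 7)}
best_degree = min(cv_errors, key=cv_errors.get)
print(cv_errors)
print("selected degree:", best_degree)
```

Because the data are noisy, the selected degree will not always equal the true one, but degrees that underfit badly should show a clearly larger CV error.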

Each of your polynomial degrees is a separate model. Here is some code to run an example:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline


def make_poly_features(x, degree):
    # Design matrix with columns 1, x, x^2, ..., x^degree
    X = np.zeros(shape=(x.size, degree + 1))
    X[:, 0] = 1
    for i in range(degree):
        X[:, i + 1] = np.power(x, i + 1)
    # Random true coefficients, plus Gaussian noise on the response
    betas = np.random.normal(0, 2, size=X.shape[1])
    y = X @ betas + np.random.normal(0, 4, size=x.size)
    return y, betas


# True degree is drawn from {2, 3, 4, 5} (high is exclusive)
degree = np.random.randint(low=2, high=6)
x = np.random.normal(size=100)
y, coef = make_poly_features(x, degree)
plt.scatter(x, y)

# Each candidate degree is a separate model; 10-fold CV selects the lowest MSE
model = make_pipeline(StandardScaler(), PolynomialFeatures(), LinearRegression())
parms = {'polynomialfeatures__degree': np.arange(2, 6)}
gscv = GridSearchCV(model, parms, cv=10, scoring='neg_mean_squared_error')
gscv.fit(x.reshape(-1, 1), y)

# Plot the refit best model over a grid of x values
space = np.linspace(-3, 3, 101).reshape(-1, 1)
est_deg = gscv.best_params_['polynomialfeatures__degree']
plt.plot(space, gscv.predict(space), color='red')
plt.title(f'True Degree: {degree} Estimated Degree: {est_deg}')
plt.show()
```

I randomly generate a polynomial degree and then generate data from a polynomial of that degree. I then use some canned scikit-learn functions (a pipeline wrapped in `GridSearchCV`) to perform the estimation. If you need background on any of these processes, I suggest you read *An Introduction to Statistical Learning*, particularly Chapter 5. The sklearn documentation is also quite useful and has some background theory.
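If you want to see how the candidate degrees compare rather than only the winner, the fitted `GridSearchCV` object stores the per-candidate scores in `cv_results_`. The sketch below is self-contained and uses an assumed noisy cubic as data and a degree grid of 1 to 6, neither of which comes from the answer above. Note that with `scoring='neg_mean_squared_error'` the stored scores are negated MSEs, so the sign is flipped before printing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Illustrative data (an assumption): a noisy cubic
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x - 1.5 * x**3 + rng.normal(0, 1.0, size=x.size)

model = make_pipeline(StandardScaler(), PolynomialFeatures(), LinearRegression())
parms = {'polynomialfeatures__degree': np.arange(1, 7)}
gscv = GridSearchCV(model, parms, cv=10, scoring='neg_mean_squared_error')
gscv.fit(x.reshape(-1, 1), y)

# mean_test_score holds the negated MSE, so flip the sign to get the CV error
degrees = [p['polynomialfeatures__degree'] for p in gscv.cv_results_['params']]
cv_mse = -gscv.cv_results_['mean_test_score']
for d, mse in zip(degrees, cv_mse):
    print(f'degree {d}: CV MSE = {mse:.3f}')
print('selected degree:', gscv.best_params_['polynomialfeatures__degree'])
```

Looking at the whole table is useful because neighbouring degrees often have nearly identical CV error, in which case preferring the lower degree is a reasonable choice.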
