I am implementing linear regression with gradient descent in Python. Both the closed-form solution and gradient descent (without feature scaling) gave satisfactory results. However, once I started using feature scaling (the `StandardScaler` class in sklearn's `preprocessing` module), the results became confusing.

I am following "Hands-On Machine Learning with Scikit-Learn & TensorFlow" by Aurélien Géron as well as the tutorial at http://scikit-learn.org/stable/modules/preprocessing.html

These references state that

- feature scaling is used when some features in the dataset take much larger values than others
- the scaler is fitted on the training data (x_train) and the same scaler is then applied to the test data (x_test), so that the test data is scaled the same way as the training data
- Nowhere is it mentioned that the outputs need to be scaled as well, which is why I have left y_train and y_test unchanged
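The fit-on-train, transform-both pattern described in the list can be sketched as follows. This is a minimal illustration on synthetic data, not the airfoil dataset; the array names `x_train_raw` and `x_test_raw` and the value ranges are assumptions made for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Synthetic stand-in data (assumption): two features on very different scales.
x_train_raw = np.column_stack([rng.uniform(200, 20000, 100), rng.uniform(0.02, 0.3, 100)])
x_test_raw = np.column_stack([rng.uniform(200, 20000, 40), rng.uniform(0.02, 0.3, 40)])

# The scaler learns its mean and standard deviation from the training set only.
scaler = StandardScaler().fit(x_train_raw)
x_train = scaler.transform(x_train_raw)
x_test = scaler.transform(x_test_raw)  # reuses the training statistics

train_mean = x_train.mean(axis=0)
train_std = x_train.std(axis=0)
print("train mean:", train_mean, "train std:", train_std)
print("test mean:", x_test.mean(axis=0))  # near zero, but not exactly zero
```

The targets are not passed to the scaler at any point, which matches the third bullet.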

With all of this in place, when I build a linear regression model, the predicted values (y_predict) are far smaller than the true values (y_test). At first glance, it looks like y_predict has been reduced by some factor. Am I missing something?

The dataset that I am using is available at http://archive.ics.uci.edu/ml/machine-learning-databases/00291/airfoil_self_noise.dat

The first few lines of this dataset are as follows:

```
 800   0   0.3048   71.3   0.00266337   126.201
1000   0   0.3048   71.3   0.00266337   125.201
1250   0   0.3048   71.3   0.00266337   125.951
1600   0   0.3048   71.3   0.00266337   127.591
2000   0   0.3048   71.3   0.00266337   127.461
2500   0   0.3048   71.3   0.00266337   125.571
```

where the last column is the output value. Clearly, the features differ greatly in the values they take (the first feature takes values in the thousands, while the third is only around 0.3). So, in my understanding, feature scaling is applicable here.
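As a quick check of that scale difference, the sketch below computes the largest absolute value per feature column for the six rows quoted above. Only those six rows are used, so the numbers describe this excerpt rather than the whole dataset.

```python
import numpy as np

# The six rows quoted above; the last column is the target.
rows = np.array([
    [800, 0, 0.3048, 71.3, 0.00266337, 126.201],
    [1000, 0, 0.3048, 71.3, 0.00266337, 125.201],
    [1250, 0, 0.3048, 71.3, 0.00266337, 125.951],
    [1600, 0, 0.3048, 71.3, 0.00266337, 127.591],
    [2000, 0, 0.3048, 71.3, 0.00266337, 127.461],
    [2500, 0, 0.3048, 71.3, 0.00266337, 125.571],
])

features = rows[:, :-1]                      # drop the target column
col_max = np.abs(features).max(axis=0)       # largest magnitude per feature
print(col_max)
```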

Now, when I build the model and compare y_test with y_predict, the differences are significant. The first few comparisons are as follows:

```
y_test    y_predicted
123.965    1.730859
124.835    2.659574
125.625    0.581208
123.807    0.218661
127.127    3.279522
122.724   -3.943073
126.160    4.236322
```

As can be seen, y_predicted is significantly smaller than y_test.

I am sharing my code snippets just in case it could help you:

```python
scaler = preprocessing.StandardScaler().fit(x_trainn)
x_train = scaler.transform(x_trainn)  # x_trainn is the unscaled version of the training data
x_test = scaler.transform(x_testt)    # x_testt is the unscaled test data

for iteration in range(n_iterations):
    gradients = (2/m) * x_train.T.dot(x_train.dot(theta) - y_train)
    theta = theta - eta * gradients

y_predict = x_test.dot(theta)
out = np.column_stack((y_test, y_predict))
print(pd.DataFrame(out))
```


#### Best Answer

My guess is that you have accidentally transformed y_train (somewhere in the code you have not posted). I suspect this because the following reproducible snippet works:

```python
import numpy as np
import pandas as pd
import math
from sklearn import preprocessing

dat = pd.read_csv("/home/steffen/workspaces/airfoil/airfoil_self_noise.dat", sep="\t", low_memory=False, header=None)

apply_scaler = True

# split into train 2/3 and test 1/3
rng = np.random.RandomState(42)
n_rows = dat.shape[0]
n_train = math.floor(0.66*n_rows)
permutated_indices = rng.permutation(n_rows)
train_dat = dat.loc[permutated_indices[:n_train],:]
test_dat = dat.loc[permutated_indices[n_train:],:]

# separate the response variable (last column) from the predictor variables
# (the slice 1:-1 selects columns 1 to 4; column 0 is not used here)
x_train = train_dat.iloc[:,1:-1]
y_train = train_dat.iloc[:,-1].to_numpy()[:, np.newaxis]
x_test = test_dat.iloc[:,1:-1]
y_test = test_dat.iloc[:,-1].to_numpy()[:, np.newaxis]

# train
# fit the scaler to predictor variables and apply it afterwards
scaler = preprocessing.StandardScaler().fit(x_train)
if apply_scaler:
    x_train = pd.DataFrame(scaler.transform(x_train))

# add constant one for the intercept parameter
x_train = pd.concat([pd.DataFrame(np.ones(shape=(x_train.shape[0],1)),index=x_train.index),x_train],axis=1)

# fit parameters of linear regression using batch gradient descent
# Hands-On Machine Learning with Scikit-Learn & Tensorflow, page 115
eta = 0.1 # learning rate
n_iterations = 1000
m = x_train.shape[0]

theta = rng.randn(x_train.shape[1],1)
for iteration in range(n_iterations):
    gradients = (2 / m) * x_train.T.dot(x_train.dot(theta) - y_train)
    theta = theta - eta * gradients

# to apply the fitted parameters, first we have to transform the test-data in the same way
# apply scaler
if apply_scaler:
    x_test = pd.DataFrame(scaler.transform(x_test))

# add constant one for the intercept parameter
x_test = pd.concat([pd.DataFrame(np.ones(shape=(x_test.shape[0],1)),index=x_test.index),x_test],axis=1)

# apply fitted parameters
y_predict = x_test.dot(theta)

# compare output
out = np.column_stack((y_test, y_predict))
print(pd.DataFrame(out).head())

# root mean squared error
print("error %f" % np.sqrt(np.power(y_test-y_predict,2).mean()))
```

This produces the following output:

```
         0           1
0  120.573  127.108268
1  127.220  123.492931
2  113.045  122.393120
3  119.606  122.570836
4  131.971  127.270743
error 6.175637
```

which is fine.
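One step in the snippet above that is not visible in the code posted in the question is the constant-one column for the intercept. Whether that played a role in the question cannot be confirmed from the posted excerpt, but the effect of the step is easy to illustrate. The sketch below uses synthetic data (an assumption, not the airfoil dataset) and a hypothetical helper `batch_gd`. With standardized features and no intercept column, the predictions average to zero, whatever the mean of the target is.

```python
import numpy as np

rng = np.random.RandomState(0)
m = 200

# Synthetic stand-in data (assumption): the target sits well above zero.
x_raw = rng.uniform(0, 1000, size=(m, 2))
y = 125 + 0.01 * x_raw[:, [0]] - 0.005 * x_raw[:, [1]] + rng.randn(m, 1)

# Standardize by hand (zero mean, unit variance per column).
x = (x_raw - x_raw.mean(axis=0)) / x_raw.std(axis=0)


def batch_gd(X, y, eta=0.1, n_iterations=1000):
    """Plain batch gradient descent for least squares (hypothetical helper)."""
    theta = np.zeros((X.shape[1], 1))
    for _ in range(n_iterations):
        gradients = (2 / len(X)) * X.T.dot(X.dot(theta) - y)
        theta = theta - eta * gradients
    return theta


# Without an intercept column: every feature has mean zero, so predictions do too.
pred_no_intercept = x.dot(batch_gd(x, y))

# With a column of ones, the first parameter can absorb the mean of y.
x_b = np.c_[np.ones((m, 1)), x]
pred_with_intercept = x_b.dot(batch_gd(x_b, y))

print("mean of y:                  ", y.mean())
print("mean pred without intercept:", pred_no_intercept.mean())
print("mean pred with intercept:   ", pred_with_intercept.mean())
```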

Interestingly, with a learning rate of 0.1 this simple batch gradient descent implementation fails to converge when no normalization is performed (apply_scaler=False, eta=0.1), while scikit-learn's LinearRegression implementation still finds a solution. Reducing the learning rate dramatically (eta=0.0001) leads to convergence again.
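The interaction between feature scale and learning rate can be reproduced on synthetic data. The sketch below is an illustration under assumed value ranges, not a rerun of the airfoil experiment, and `final_rmse` is a hypothetical helper. It runs the same batch gradient descent with eta=0.1 on raw and on standardized features.

```python
import numpy as np

rng = np.random.RandomState(42)
m = 200

# Synthetic stand-in data (assumption): one feature in the thousands, one below 1.
x_raw = np.column_stack([rng.uniform(200, 20000, m), rng.uniform(0.02, 0.3, m)])
y = 120 + 0.001 * x_raw[:, [0]] + 10 * x_raw[:, [1]] + rng.randn(m, 1)


def final_rmse(x, y, eta, n_iterations=1000):
    """Run batch gradient descent and return the training RMSE (hypothetical helper)."""
    X = np.c_[np.ones((len(x), 1)), x]  # add the intercept column
    theta = np.zeros((X.shape[1], 1))
    # Divergence overflows to inf/nan; silence the resulting floating-point warnings.
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(n_iterations):
            gradients = (2 / len(X)) * X.T.dot(X.dot(theta) - y)
            theta = theta - eta * gradients
        return float(np.sqrt(np.mean((X.dot(theta) - y) ** 2)))


x_scaled = (x_raw - x_raw.mean(axis=0)) / x_raw.std(axis=0)

rmse_raw = final_rmse(x_raw, y, eta=0.1)        # steps overshoot and blow up
rmse_scaled = final_rmse(x_scaled, y, eta=0.1)  # settles near the noise level

print("RMSE on raw features:   ", rmse_raw)
print("RMSE on scaled features:", rmse_scaled)
```

How small the learning rate must be for the unscaled case to stop diverging depends on the magnitude of the features, so the value 0.0001 quoted above is specific to the data used there.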

This is one example of the limitations of gradient descent, as discussed in "Do we need gradient descent to find the coefficients of a linear regression model".
