Solved – feature scaling giving reduced output (linear regression using gradient descent)

I am implementing linear regression using the gradient descent algorithm in Python. The closed-form solution as well as gradient descent (without feature scaling) were giving satisfactory results. However, the moment I started using feature scaling (the StandardScaler class in sklearn's preprocessing module), things started to look a bit confusing.

I am following "Hands-On Machine Learning with Scikit-Learn & TensorFlow" by Aurélien Géron as well as the tutorial available at http://scikit-learn.org/stable/modules/preprocessing.html

In the above references, it is clearly stated that

  1. feature scaling is done when some of the features in the dataset have large values compared to others
  2. and that the scaler is fitted on the training data (x_train) and the same scaler is applied to the test data (x_test) as well, so that the test data is scaled in the same way as the training data (see the sketch right after this list)
  3. nowhere is it mentioned that the outputs need to be scaled as well. That is why I have left y_train and y_test unchanged
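For illustration, here is a minimal sketch of points 1 and 2; the feature matrices below are made-up stand-ins, not the actual airfoil data:

    import numpy as np
    from sklearn import preprocessing

    # hypothetical unscaled feature matrices (stand-ins for the real training/test data)
    x_train_raw = np.array([[800.0, 71.3], [1000.0, 69.2], [1250.0, 72.0]])
    x_test_raw = np.array([[1600.0, 70.5]])

    scaler = preprocessing.StandardScaler().fit(x_train_raw)  # learn mean/std from the training data only
    x_train = scaler.transform(x_train_raw)                   # standardize the training features
    x_test = scaler.transform(x_test_raw)                     # reuse the same mean/std for the test features
    # y_train and y_test stay in their original units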

Now, keeping the above points in mind, when I build a linear regression model, the predicted values (y_predict) are far smaller than the true values (y_test). At first glance, it looks like y_predict has been shrunk by some factor. Am I missing something?

The dataset that I am using is available at http://archive.ics.uci.edu/ml/machine-learning-databases/00291/airfoil_self_noise.dat

The first few lines of this dataset are as follows:

    800     0   0.3048  71.3    0.00266337  126.201
    1000    0   0.3048  71.3    0.00266337  125.201
    1250    0   0.3048  71.3    0.00266337  125.951
    1600    0   0.3048  71.3    0.00266337  127.591
    2000    0   0.3048  71.3    0.00266337  127.461
    2500    0   0.3048  71.3    0.00266337  125.571

where the last column is the output value. Clearly, the features differ greatly in the values they take (the first feature takes values in the thousands, while the third feature only takes values around 0.3). So, in my understanding, feature scaling is applicable here.
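A quick way to confirm this is to look at per-column summary statistics. A minimal sketch, assuming a local copy of the dataset linked above (the file name is an assumption):

    import pandas as pd

    # per-column mean and standard deviation of the features (the last column is the target)
    dat = pd.read_csv("airfoil_self_noise.dat", sep="\t", header=None)
    print(dat.iloc[:, :-1].describe().loc[["mean", "std"]])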

Now, when I build the model and compare y_test with y_predict, there are significant differences. The first few comparisons are as follows:

    y_test     y_predicted
    123.965      1.730859
    124.835      2.659574
    125.625      0.581208
    123.807      0.218661
    127.127      3.279522
    122.724     -3.943073
    126.160      4.236322

As can be seen, y_predicted is significantly smaller than y_test.

I am sharing my code snippet in case it helps:

    scaler = preprocessing.StandardScaler().fit(x_trainn)
    x_train = scaler.transform(x_trainn)   # x_trainn is the unscaled version of the training data
    x_test = scaler.transform(x_testt)     # x_testt is the unscaled test data

    for iteration in range(n_iterations):
        gradients = (2/m) * x_train.T.dot(x_train.dot(theta) - y_train)
        theta = theta - eta * gradients

    y_predict = x_test.dot(theta)

    out = np.column_stack((y_test, y_predict))
    print(pd.DataFrame(out))

My guess is that you have accidentally transformed y_train somewhere in the code you have not posted. This is because the following reproducible snippet works:

    import numpy as np
    import pandas as pd
    import math
    from sklearn import preprocessing

    dat = pd.read_csv("/home/steffen/workspaces/airfoil/airfoil_self_noise.dat", sep="\t", low_memory=False, header=None)

    apply_scaler = True

    # split into train 2/3 and test 1/3
    rng = np.random.RandomState(42)

    n_rows = dat.shape[0]
    n_train = math.floor(0.66*n_rows)

    permutated_indices = rng.permutation(n_rows)

    train_dat = dat.loc[permutated_indices[:n_train], :]
    test_dat = dat.loc[permutated_indices[n_train:], :]

    # separate the response variable (last column) from the predictor variables
    x_train = train_dat.iloc[:, 1:-1]
    y_train = train_dat.iloc[:, -1].values[:, np.newaxis]

    x_test = test_dat.iloc[:, 1:-1]
    y_test = test_dat.iloc[:, -1].values[:, np.newaxis]

    # train
    # fit the scaler to the predictor variables and apply it afterwards
    scaler = preprocessing.StandardScaler().fit(x_train)

    if apply_scaler:
        x_train = pd.DataFrame(scaler.transform(x_train))

    # add a constant one for the intercept parameter
    x_train = pd.concat([pd.DataFrame(np.ones(shape=(x_train.shape[0], 1)), index=x_train.index), x_train], axis=1)

    # fit parameters of linear regression using batch gradient descent
    # Hands-On Machine Learning with Scikit-Learn & TensorFlow, page 115
    eta = 0.1  # learning rate
    n_iterations = 1000
    m = x_train.shape[0]
    theta = rng.randn(x_train.shape[1], 1)

    for iteration in range(n_iterations):
        gradients = (2 / m) * x_train.T.dot(x_train.dot(theta) - y_train)
        theta = theta - eta * gradients

    # to apply the fitted parameters, first transform the test data in the same way
    # apply scaler
    if apply_scaler:
        x_test = pd.DataFrame(scaler.transform(x_test))

    # add a constant one for the intercept parameter
    x_test = pd.concat([pd.DataFrame(np.ones(shape=(x_test.shape[0], 1)), index=x_test.index), x_test], axis=1)

    # apply fitted parameters
    y_predict = x_test.dot(theta)

    # compare output
    out = np.column_stack((y_test, y_predict))
    print(pd.DataFrame(out).head())
    # root mean squared error
    print("error %f" % np.sqrt(np.power(y_test - y_predict, 2).mean()))

This produces the following output:

              0           1
    0  120.573  127.108268
    1  127.220  123.492931
    2  113.045  122.393120
    3  119.606  122.570836
    4  131.971  127.270743
    error 6.175637

which is fine.

It is interesting to see that, for learning rate 0.1, this simple batch gradient descent implementation fails to converge if no scaling is performed (apply_scaler=False, eta=0.1), while the LinearRegression implementation of scikit-learn still finds a solution. Reducing the learning rate dramatically (eta=0.0001) leads to convergence again.
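For reference, a minimal sketch of that comparison, reusing the unscaled train_dat/test_dat split from the snippet above; scikit-learn's LinearRegression solves the least-squares problem directly, so it needs neither a learning rate nor feature scaling to converge:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # fit on the same (unscaled) predictor columns used above; fit_intercept=True supplies the bias term
    lin_reg = LinearRegression()
    lin_reg.fit(train_dat.iloc[:, 1:-1], train_dat.iloc[:, -1])

    # predict on the unscaled test predictors and report the root mean squared error
    y_lr = lin_reg.predict(test_dat.iloc[:, 1:-1])
    print("sklearn error %f" % np.sqrt(np.power(test_dat.iloc[:, -1].values - y_lr, 2).mean()))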

This is one example of where gradient descent is limited, as discussed here: Do we need gradient descent to find the coefficients of a linear regression model?
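As a side note on that guess: if y_train had been passed through a StandardScaler somewhere, the predictions would come out on the standardized scale (small values around zero, much like the y_predicted column in the question) and would have to be inverted before they are comparable to the untouched y_test. A minimal sketch with a hypothetical scaler_y and made-up targets:

    import numpy as np
    from sklearn import preprocessing

    y_train_demo = np.array([[126.2], [125.2], [125.9], [127.6]])  # made-up targets in the ~120-130 range
    scaler_y = preprocessing.StandardScaler().fit(y_train_demo)

    y_scaled = scaler_y.transform(y_train_demo)    # values near zero, like y_predicted in the question
    y_back = scaler_y.inverse_transform(y_scaled)  # mapped back to the original scale
    print(y_scaled.ravel(), y_back.ravel())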
