Solved – sklearn Linear Regression vs Batch Gradient Descent

tldr: Why would sklearn LinearRegression give a different result than gradient descent?

My understanding is that LinearRegression computes the closed-form solution for linear regression (described well here: Why use gradient descent for linear regression, when a closed-form math solution is available?). LinearRegression is not good if the data set is large, in which case stochastic gradient descent needs to be used. I have a small data set and wanted to use Batch Gradient Descent (self-written) as an intermediate step for my own edification.

I get different regression weights using LinearRegression and Batch Gradient Descent. Shouldn't the solution be unique?

Code:

import numpy as np
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt

data = pd.read_csv(r'')  # Data set attached
X = data[['Size', 'Floor', 'Broadband  Rate']]
y = data['Rental Price']

# Sklearn Linear Regression
ols = linear_model.LinearRegression(fit_intercept=True, normalize=False)
LR = ols.fit(X, y)
Res_LR = y.values - LR.predict(X)  # Residuals

print('Intercept', LR.intercept_, 'Weights', LR.coef_)

# Batch Gradient Descent
def error_delta(x, y, p, wn):
    total = 0
    row, column = np.shape(x)

    for i in range(0, row):
        if wn != 0: total += (y[i] - (p[0] + np.dot(p[1:len(p)], x[i, :]))) * x[i, wn-1]
        else: total += (y[i] - (p[0] + np.dot(p[1:len(p)], x[i, :]))) * 1
        return total

def weight_update(x, y, p, alpha):
    old = p
    new = np.zeros(len(p))
    for i in range(0, len(p)): new[i] = old[i] + alpha * error_delta(x, y, old, i)
    return new

weight = [-.146, .185, -.044, .119]  # random starting conditions
alpha = .00000002  # learning rate

for i in range(0, 500):  # Seems to have converged by 100
    weight = weight_update(X.values, y.values, weight, alpha)

Res_BGD = np.zeros(len(X.values))
for i in range(0, len(X.values)): Res_BGD[i] = y.values[i] - (weight[0] + np.dot(weight[1:], X.values[i, :]))

print('Intercept', weight[0], 'Weights', weight[1:len(weight)])

plt.plot(np.arange(0, len(X.values)), Res_LR, color='b')
plt.plot(np.arange(0, len(X.values)), Res_BGD, color='g')
plt.legend(['Res LR', 'Res BGD'])
plt.show()
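
For reference, my understanding is that LinearRegression should be reproducing the closed-form least-squares solution. A quick sanity check of that (just a sketch, reusing the X and y loaded above and adding an explicit column of ones for the intercept):

import numpy as np

# design matrix with an explicit intercept column
Xb = np.column_stack([np.ones(len(X)), X.values])

# closed-form least-squares solution: minimizes ||Xb @ w - y||^2
w_closed, *_ = np.linalg.lstsq(Xb, y.values, rcond=None)

print('Closed-form intercept', w_closed[0], 'weights', w_closed[1:])

This should agree with LR.intercept_ and LR.coef_ up to numerical precision.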

The data set is below (10 points):

Size,Floor,Broadband  Rate,Energy Rating,Rental Price
"  500   ","    4    ","    8    ","    C ","   320 "
"  550   ","    7    ","    50   ","    A ","   380 "
"  620   ","    9    ","    7    ","    A ","   400 "
"  630   ","    5    ","    24   ","    B ","   390 "
"  665   ","    8    ","    100  ","    C ","   385 "
"  700   ","    4    ","    8    ","    B ","   410 "
"  770   ","    10   ","    7    ","    B ","   480 "
"  880   ","    12   ","    50   ","    A ","   600 "
"  920   ","    14   ","    8    ","    C ","   570 "
"  1000  ","    9    ","    24   ","    B ","   620 "

When you plot the residuals, the performance is comparable despite very different weights.

Sklearn LR: Intercept 19.5615588974 Weights [ 0.54873985 4.96354677 -0.06209515]
BGD: Intercept -0.145402077197 Weights [ 0.62549182 -0.0344091 0.11473203]

Thoughts? Also, if there's any programming feedback, I'm open to more efficient ways to code this as well.

There are some problems with your question.

  1. Can you make sure the iterative solver converges?

Note that linear regression (minimizing the squared loss) can be solved in several different ways. In my experience with Python scikit-learn, the default settings of an iterative solver often do not produce a converged result: the number of iterations is capped and the solver stops early. A solution that stops early is only half-finished, so it will not be the same as the optimal solution you get from other algorithms.
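
A quick way to check this with your own solver (just a sketch, reusing X, y, weight_update and alpha from your code) is to track the sum of squared residuals at every iteration; if the curve is still falling at the last iteration, the solver has not converged yet.

import numpy as np
import matplotlib.pyplot as plt

sse = []
weight = [-.146, .185, -.044, .119]
for i in range(500):
    weight = weight_update(X.values, y.values, weight, alpha)
    pred = weight[0] + X.values @ weight[1:]    # current predictions
    sse.append(np.sum((y.values - pred) ** 2))  # squared loss at this iteration

plt.plot(sse)
plt.xlabel('iteration')
plt.ylabel('sum of squared residuals')
plt.show()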

  2. I would not agree with

LinearRegression is not good if the data set is large, in which case stochastic gradient descent needs to be used.

If we use a QR decomposition, then even when the data is on the order of millions of rows (hopefully that counts as large), as long as the number of features is not big, we can solve it in about a second. Check this R code: you may be surprised that we can fit a linear regression on a million data points in less than 1 second.

x = matrix(runif(2e6), ncol=2)
y = runif(1e6)
stime = proc.time()
lm(y ~ x)
print(proc.time() - stime)
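
For comparison, here is roughly the same experiment in Python/numpy, since the rest of the thread is in Python (again just a sketch; np.linalg.lstsq uses an SVD-based LAPACK routine rather than QR, but it makes the same point about speed):

import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1_000_000, 2))              # one million rows, two features
y = rng.random(1_000_000)
Xb = np.column_stack([np.ones(len(X)), X])  # add an intercept column

start = time.time()
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print('coefficients:', w)
print('elapsed:', time.time() - start, 'seconds')

This should also finish on the order of a second on an ordinary machine.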
