# Solved – Sklearn RANSAC linear regression results vary wildly

I'm trying scikit-learn's RANSAC implementation (`RANSACRegressor`) to produce a simple linear regression fit with built-in outlier detection/rejection (code below). This is what my raw data looks like:

Even using `max_trials=10000000` (the maximum number of iterations for random sample selection), the estimated coefficients (slope, intercept) and the R^2 & RMSE statistics vary quite a bit with each new run on the exact same raw data.

I would have assumed the results converge given enough iterations, but apparently that is not the case. Are 10 million iterations not enough? Is my data too noisy? Am I missing some parameter I should fix in `RANSACRegressor`, or is this simply the nature of the RANSAC algorithm?

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import r2_score, mean_squared_error

# Data
x = [-0.099, 0.094, 1.313, 0.096, 0.089, -0.002, 0.11, 2.144, -0.004,
     0.245, 0.666, 1.003, 0.719, 1.936, 0.098, 0.394, 0.357, 1.284,
     -0.027, 0.123, 1.837, 0.853, 1.067, 0.126, 0.078, -0.001, 0.318,
     0.399, 1.426, 1.497, 1.086, -0.1, 1.296, 0.447, 0.236, 0.263,
     1.559, 1.12, 1.239, 1.092, 1.178, 1.915, -0.103, 0.156, 0.532,
     1.871, -1.217, 0.38]
y = [-4.50647, -3.85937, -3.58987, -3.77997, -3.75957, -3.82067,
     -3.74137, -3.07317, -3.74397, -3.69227, -3.72517, -3.66097,
     -3.87547, -3.16167, -3.86357, -3.81467, -3.69137, -3.51477,
     -3.85237, -3.81257, -0.58567, -3.81447, -3.68717, -3.67937,
     -3.80357, -3.85067, -3.67037, -3.70257, -2.66287, -3.56267,
     -3.09977, 1.24923, -3.707151, -3.948651, -3.725051, -3.917951,
     -3.586151, -3.808251, -3.727151, -3.780851, -3.611551, -3.412488,
     -3.852788, -3.737588, -3.826488, -3.437088, -4.066888, -3.861488]

# Proper shape
x, y = np.array(x).reshape(-1, 1), np.array(y).reshape(-1, 1)

# Fit linear model with RANSAC algorithm
ransac = linear_model.RANSACRegressor(min_samples=2, max_trials=10000000)

# Run regression 10 times with the same raw data
for _ in range(10):
    ransac.fit(x, y)

    # Estimated coefficients
    m, c = float(ransac.estimator_.coef_), float(ransac.estimator_.intercept_)
    print("\nm={:.3f}, c={:.3f}".format(m, c))

    # R^2 & RMSE statistics
    inlier_mask = ransac.inlier_mask_
    x_accpt, y_accpt = x[inlier_mask], y[inlier_mask]
    y_predict = m * x_accpt + c
    R2 = r2_score(y_accpt, y_predict)
    RMSE = np.sqrt(mean_squared_error(y_accpt, y_predict))
    print("R^2: {:.3f}, RMSE: {:.3f}".format(R2, RMSE))
```
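For what it's worth, one parameter that does pin down the run-to-run randomness is `random_state`: fixing it makes the random sample selection deterministic, so repeated fits on the same data return identical coefficients (though this only freezes the randomness rather than removing the underlying sensitivity). A minimal sketch on synthetic data (the synthetic line and outlier pattern are assumptions for illustration, not the data above):

```python
import numpy as np
from sklearn import linear_model

# Synthetic stand-in data: a noisy line with a few injected outliers
rng = np.random.RandomState(0)
x = rng.uniform(-1, 2, size=(50, 1))
y = 0.5 * x - 3.8 + rng.normal(0, 0.05, size=(50, 1))
y[::10] += 3.0  # every 10th point becomes an outlier

# With random_state fixed, the random sample selection is deterministic,
# so every fit on the same data yields the same slope.
slopes = []
for _ in range(5):
    ransac = linear_model.RANSACRegressor(min_samples=2, random_state=42)
    ransac.fit(x, y)
    slopes.append(ransac.estimator_.coef_.ravel()[0])

print(slopes)  # all five values are identical
```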

Here are some observations:

Increasing `min_samples` makes the model much more stable. I believe the model finds it difficult to decide what counts as an outlier when each random sample contains only two points.

I dropped this into a for-loop to iterate over a range of `min_samples` values, and this really steadied the model. Here it is:

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import r2_score, mean_squared_error

# Data (repeated 10 times)
x = [-0.099, 0.094, 1.313, 0.096, 0.089, -0.002, 0.11, 2.144, -0.004,
     0.245, 0.666, 1.003, 0.719, 1.936, 0.098, 0.394, 0.357, 1.284,
     -0.027, 0.123, 1.837, 0.853, 1.067, 0.126, 0.078, -0.001, 0.318,
     0.399, 1.426, 1.497, 1.086, -0.1, 1.296, 0.447, 0.236, 0.263,
     1.559, 1.12, 1.239, 1.092, 1.178, 1.915, -0.103, 0.156, 0.532,
     1.871, -1.217, 0.38] * 10
y = [-4.50647, -3.85937, -3.58987, -3.77997, -3.75957, -3.82067,
     -3.74137, -3.07317, -3.74397, -3.69227, -3.72517, -3.66097,
     -3.87547, -3.16167, -3.86357, -3.81467, -3.69137, -3.51477,
     -3.85237, -3.81257, -0.58567, -3.81447, -3.68717, -3.67937,
     -3.80357, -3.85067, -3.67037, -3.70257, -2.66287, -3.56267,
     -3.09977, 1.24923, -3.707151, -3.948651, -3.725051, -3.917951,
     -3.586151, -3.808251, -3.727151, -3.780851, -3.611551, -3.412488,
     -3.852788, -3.737588, -3.826488, -3.437088, -4.066888, -3.861488] * 10

# Proper shape
x, y = np.array(x).reshape(-1, 1), np.array(y).reshape(-1, 1)

# Fit linear model with RANSAC algorithm;
# run the regression 10 times per min_samples value
for n in range(2, 25):
    RMSE = []
    ransac = linear_model.RANSACRegressor(min_samples=n, max_trials=10000000)
    for _ in range(10):
        ransac.fit(x, y)

        # Estimated coefficients
        m, c = float(ransac.estimator_.coef_), float(ransac.estimator_.intercept_)
#         print("\nm={:.3f}, c={:.3f}".format(m, c))

        # R^2 & RMSE statistics
        inlier_mask = ransac.inlier_mask_
        x_accpt, y_accpt = x[inlier_mask], y[inlier_mask]
        y_predict = m * x_accpt + c
        R2 = r2_score(y_accpt, y_predict)
        RMSE.append(np.sqrt(mean_squared_error(y_accpt, y_predict)))
    print('min_samples: %d' % n, 'std: %f' % np.std(RMSE))
```

and the output,

```
min_samples: 2 std: 0.003821
min_samples: 3 std: 0.003895
min_samples: 4 std: 0.002477
min_samples: 5 std: 0.002746
min_samples: 6 std: 0.003013
min_samples: 7 std: 0.002673
min_samples: 8 std: 0.002763
min_samples: 9 std: 0.001543
min_samples: 10 std: 0.001367
min_samples: 11 std: 0.001390
min_samples: 12 std: 0.001390
min_samples: 13 std: 0.001486
min_samples: 14 std: 0.001390
min_samples: 15 std: 0.001214
min_samples: 16 std: 0.001390
min_samples: 17 std: 0.000000
min_samples: 18 std: 0.000910
min_samples: 19 std: 0.000000
```

and a graph of the trend, where `x` is `min_samples` and `y` is the standard deviation of the `RMSE`.
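The falling trend is also consistent with standard RANSAC theory: the number of trials needed to draw at least one all-inlier sample of size `s` with confidence `p` is `N = log(1 - p) / log(1 - w^s)`, where `w` is the inlier fraction. With roughly 90% inliers (an assumed figure for this data), even `s = 17` needs only a few dozen trials, so 10 million trials is far more than enough; the remaining run-to-run variation likely comes from which consensus set happens to win, not from too few iterations. A quick sketch:

```python
import math

def ransac_trials(p, w, s):
    """Trials needed so that, with probability p, at least one random
    sample of size s consists entirely of inliers, given inlier
    fraction w (classic RANSAC sample-count formula)."""
    return math.ceil(math.log(1 - p) / math.log(1 - w ** s))

# With ~90% inliers, required trial counts stay tiny even for large s:
for s in (2, 10, 17):
    print(s, ransac_trials(0.99, 0.9, s))  # e.g. s=2 needs only 3 trials
```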
