I have a dataset that I have split into 3 parts: a training set, a cross-validation set, and a test set. I have used the training set and cross-validation set to train 2 models. For this, I have taken the following steps:
Step 1: Determining optimal parameters.
For example, finding the optimal depth for a Decision Tree. For this, I use the following code:
```python
import pandas as pd
from sklearn import tree

# Fit one tree per candidate depth and score it on the held-out
# cross-validation set
scores = pd.Series(0, index=range(1, 60, 5), dtype=float)
for depth in range(1, 60, 5):
    clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=depth)
    clf.fit(X_train, Y_train)
    scores[depth] = clf.score(X_cv, Y_cv)
```
Then, I choose the depth with the best score.
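Concretely (assuming the `scores` Series built above), that selection is:

```python
# idxmax() returns the index label (here, the depth) of the highest score
best_depth = scores.idxmax()
```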
Step 2: Cross-validating optimal parameters
For this, I use the following code:
```python
from sklearn import tree
from sklearn.model_selection import cross_val_score

# cross_val_score clones and fits the estimator on each fold, so no
# explicit fit() call is needed beforehand
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=best_depth)
scores = cross_val_score(clf, X_train, Y_train, cv=10, n_jobs=-1)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```
Step 3: Analyzing bias and variance:
For this, I use learning curves generated by the following function:
```python
import numpy as np
from sklearn.model_selection import learning_curve

# Train on 10%, 20%, ..., 100% of the training set
sze = np.arange(0.1, 1.1, 0.1)
train_sizes, train_scores, valid_scores = learning_curve(
    clf, X_train, Y_train, train_sizes=sze, cv=10, n_jobs=-1)
```
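One common way to read bias and variance off these curves (a sketch, assuming matplotlib is available) is to average the fold scores and plot both curves: two low, converging curves suggest high bias, while a large persistent gap between them suggests high variance.

```python
import matplotlib.pyplot as plt

# Average over the 10 CV folds (axis 1 indexes the folds)
train_mean = train_scores.mean(axis=1)
valid_mean = valid_scores.mean(axis=1)

plt.plot(train_sizes, train_mean, 'o-', label='training score')
plt.plot(train_sizes, valid_mean, 'o-', label='validation score')
plt.xlabel('Training set size')
plt.ylabel('Score')
plt.legend()
plt.show()
```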
Question:
I have done these 3 steps for 2 different models (Decision Tree and K-Nearest Neighbors). Now I would like to compare the models. So far, I have computed their scores on the unseen test set (using scikit-learn's score function).
Are there any other ways to compare the performance of 2 models?
Best Answer
First, you need a distance function. It takes two inputs: the value of an observation from the test set, and the predicted value for the same observation based on some model. The output is $\ge 0$.
For continuous variables, people usually use the squared distance: $D(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$. But you could use anything else you want, for example the absolute distance: $D(y_i, \hat{y}_i) = |y_i - \hat{y}_i|$.
The distance tells you by how much the model missed the data. You should really ask yourself, "If the reality is $y_i$, but I think that it's $\hat{y}_i$, how bad would that be?" You can, and should, create a distance function that fits your problem.
You might want to look at a relative (log-scale) difference, which behaves like a percent difference for small errors: $D(y_i, \hat{y}_i) = |\log(y_i) - \log(\hat{y}_i)|$. Or, maybe if the difference is small, you don't care; then set $D = 0$ in those cases.
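As a concrete sketch (the function names here are placeholders, not from the answer), these distances are one-liners in Python:

```python
import numpy as np

# Squared distance: penalizes large misses heavily
def d_squared(y, y_hat):
    return (y - y_hat) ** 2

# Absolute distance: linear penalty
def d_abs(y, y_hat):
    return np.abs(y - y_hat)

# Log-scale (relative) distance with a dead zone: differences at or
# below `tol` are treated as "don't care" and mapped to 0
def d_log(y, y_hat, tol=0.0):
    d = np.abs(np.log(y) - np.log(y_hat))
    return np.where(d <= tol, 0.0, d)
```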
Next, you will aggregate $D$ across observations. With $D(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$, people usually use the mean, which gives you the mean squared error (MSE); taking the square root of that (to get out of the squared units) gives the root mean squared error (RMSE). With $D(y_i, \hat{y}_i) = |y_i - \hat{y}_i|$, people often use the median, giving you the median absolute error.
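Both of these standard aggregates are available directly in scikit-learn (assuming arrays Y_test of true values and Y_pred of predictions):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, median_absolute_error

rmse = np.sqrt(mean_squared_error(Y_test, Y_pred))   # root mean squared error
medae = median_absolute_error(Y_test, Y_pred)        # median absolute error
```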
For a general distance function, I would use the median or some quantile higher than 0.5. The median gives you the distance for the "typical" case. A higher quantile gives you the distance for a "bad" case.
Finally, to select between two models, just pick the one with the lowest aggregate distance.
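Putting the pieces together, a minimal comparison sketch (the fitted models tree_clf and knn_clf are placeholders, and d_abs is the absolute distance from the sketch above) might look like this:

```python
import numpy as np

def aggregate_distance(y, y_hat, dist, q=0.5):
    """Aggregate a pointwise distance with a quantile (q=0.5 is the median)."""
    return np.quantile(dist(y, y_hat), q)

# Evaluate both fitted models on the held-out test set; a higher quantile
# (e.g. q=0.9) focuses on the "bad" cases rather than the typical one
for name, model in [('tree', tree_clf), ('knn', knn_clf)]:
    y_hat = model.predict(X_test)
    print(name, aggregate_distance(Y_test, y_hat, d_abs, q=0.9))

# Prefer the model with the lower aggregate distance.
```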