I've been studying (and applying) SVMs for some time now, mostly through kernlab in R. kernlab allows probabilistic estimation of the outcomes through Platt Scaling, but the same could be achieved with a Pool Adjacent Violators (PAV) isotonic regression (Zadrozny and Elkan, 2002).
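For reference, the core idea behind Platt Scaling is just fitting a two-parameter sigmoid to the SVM decision values via logistic regression. A minimal sketch with made-up stand-in data (kernlab's prob.model=TRUE does a regularized version of this internally):

dec <- rnorm(100)                          # stand-in for SVM decision values
y   <- rbinom(100, 1, plogis(3 * dec))     # stand-in 0/1 labels
platt <- glm(y ~ dec, family = binomial)   # sigmoid fit: p = 1/(1+exp(-(a*dec+b)))
head(predict(platt, type = "response"))    # calibrated probabilities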
I've been wrapping my head around this and came up with some code (clunky, but it works, or so I think) to try the PAV algorithm. I divided the task into three pairwise binary classification tasks, estimated the probabilities on the training data, and coupled the pairwise probabilities to get class probabilities (Wu, Lin, and Weng, 2004); a toy illustration of the coupling step follows below.
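This is kernlab's actual couple() function, fed a made-up row of pairwise probabilities for three classes, assuming the same column ordering I use in the full code further down:

library(kernlab)
# One observation with made-up pairwise probabilities:
# P(setosa | setosa vs versicolor), P(setosa | setosa vs virginica),
# P(versicolor | versicolor vs virginica)
couple(matrix(c(0.9, 0.8, 0.6), nrow = 1))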
Predictions were made on the training set. I set the cost really low (C=0.001) to try to get some misclassifications.
The Brier Score is defined as:
$$BS = \frac{1}{N}\sum_{t=1}^{N}\sum_{i=1}^{R}(f_{ti}-o_{ti})^2$$
Where $R$ is the number of classes, $N$ is the number of instances, $f_{ti}$ is the forecast probability of the $t$-th instance belonging to the $i$-th class, and $o_{ti}$ is $1$ if the actual class $y_t$ equals $i$ and $0$ otherwise.
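For completeness, this definition is a one-liner in R. A sketch, assuming f is the N×R matrix of forecast probabilities and o the matching one-hot outcome matrix (which could be built with, e.g., o <- model.matrix(~ Species - 1, iris)):

# Brier score as defined above: mean over instances of the summed
# squared differences between forecast probabilities and one-hot outcomes
brier <- function(f, o) sum((f - o)^2) / nrow(f)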
require(isotone)
require(kernlab)

## PAVA setosa/versicolor
data1 <- iris[1:100,]                      # only setosa and versicolor
MR1 <- c(rep(0,50), rep(1,100))            # target probabilities
KSVM1 <- ksvm(Species~., data=data1, type="C-svc", kernel="rbfdot", C=.001)
PRED1 <- predict(KSVM1, iris, type="decision")  # SVM decision function
PAVA1 <- gpava(PRED1, MR1)                 # generalized pool adjacent violators algorithm

## PAVA setosa/virginica
data2 <- iris[c(1:50,101:150),]            # only setosa and virginica
MR2 <- c(rep(0,50), rep(1,50), rep(0,50))  # target probabilities
KSVM2 <- ksvm(Species~., data=data2, type="C-svc", kernel="rbfdot", C=.001)
PRED2 <- predict(KSVM2, iris, type="decision")
PAVA2 <- gpava(PRED2, MR2)

## PAVA versicolor/virginica
data3 <- iris[51:150,]                     # only versicolor and virginica
MR3 <- c(rep(0,100), rep(1,50))            # target probabilities
KSVM3 <- ksvm(Species~., data=data3, type="C-svc", kernel="rbfdot", C=.001)
PRED3 <- predict(KSVM3, iris, type="decision")
PAVA3 <- gpava(PRED3, MR3)

## Usual pairwise binary SVM; probabilities on the training data
## through Platt scaling and pairwise coupling
KSVM <- ksvm(Species~., data=iris, type="C-svc", kernel="rbfdot", C=.001, prob.model=TRUE)
PRED <- predict(KSVM, iris, type="probabilities")

# The usual ksvm response based on the sign of the decision function
RES <- predict(KSVM, iris)

# Pairwise probability coupling algorithm in kernlab
PROBS <- kernlab::couple(cbind(1-PAVA1$x, 1-PAVA2$x, 1-PAVA3$x))
colnames(PROBS) <- c("setosa","versicolor","virginica")

# Brier score, multiclass definition
BRIER.PAVA <- sum(
  (cbind(rep(1,50),rep(0,50),rep(0,50)) - PROBS[1:50,])^2,
  (cbind(rep(0,50),rep(1,50),rep(0,50)) - PROBS[51:100,])^2,
  (cbind(rep(0,50),rep(0,50),rep(1,50)) - PROBS[101:150,])^2) / 150

# Brier score, multiclass definition
BRIER.PLATT <- sum(
  (cbind(rep(1,50),rep(0,50),rep(0,50)) - PRED[1:50,])^2,
  (cbind(rep(0,50),rep(1,50),rep(0,50)) - PRED[51:100,])^2,
  (cbind(rep(0,50),rep(0,50),rep(1,50)) - PRED[101:150,])^2) / 150

BRIER.PAVA
BRIER.PLATT
Soon I'll clean this up a bit and write a proper wrapper function to do it all, but this result is really worrisome for me.
> BRIER.PAVA
[1] 0.09801759
> BRIER.PLATT
[1] 0.6710232
The Brier Score I got from the probabilities estimated through PAVA is far better than the one from Platt Scaling. If you check PRED, you will see all probabilities fall in the ~0.33 range, while PROBS contains much more extreme values (close to 1 or 0), which was quite unexpected to me, as I'm using a really low C.
References:

- Wu, T.-F., Lin, C.-J., and Weng, R. C. (2004). Probability Estimates for Multi-class Classification by Pairwise Coupling. Journal of Machine Learning Research, 5, 975–1005.
- Zadrozny, B. and Elkan, C. (2002). Transforming Classifier Scores into Accurate Multiclass Probability Estimates. Proceedings of KDD 2002.
EDIT:
Also, if you check the AUCs of the different probability estimates, they are quite high.
require(caTools)
AUC.PAVA <- caTools::colAUC(PROBS, iris$Species)
AUC.PLATT <- caTools::colAUC(PRED, iris$Species)
colMeans(AUC.PAVA)
colMeans(AUC.PLATT)
And here are the results:
> colMeans(AUC.PAVA)
    setosa versicolor  virginica 
 0.9988667  0.9988667  0.8455333 
> colMeans(AUC.PLATT)
    setosa versicolor  virginica 
 0.8913333  0.8626667  0.9656000 
Looking at these AUCs, I would say Platt Scaling is a really underconfident technique here.
Best Answer
Isotonic regression tends to overfit small data, while Platt Scaling is far more constrained, since it only fits a two-parameter sigmoid via logistic regression. On large data they converge (I tested this on simulated large data). Since in my example above I train and test on the same data, the isotonic fit is obviously overfitted.
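A quick way to see the overfitting point on simulated data (a sketch with made-up decision values: gpava() can nearly interpolate the training labels, while the two-parameter sigmoid cannot):

library(isotone)
set.seed(1)
z <- sort(rnorm(50))                          # stand-in decision values
y <- rbinom(50, 1, plogis(2 * z))             # noisy 0/1 labels
iso <- gpava(z, y)$x                          # isotonic fit (step function)
sig <- fitted(glm(y ~ z, family = binomial))  # Platt-style sigmoid fit
mean((iso - y)^2)  # training Brier score: isotonic, typically lower
mean((sig - y)^2)  # training Brier score: sigmoid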