# Solved – Why is the Brier score better when probabilities are estimated through PAVA instead of Platt scaling?

I've been studying (and applying) SVMs for some time now, mostly through `kernlab` in `R`.

`kernlab` allows probabilistic estimation of the outcomes through Platt Scaling, but the same could be achieved with a Pool Adjacent Violators (PAV) isotonic regression (Zadrozny and Elkan, 2002).
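To make the two calibration approaches concrete, here is a minimal sketch on made-up decision values. It uses base R only: `glm` with a logistic link as a simplified stand-in for Platt scaling (Platt's method also smooths the 0/1 targets, which is omitted here), and `stats::isoreg` as a stand-in for the PAV fit. The `scores` and `y` vectors are invented for illustration and do not come from the iris models.

```r
# Toy decision values (already sorted) and binary labels; both are made up.
scores <- c(-2, -1, -0.5, 0.3, 1, 2)
y      <- c( 0,  0,  1,   0,  1, 1)

# Simplified Platt scaling: a logistic regression of the label on the score.
platt_fit   <- glm(y ~ scores, family = binomial())
platt_probs <- as.numeric(fitted(platt_fit))

# Isotonic (PAV) calibration: a non-decreasing step function of the score.
# isoreg() returns fitted values in increasing order of the scores,
# which matches the input order here because scores is sorted.
iso_fit   <- isoreg(scores, y)
iso_probs <- iso_fit$yf

round(cbind(scores, y, platt_probs, iso_probs), 3)
```

On this toy data the isotonic fit produces exact 0s and 1s at the ends, while the logistic fit stays strictly inside (0, 1).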

I've been wrapping my head around this and came up with some code (clunky, but I think it works) to try the PAV algorithm.

I divided the task into three pairwise binary classification tasks, estimated the probabilities on the training data, and coupled the pairwise probabilities to get class probabilities (Wu, Lin, and Weng, 2004).
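As a rough illustration of what coupling does, the sketch below turns three pairwise estimates into one class-probability vector. It uses a simple closed-form rule (often called PKPD), not the Wu, Lin, and Weng method cited above, and the function name `couple_pkpd`, the `eps` clipping, and the pairwise values are all assumptions made for this example.

```r
# Couple pairwise probabilities into class probabilities with the PKPD rule.
# r[i, j] is the estimated P(class i | class i or j); the diagonal is unused.
couple_pkpd <- function(r, eps = 1e-7) {
  k <- nrow(r)
  r <- pmin(pmax(r, eps), 1 - eps)   # clip to avoid division by zero
  p <- numeric(k)
  for (i in seq_len(k)) {
    p[i] <- 1 / (sum(1 / r[i, -i]) - (k - 2))
  }
  p / sum(p)                         # normalise so the result sums to 1
}

# Made-up pairwise estimates for three classes
r <- matrix(NA_real_, 3, 3)
r[1, 2] <- 0.9; r[1, 3] <- 0.8; r[2, 3] <- 0.4
r[lower.tri(r)] <- 1 - t(r)[lower.tri(r)]   # r[j, i] = 1 - r[i, j]
class_probs <- couple_pkpd(r)
class_probs
```

The clipping step matters for isotonic estimates in particular, since a PAV fit can return pairwise probabilities of exactly 0 or 1.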

Predictions were made on the training set. I set the cost parameter very low (`C=0.001`) to try to get some misclassifications.

The Brier Score is defined as:

\$\$BS=\frac{1}{N}\sum_{t=1}^{N}\sum_{i=1}^{R}(f_{ti}-o_{ti})^2\$\$

Where \$R\$ is the number of classes, \$N\$ is the number of instances, \$f_{ti}\$ is the forecast probability of the \$t\$-th instance belonging to the \$i\$-th class, and \$o_{ti}\$ is \$1\$, if the actual class \$y_t\$ is equal to \$i\$ and \$0\$, if the class \$y_t\$ is different from \$i\$.
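The definition above can be written as a small helper. This is a sketch, not part of the original code: the name `brier_score` is hypothetical, and it assumes the probability matrix has one column per class with column names matching the class labels.

```r
# Multiclass Brier score: mean over instances of the summed squared
# differences between forecast probabilities and one-hot outcomes.
brier_score <- function(probs, labels) {
  probs  <- as.matrix(probs)
  labels <- factor(labels, levels = colnames(probs))
  stopifnot(!anyNA(labels), nrow(probs) == length(labels))
  # One-hot indicator matrix: 1 where the column is the true class, else 0
  outcome <- outer(as.integer(labels), seq_len(ncol(probs)), "==") * 1
  sum((probs - outcome)^2) / nrow(probs)
}

# Toy example with made-up forecasts for two instances and three classes
toy_probs <- matrix(c(0.8, 0.1, 0.1,
                      0.2, 0.6, 0.2),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(NULL, c("a", "b", "c")))
toy_labels <- c("a", "b")
toy_bs <- brier_score(toy_probs, toy_labels)
toy_bs   # (0.06 + 0.24) / 2 = 0.15
```

With a matrix of class probabilities whose columns are named after the iris species, a call such as `brier_score(PROBS, iris$Species)` should reproduce the block-by-block sums used in this post.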

```r
require(isotone)
require(kernlab)

## PAVA SET/VER
data1 <- iris[1:100, ]                              # only setosa and versicolor
MR1   <- c(rep(0, 50), rep(1, 100))                 # target probabilities
KSVM1 <- ksvm(Species ~ ., data = data1, type = "C-svc", kernel = "rbfdot", C = .001)
PRED1 <- predict(KSVM1, iris, type = "decision")    # SVM decision function
PAVA1 <- gpava(PRED1, MR1)                          # generalized pool adjacent violators algorithm

## PAVA SET/VIR
data2 <- iris[c(1:50, 101:150), ]                   # only setosa and virginica
MR2   <- c(rep(0, 50), rep(1, 50), rep(0, 50))      # target probabilities
KSVM2 <- ksvm(Species ~ ., data = data2, type = "C-svc", kernel = "rbfdot", C = .001)
PRED2 <- predict(KSVM2, iris, type = "decision")
PAVA2 <- gpava(PRED2, MR2)

## PAVA VER/VIR
data3 <- iris[51:150, ]                             # only versicolor and virginica
MR3   <- c(rep(0, 100), rep(1, 50))                 # target probabilities
KSVM3 <- ksvm(Species ~ ., data = data3, type = "C-svc", kernel = "rbfdot", C = .001)
PRED3 <- predict(KSVM3, iris, type = "decision")
PAVA3 <- gpava(PRED3, MR3)

# Usual pairwise binary SVM
KSVM <- ksvm(Species ~ ., data = iris, type = "C-svc", kernel = "rbfdot", C = .001, prob.model = TRUE)

# Probabilities on the training data through Platt scaling and pairwise coupling
PRED <- predict(KSVM, iris, type = "probabilities")

# The usual KSVM response based on the sign of the decision function
RES <- predict(KSVM, iris)

# Pairwise probabilities coupling algorithm in kernlab
PROBS <- kernlab::couple(cbind(1 - PAVA1$x, 1 - PAVA2$x, 1 - PAVA3$x))
colnames(PROBS) <- c("setosa", "versicolor", "virginica")

# Brier score, multiclass definition (PAVA)
BRIER.PAVA <- sum(
  (cbind(rep(1, 50), rep(0, 50), rep(0, 50)) - PROBS[1:50, ])^2,
  (cbind(rep(0, 50), rep(1, 50), rep(0, 50)) - PROBS[51:100, ])^2,
  (cbind(rep(0, 50), rep(0, 50), rep(1, 50)) - PROBS[101:150, ])^2) / 150

# Brier score, multiclass definition (Platt)
BRIER.PLATT <- sum(
  (cbind(rep(1, 50), rep(0, 50), rep(0, 50)) - PRED[1:50, ])^2,
  (cbind(rep(0, 50), rep(1, 50), rep(0, 50)) - PRED[51:100, ])^2,
  (cbind(rep(0, 50), rep(0, 50), rep(1, 50)) - PRED[101:150, ])^2) / 150

BRIER.PAVA
BRIER.PLATT
```

Soon I'll clean this up a bit and write a proper wrapper function to do it all, but this result really worries me.

```
BRIER.PAVA    0.09801759
BRIER.PLATT   0.6710232
```

The Brier score I got from the probabilities estimated through PAVA is far better (lower) than the one I get from Platt scaling.

If you check `PRED`, you will see that all probabilities fall around 0.33, while `PROBS` contains more extreme values (near 1 or 0), which was quite unexpected to me as I'm using a really low `C`.
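As a point of reference, a forecast of exactly 1/3 for every class gives the same Brier score for every instance under the definition above, whatever the true label is. The short calculation below is only that arithmetic, not a rerun of the models:

```r
R <- 3   # number of classes
# One class is the true one (error 1 - 1/R); the other R - 1 are not (error 1/R).
uniform_bs <- (1 - 1/R)^2 + (R - 1) * (1/R)^2
uniform_bs   # 4/9 + 2/9 = 2/3
```

The result, about 0.667, is very close to the `BRIER.PLATT` value of 0.671 reported above, which is consistent with the Platt probabilities being nearly uniform.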

References:

Zadrozny, B., and Elkan, C. "Transforming classifier scores into accurate multiclass probability estimates." Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2002.

Wu, T.-F., Lin, C.-J., and Weng, R. C. "Probability estimates for multi-class classification by pairwise coupling." The Journal of Machine Learning Research 5 (2004): 975-1005.

EDIT:

Also, if you check the AUC of the different probabilities, they are quite high.

```r
require(caTools)

AUC.PAVA  <- caTools::colAUC(PROBS, iris$Species)
AUC.PLATT <- caTools::colAUC(PRED, iris$Species)

colMeans(AUC.PAVA)
colMeans(AUC.PLATT)
```

And here's the result:

```
> colMeans(AUC.PAVA)
    setosa versicolor  virginica
 0.9988667  0.9988667  0.8455333
> colMeans(AUC.PLATT)
    setosa versicolor  virginica
 0.8913333  0.8626667  0.9656000
```

Looking at these AUCs, I would say Platt scaling is a really underconfident technique.
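The combination of a high AUC and a poor Brier score is possible because AUC depends only on how the probabilities rank the instances, not on their magnitude. The sketch below shows this on made-up binary forecasts; the helper names `auc` and `brier_bin` are hypothetical, and `brier_bin` scores only the positive-class probability rather than summing over both classes.

```r
# Rank-based AUC (Mann-Whitney form); tied scores get average ranks.
auc <- function(p, y) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Binary Brier score on the positive-class probability only
brier_bin <- function(p, y) mean((p - y)^2)

y       <- c(0, 0, 1, 0, 1, 1)
p_sharp <- c(0.05, 0.10, 0.60, 0.40, 0.90, 0.95)   # made-up forecasts
p_flat  <- 0.5 + 0.1 * (p_sharp - 0.5)             # same ordering, squeezed toward 0.5

c(auc_sharp = auc(p_sharp, y), auc_flat = auc(p_flat, y))
c(brier_sharp = brier_bin(p_sharp, y), brier_flat = brier_bin(p_flat, y))
```

Both forecasts have the same AUC because their ordering is identical, but the squeezed one has a much worse Brier score.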
