I can't figure out how to interpret the `Prediction` column in the results of `pretty.gbm.tree` called on a single tree from a `gbm` model trained on a binary outcome with the `bernoulli` loss function. I'm using `gbm` v2.1.1.

**Example**

```r
data('ptitanic', package='rpart.plot') # note this is not the default data(Titanic)
ptitanic$died <- 2 - as.integer(ptitanic$survived) # survived is fctr w/ 2 levels died/survived
mean(ptitanic$died) # 0.618 death rate
form <- as.formula('died ~ sex + age + pclass + sibsp + parch')
library('gbm')
set.seed(1)
m <- gbm(form, distribution = 'bernoulli', data = ptitanic,
         interaction.depth = 4, n.trees = 50)
summary(m)
mean(predict(m, ptitanic, type='response', n.trees=50)) # 0.618 death rate
# let's look at the 1st tree
t <- pretty.gbm.tree(m, i = 1)
# I want to see the split variable names instead of indices.
# The indices are -1 for terminal, 0 for the first term, 1 for the second term, etc.
t$SplitVar <- c('Terminal', attr(terms(form), 'term.labels'))[t$SplitVar + 2]
# The predictions at the nodes look like:
head(t$Prediction)
# [1] -2.066845e-05 -1.472631e-03 -2.374948e-03 -4.808952e-04 -1.472631e-03  7.829118e-04
```

**What I have tried:**

According to `?predict.gbm`, the function will return a vector of log-odds:

> Returns a vector of predictions. By default the predictions are on the scale of f(x). For example, for the Bernoulli loss the returned value is on the log odds scale, …

So it seems like `Prediction` in the tree ought to be log-odds, and I should be able to get the probability estimate with:

$x = \ln\frac{p}{1-p}$

$p = \frac{1}{\frac{1}{e^{x}}+1}$

i.e.:

```r
t$OR <- exp(t$Prediction)
t$Prob <- 1 / (1/t$OR + 1)
head(t$Prob)
# [1] 0.5000094 0.4996384 0.4994387 0.4998654 0.4996384 0.5002175
```

What is very strange is that the odds ratio at the root node is ~1, or p = 0.5 — despite the overall death rate of 61.8% mentioned above. So maybe it is trying to tell me relative risk?

There is a somewhat cryptic detail in `?predict.gbm`:

> The predictions from gbm do not include the offset term. The user may add the value of the offset to the predicted value if desired.

Is adding an offset something I need to do when looking at the very first tree? (I could maybe understand the need to do additive effects in subsequent trees, but the first one?) If so, how do I do that?


#### Best Answer

There's a subtlety about how the gbm algorithm works that you are missing, and it's leading to your confusion.

The `predict` method returns predictions from the entire boosted model, and these are indeed log-odds when fitting to minimize the Bernoulli deviance. On the other hand, the predictions from the individual trees are *not* log-odds; they are something quite different.

Indeed, while the entire model in total is fit to predict the response, the individual trees are not: each tree is fit to predict *the gradient of the loss function evaluated at the current prediction and the response*. This is the "gradient" part of "gradient boosting".

Here's a minimal example that will hopefully clarify what is going on. I'll use a booster minimizing the Gaussian loss function to keep the math simple and focus on the important concepts.

```r
x <- seq(0, 1, length.out = 6)
y <- c(0, 0, 0, 1, 1, 1)
df <- data.frame(x = x, y = y)
M <- gbm(y ~ x, data = df, distribution = "gaussian",
         n.trees = 1, bag.fraction = 1.0,
         n.minobsinnode = 1, shrinkage = 1.0)
t <- pretty.gbm.tree(M, i = 1)
t[, c("SplitVar", "LeftNode", "RightNode", "MissingNode", "Prediction")]
```

Which looks like

```
  SplitVar LeftNode RightNode MissingNode Prediction
0        0        1         2           3        0.0
1       -1       -1        -1          -1       -0.5
2       -1       -1        -1          -1        0.5
3       -1       -1        -1          -1        0.0
```

Let me break this down. In this model, we are minimizing the following loss function:

$$ L(y, \hat y) = \frac{1}{2} (y - \hat y)^2 $$

The negative gradient with respect to the prediction is:

$$ -\nabla L (y, \hat y) = y - \hat y $$

The tree is fit to predict the value of *this* function.

In detail, the model starts out at the zeroth stage predicting a constant, the mean response. In our example data, the mean response is $0.5$. Then, the negative gradient of the loss function is evaluated at the data and current predictions:

```r
grad <- function(y, current_preds) {
  y - current_preds
}
grad(y, 0.5)
```

which results in

```
[1] -0.5 -0.5 -0.5  0.5  0.5  0.5
```

Now you can see what has happened by comparing the tree predictions to this. The tree predictions have recovered exactly the gradients.
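As a quick sanity check (a sketch reusing the `t` and `grad` objects defined above), the terminal-node predictions hold exactly the two distinct gradient values:

```r
# Terminal nodes have SplitVar == -1; the last one (node 3) is the
# surrogate node for missing x and stays at 0.0.
t$Prediction[t$SplitVar == -1]   # -0.5  0.5  0.0
unique(grad(y, 0.5))             # -0.5  0.5
```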

The same thing is true in the case of a Bernoulli model, though the details are more complex. Here the model maximizes the Bernoulli log-likelihood (equivalently, it minimizes the deviance):

$$ L(y, f) = y f - \log(1 + e^f) $$

Note here that $f$ is the predicted log-odds, not the probability. The gradient is

$$ \nabla L(y, f) = y - \frac{1}{1 + e^{-f}} $$

and it is *this* that the predictions from the trees are approximating.
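To connect this back to the Titanic data, here is a minimal sketch. It assumes, as the Gaussian example above suggests, that gbm starts from a constant fit at the log-odds of the mean response, so that the first tree's targets are the gradients $y - \bar p$:

```r
# Sketch: reconstruct the initial fit and the gradients the first tree targets.
# Assumes ptitanic and died from the question's code above.
p_bar <- mean(ptitanic$died)               # 0.618
f0 <- log(p_bar / (1 - p_bar))             # initial constant log-odds fit
g  <- ptitanic$died - 1 / (1 + exp(-f0))   # gradient: y - p_bar
range(g)                                   # roughly -0.618 and 0.382
```

The `Prediction` values in the question's tree are, roughly, these gradients averaged within each leaf (gbm applies a Newton-style adjustment for the Bernoulli case) and then scaled by the shrinkage parameter, which was left at its small default in the question's fit. That is why the printed values sit so close to zero, and why pushing them through the inverse-logit gives probabilities near 0.5 regardless of the 61.8% death rate.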

In short, the predictions from the trees in a gradient booster are not interpretable by comparison to the response; you must take great care when interpreting the internal structure of a boosted model.
