Suppose I am doing random forest classification with labels $A$, $B$, $C$, $D$. There is a theoretical ordering to this output such that when $A$ is more likely than $B$, $B$ is also more likely than $C$, and so on. Likewise, if $P(D) > P(C)$, then $P(C) > P(B) > P(A)$. There are other such conditions that need to be met.

The issue is that a real random forest may give something silly that completely violates the above constraints, *even if* it is able to predict the most likely outcome successfully. For my use case the ordering is important since decisions are made not only on the most likely outcome.

It also seems intuitive that I should be able to improve generalization if I can somehow enforce this prior knowledge into the model.

How do I account for this in a decision forest? Despite this structure to the output I do not think it is possible to construct a real-valued response variable since they are *still class labels* with no natural real value, even if there is some type of ordering to them.

#### Best Answer

Here is one possibility: add a constraint to the optimization of the impurity index (e.g. Gini index or entropy) for the individual trees in the forest:

$$\min \sum_i D_i \quad \text{with} \quad D_i = 1 - \sum_k p_{ik}^2$$

$$\text{s.t.} \quad p_{ik} \ge p_{i(k-1)} \ge \dots \ge p_{i0}$$

where $k$ indexes the observation type (class), $i$ indexes the terminal node, and $p_{ik}$ is the proportion of class $k$ in node $i$. That way your forest should yield results consistent with the ordering as well. I guess you could relax the condition by introducing slack variables: $\min \zeta_i$ with $p_0 > \zeta_0 > 0$, $p_0 - \zeta_0 \le p_1 - \zeta_1$, etc. for the other probabilities.
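To make the objective and the constraint concrete, here is a minimal sketch for a single terminal node. It is not the constrained tree-growing procedure described above (standard random forest libraries do not expose that, as far as I know). It swaps in a simpler post-hoc step instead: compute the Gini index of a node's class proportions, then replace them with the closest (least-squares) proportions that satisfy the non-decreasing ordering $p_{i0} \le \dots \le p_{ik}$, using the pool-adjacent-violators algorithm. The function names `gini` and `project_to_nondecreasing` and the example proportions are made up for illustration.

```python
def gini(proportions):
    """Gini index D_i = 1 - sum_k p_ik^2 for one node."""
    return 1.0 - sum(p * p for p in proportions)


def project_to_nondecreasing(proportions):
    """Least-squares projection onto p_0 <= p_1 <= ... <= p_k.

    Uses pool-adjacent-violators: neighbouring entries that break the
    ordering are pooled and replaced by their mean. Pooling keeps the
    total unchanged, so a probability vector stays a probability vector.
    """
    blocks = []  # each block is [sum of values, number of values]
    for value in proportions:
        blocks.append([value, 1])
        # Merge backwards while the previous block mean exceeds the last one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    projected = []
    for total, count in blocks:
        projected.extend([total / count] * count)
    return projected


# Hypothetical class proportions in one leaf, ordered as the theory expects.
leaf = [0.1, 0.4, 0.2, 0.3]
constrained = project_to_nondecreasing(leaf)
print("raw:", leaf, "Gini:", gini(leaf))
print("constrained:", constrained, "Gini:", gini(constrained))
```

Pooling can only make the proportions more even, so the constrained node never has a lower Gini index than the raw one. That is the price of the constraint that the optimization above would trade off against purity.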

But if your data are correct and the condition really holds, an unconstrained forest should already yield results that are consistent with it. If you fit an unconstrained forest with enough trees and do not observe $P(A) < P(B) < …$, it is quite likely that you are mixing non-comparable data sets or that the condition is simply not true.
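One way to run that check is to measure how often the predicted class probabilities break the expected ordering. The sketch below assumes each row holds the probabilities for one observation, with columns already arranged in the theoretical order (for example, rows taken from scikit-learn's `RandomForestClassifier.predict_proba`, whose columns follow `classes_`). The helper names and the example rows are hypothetical.

```python
def is_nondecreasing(row, tol=1e-12):
    """True if row[0] <= row[1] <= ... holds up to a small tolerance."""
    return all(row[j] <= row[j + 1] + tol for j in range(len(row) - 1))


def order_violation_rate(rows, tol=1e-12):
    """Fraction of probability rows that violate the expected ordering."""
    rows = list(rows)
    if not rows:
        raise ValueError("need at least one row of probabilities")
    violations = sum(1 for row in rows if not is_nondecreasing(row, tol))
    return violations / len(rows)


# Made-up predicted probabilities for four observations, columns A, B, C, D.
predicted = [
    [0.10, 0.20, 0.30, 0.40],  # consistent with P(A) <= P(B) <= P(C) <= P(D)
    [0.40, 0.30, 0.20, 0.10],  # reversed ordering
    [0.25, 0.25, 0.25, 0.25],  # ties count as consistent here
    [0.10, 0.40, 0.20, 0.30],  # B and C out of order
]
print(order_violation_rate(predicted))
```

A rate close to zero would suggest the forest already respects the condition. A large rate points back to the data or to the assumed ordering itself.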
