# Solved – How to combine different predictions together

Is there a way to combine different predictions to create a prediction?

The way I am thinking about it is, a lot of times while learning different things, we learn small separate parts of a subject and are able to combine them together to get "the big picture".

Similarly, if I use some algorithms to learn different small things from a dataset, can I train a separate algorithm to create a big picture out of it?


What you are describing can be done with ensemble methods. Let me give you an overview. The goal is to train classifiers on subsets of the data set and combine them such that the classification rate is optimized in some way.

# Ensemble Methods

Ensemble methods are a type of meta algorithm based on the following idea: consider a multinational corporation facing a difficult decision, i.e. they have to decide between going with strategy \$A\$, \$B\$ or \$C\$. To make an informed decision, the executive board asks some experts for their advice. After being presented with the facts, each expert recommends a strategy. The executive board then collects the recommendations and decides to go with the strategy that was recommended most often. More formally, \$n\$ experts \$e_i\$ give \$n\$ responses \$v_i\$ (votes), which are accumulated over each of the \$l\$ possible decisions \$d_j\$:

\begin{equation} v_{i,j} = \begin{cases} 1 & e_{i} \text{ votes for } d_{j} \\ 0 & \text{else} \end{cases} \end{equation}

The total number of votes \$V_j\$ for each \$d_j\$ is then calculated as follows:

\begin{equation} V_j = \sum_{i=1}^{n} v_{i,j}, \quad j = 1, \dots, l \end{equation}

The final decision \$V_{\text{final}}\$ could then be made by taking the majority vote, i.e.

\begin{equation} V_{\text{final}} = \underset{j}{\operatorname{argmax}} \, V_j \end{equation}
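As a toy illustration, the unweighted majority vote can be sketched in a few lines of Python (the function name `majority_vote` is my own):

```python
from collections import Counter

def majority_vote(votes):
    """Return the decision with the most votes.

    `votes` is a list of decisions, one per expert. Ties are broken in
    favor of the winning decision that was seen first.
    """
    counts = Counter(votes)
    return counts.most_common(1)[0][0]

# Three experts vote on strategies A, B, C:
print(majority_vote(["A", "B", "A"]))  # A wins with 2 of 3 votes
```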

How can the executive board make sure that the decisions they are presented with can be trusted? It has to be ensured that members of this ensemble are selected such that:

1. They have gone through extensive training, i.e. they are truly experts.
2. Each covers a different area of expertise (diversity).
3. There are sufficiently many experts.

One could imagine an example where the executive board does not trust the experts equally. This can be expressed by weighted voting: a weighting factor \$\alpha_i \in [0, 1]\$ is introduced for each of the \$n\$ experts:

\begin{equation} V_j = \sum_{i=1}^{n} \alpha_i v_{i,j}, \quad j = 1, \dots, l \end{equation}
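Weighted voting is a small change to the sketch above; accumulating \$\alpha_i v_{i,j}\$ per decision could look like this (again, names are my own):

```python
from collections import defaultdict

def weighted_vote(votes, weights):
    """Accumulate weighted votes V_j = sum_i alpha_i * v_{i,j}.

    `votes[i]` is expert i's decision, `weights[i]` its weight alpha_i.
    Returns the decision with the largest weighted total.
    """
    totals = defaultdict(float)
    for decision, alpha in zip(votes, weights):
        totals[decision] += alpha
    return max(totals, key=totals.get)

# Two low-trust experts vote "A", one high-trust expert votes "B":
print(weighted_vote(["A", "A", "B"], [0.2, 0.2, 0.9]))  # "B" (0.9 > 0.4)
```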

Given that the experts can't always be right, i.e. they make errors, one might ask how to approximate the error rate of the entire ensemble from the individual error rates of the experts within. Let \$p\$ be the probability that any given expert is wrong, and assume the experts err independently. The probability that exactly \$m \leq n\$ of the \$n\$ experts give a wrong answer is then the probability of an event (being wrong) occurring in \$m\$ out of \$n\$ independent trials, each with probability \$p\$. This can be modeled as a sequence of Bernoulli trials and can therefore be computed using the binomial distribution:

\begin{equation} P(m) = \frac{n!}{m!(n-m)!}\, p^{m}(1-p)^{n-m} \end{equation}
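With this formula one can check numerically why ensembles help: a majority vote of many mediocre but independent experts is wrong far less often than any single expert. A small sketch, assuming an odd number of experts so that ties cannot occur:

```python
from math import comb

def p_m_wrong(n, m, p):
    """P(exactly m of n experts are wrong), each wrong independently with prob. p."""
    return comb(n, m) * p**m * (1 - p)**(n - m)

def p_majority_wrong(n, p):
    """Probability that more than half of n (odd) experts are wrong."""
    return sum(p_m_wrong(n, m, p) for m in range(n // 2 + 1, n + 1))

# 11 experts, each wrong 30% of the time: the majority vote is wrong
# only about 7.8% of the time.
print(round(p_majority_wrong(11, 0.3), 4))  # ≈ 0.0782
```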

This concept can be directly transferred to machine learning and statistics. In this context, the experts are called classifiers or (weak) learners. Ensembles can be used for classification, where the goal is to decide, for a given sample, the corresponding class. In the earlier example, the sample would be the current state of the corporation and the classes would be the possible strategies. An optimal class is then the strategy that best fits the current needs and goals of the corporation. Ensembles are also used for regression, the task of fitting a function to data with minimal error.

# Bootstrap Aggregation – Bagging

Proposed by Leo Breiman in 1996, bagging is a meta algorithm designed to improve classification results by training classifiers on randomly created subsets of the training data.

Assume we have a set \$T\$ of training data. The goal is to split \$T\$ such that the resulting subsets approximate the underlying distribution well. After the subsets are created, one classifier (e.g. a classification tree) is trained per subset. The subsets are created by uniformly drawing \$n_B\$ samples \$x \in T\$ with replacement. It follows that a given sample \$x_i\$ may appear multiple times in a subset or not at all. Aslam et al. showed in 2007 that when \$n_B = |T|\$, the expected fraction of unique samples in a bootstrap sample is \$1 - \frac{1}{e} \approx 63\%\$.

The resulting sample set \$T_B\$ is called the bootstrap sample. The set of bootstrap samples \$\{T_B\}\$ approximates the underlying distribution.
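The roughly 63% figure is easy to verify empirically; a quick simulation, with sample indices standing in for the elements of \$T\$:

```python
import random

random.seed(1)
n = 100_000
# Draw n samples with replacement from a set of size n and count uniques.
bootstrap = [random.randrange(n) for _ in range(n)]
frac_unique = len(set(bootstrap)) / n
print(round(frac_unique, 3))  # ≈ 0.632 ≈ 1 - 1/e
```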

The following algorithm shows the complete bagging procedure:

Data: data set \$T\$, set of classifiers \$L\$, number of iterations \$n_I\$, bootstrap sample size \$n_B\$

Result: trained ensemble \$L_E\$

```
foreach i ∈ {1, ..., n_I} do
    T_i ← bootstrap sample of size n_B drawn from T
    Train L_i on T_i
    L_E ← L_E ∪ {L_i}
end
return L_E
```
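A minimal, self-contained sketch of this procedure in Python, with a trivial threshold "stump" standing in for a real base classifier (all names here are my own; a practical implementation would use a library learner such as a decision tree):

```python
import random
from collections import Counter

class Stump:
    """A weak learner: thresholds a single 1-D feature at the training mean."""
    def fit(self, data):  # data: list of (x, label) pairs
        xs = [x for x, _ in data]
        self.threshold = sum(xs) / len(xs)
        # Assign the majority label on each side of the threshold.
        left = [y for x, y in data if x <= self.threshold]
        right = [y for x, y in data if x > self.threshold]
        self.left_label = Counter(left).most_common(1)[0][0] if left else 0
        self.right_label = Counter(right).most_common(1)[0][0] if right else 1
        return self

    def predict(self, x):
        return self.left_label if x <= self.threshold else self.right_label

def bagging(data, n_iterations, n_bootstrap):
    """Train one weak learner per bootstrap sample; return the ensemble."""
    ensemble = []
    for _ in range(n_iterations):
        # Draw n_bootstrap samples uniformly *with replacement*.
        bootstrap = random.choices(data, k=n_bootstrap)
        ensemble.append(Stump().fit(bootstrap))
    return ensemble

def predict(ensemble, x):
    """Classify x by the majority vote of all learners in the ensemble."""
    votes = [learner.predict(x) for learner in ensemble]
    return Counter(votes).most_common(1)[0][0]

random.seed(0)
# Toy data: x in [0, 10) has label 0, x in [10, 20) has label 1.
data = [(x, 0) for x in range(10)] + [(x, 1) for x in range(10, 20)]
ensemble = bagging(data, n_iterations=25, n_bootstrap=len(data))
print(predict(ensemble, 2), predict(ensemble, 17))  # 0 1
```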

# References

- Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
- Javed A. Aslam, Raluca A. Popa, and Ronald L. Rivest. On estimating the size and confidence of a statistical audit. EVT, 7, 2007.
