"The set of points in $mathbb{R}^2$ classified ORANGE corresponds to

{$x:x^Tβ>0.5$}, indicated in Figure 2.1, and the two predicted classes

are separated by the decision boundary {$x:x^Tβ=0.5$}, which is linear

in this case. We see that for these data there are several

misclassifications on both sides of the decision boundary. Perhaps our

linear model is too rigid—or are such errors unavoidable? Remember

that these are errors on the training data itself, and we have not

said where the constructed data came from. Consider the two possible

scenarios:Scenario 1: The training data in each class were generated from

bivariate Gaussian distributions with uncorrelated components and

different means.Scenario 2: The training data in each class came from a mixture of 10

low- variance Gaussian distributions, with individual means themselves

distributed as Gaussian.A mixture of Gaussians is best described in terms of the generative

model. One first generates a discrete variable that determines which

of the component Gaussians to use, and then generates an observation

from the chosen density. In the case of one Gaussian per class, we

will see in Chapter 4 that a linear decision boundary is the best one

can do, and that our estimate is almost optimal. The region of overlap

is inevitable, and future data to be predicted will be plagued by this

overlap as well. In the case of mixtures of tightly clustered

Gaussians the story is different. A linear decision boundary is

unlikely to be optimal, and in fact is not. The optimal decision

boundary is nonlinear and disjoint, and as such will be much more

difficult to obtain."

Can someone please explain to me what these particular scenarios mean? It's from *The Elements of Statistical Learning* by Hastie, Tibshirani, and Friedman.


#### Best Answer

**In scenario 1,** there are two bivariate Normal distributions. Here I show two such probability density functions (PDFs) superimposed in a pseudo-3D plot. One has a mean near $(0,0)$ (at the left) and the other has a mean near $(3,3)$.

Samples are drawn independently from each. I took the same number ($300$) so that we wouldn't have to compensate for different sample sizes in evaluating these data.
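The setup for scenario 1 can be sketched in a few lines of NumPy. The means and covariance matrices below are illustrative assumptions chosen to resemble the description above, not the exact values behind the figures:

```python
import numpy as np

rng = np.random.default_rng(0)

# Scenario 1: one bivariate Normal per class, 300 draws each.
# Means near (0, 0) and (3, 3); the covariances are illustrative guesses.
n = 300
class1 = rng.multivariate_normal(mean=[0, 0], cov=[[1.0, 0.0], [0.0, 1.0]], size=n)
class2 = rng.multivariate_normal(mean=[3, 3], cov=[[1.2, 0.3], [0.3, 0.8]], size=n)
```

Plotting `class1` and `class2` with different point symbols reproduces the kind of scatter described here.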

*Point symbols distinguish the two samples. The gray/white background is the best discriminator: points in gray are more likely to arise from the second distribution than the first. (The discriminator is elliptical, not linear, because these distributions have slightly different covariance matrices.)*
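For two fully known Gaussians with equal priors, the best discriminator simply compares the two class densities at each point. A minimal sketch with SciPy, using the same illustrative parameters as above (assumptions, not the answer's actual values):

```python
from scipy.stats import multivariate_normal

# Frozen densities for the two classes (illustrative parameters).
pdf1 = multivariate_normal(mean=[0, 0], cov=[[1.0, 0.0], [0.0, 1.0]])
pdf2 = multivariate_normal(mean=[3, 3], cov=[[1.2, 0.3], [0.3, 0.8]])

def discriminate(x):
    """Bayes rule with equal priors: pick the class with the larger density."""
    return 2 if pdf2.pdf(x) > pdf1.pdf(x) else 1
```

When the two covariance matrices are equal, the log-density ratio is linear in $x$ and the boundary is a line; unequal covariances make it quadratic, which is why the boundary described above is elliptical rather than straight.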

**In scenario 2** we will look at two comparable datasets produced using *mixture distributions.* There are two mixtures. Each one is determined by ten distinct Normal distributions. They all have different covariance matrices (which I do not show) and different means. Here are the locations of their means (which I have termed "nuclei"):

As the book puts it: "A mixture of Gaussians is best described in terms of the generative model. One first generates a discrete variable that determines which of the component Gaussians to use, and then generates an observation from the chosen density."

To draw a set of independent observations from a mixture, you first pick one of its components at random and then draw a value from that component. The PDF of a mixture is a weighted sum of the PDFs of the components, with the weights being the chance of selecting each component in that first stage. Here are the PDFs of the two mixtures. I drew them with a little extra transparency so you can see them better in the middle where they overlap:
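That two-stage recipe translates directly into code. A minimal sketch, assuming NumPy; the nuclei and the component standard deviation here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# The "nuclei": ten component means, themselves drawn from a Gaussian.
nuclei = rng.multivariate_normal(mean=[0, 0], cov=np.eye(2), size=10)

def draw_from_mixture(n, sigma=0.3):
    """Two-stage generative model: pick a component, then draw from it."""
    # Stage 1: a discrete variable chooses a component (weight 1/10 each).
    which = rng.integers(0, 10, size=n)
    # Stage 2: an observation from the chosen low-variance Gaussian.
    return nuclei[which] + rng.normal(scale=sigma, size=(n, 2))

sample = draw_from_mixture(300)
```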

*To make the two scenarios easier to compare, the means and covariance matrices of these two PDFs were chosen to closely match the corresponding means and covariances of the two bivariate Normal PDFs used in scenario 1.*

To emulate scenario 2 (the mixture distributions), I drew samples of 300 independent values from each of the two mixtures by selecting each of their components with a probability of $1/10$ and then independently drawing a value from the selected component. Because the selection of components is random, the number of draws from each component was not always exactly $30 = 300 \times 1/10$, but it was usually close to that. Here is the result:
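The per-component counts follow a multinomial distribution, which is why they hover near 30 without hitting it exactly. A quick NumPy check:

```python
import numpy as np

rng = np.random.default_rng(0)

# 300 draws, each selecting one of 10 components with probability 1/10.
counts = rng.multinomial(300, [1 / 10] * 10)
# Each count is close to 30 (standard deviation about
# sqrt(300 * 0.1 * 0.9) ~= 5.2), but only the total is fixed at 300.
print(counts, counts.sum())
```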

*The black dots show the ten component means for each of the two distributions. Clustered around each black dot are approximately 30 samples. However, there is much intermingling of values, so it is impossible from this figure to determine which samples were drawn from which component.*

Recall the book's claim: "In the case of mixtures of tightly clustered Gaussians the story is different. A linear decision boundary is unlikely to be optimal, and in fact is not. The optimal decision boundary is nonlinear and disjoint, and as such will be much more difficult to obtain."

The background in that last figure is the best discriminator for these two mixture distributions. It is complicated because the distributions are complicated; obviously it is not just a line or smooth curve, such as appeared in scenario 1.
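The mixture discriminator follows the same density-comparison rule as in scenario 1, only with weighted-sum densities. A sketch with SciPy, using made-up nuclei and component variances (assumptions; the actual parameters behind the figures are not reproduced here):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)

# Illustrative nuclei for each class, scattered around (0, 0) and (3, 3).
nuclei_a = rng.multivariate_normal([0, 0], np.eye(2), size=10)
nuclei_b = rng.multivariate_normal([3, 3], np.eye(2), size=10)

def mixture_pdf(x, nuclei, sigma=0.3):
    """Equal-weight mixture density: the average of the component PDFs."""
    return np.mean([multivariate_normal(m, sigma**2 * np.eye(2)).pdf(x)
                    for m in nuclei])

def discriminate(x):
    """Bayes rule with equal priors: pick the class with the larger density."""
    return 'b' if mixture_pdf(x, nuclei_b) > mixture_pdf(x, nuclei_a) else 'a'
```

Evaluating `discriminate` on a grid traces out exactly the kind of complicated, possibly disconnected boundary described above: the set where the two mixture densities are equal need not be a single smooth curve.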

I believe the entire point of this comparison lies in our option, as analysts, to *choose* which model we want to use to analyze either one of these two datasets. Because we would not in practice know which model is appropriate, we could try using a mixture model for the data in scenario 1, and we could equally well try using a Normal model for the data in scenario 2. We would likely be fairly successful in either case due to the relatively low overlap (between blue and red sample points). Nevertheless, **the different (equally valid) models can produce distinctly different discriminators** (especially in areas where data are sparse).
