I have a qualifying exam later this week based on material similar to what is covered in Casella and Berger, and am studying past exams. It appears that past exams used other texts besides Casella and Berger (which I'm familiar with), and for the most part, I've been able to figure out the concepts and solve the problems in these situations. However, I've been stuck on this particular problem.
Below is a table giving the $\theta = 0$ and $\theta = 1$ pmfs $f(x \mid \theta)$ for a discrete random variable $X$.$$\begin{array}{|c|c|c|c|c|c|c|c|} \hline & x = 1 & x = 2 & x = 3 & x = 4 & x = 5 & x = 6 & x = 7 \\ \hline \theta = 1 & .10 & .20 & .15 & .10 & .05 & .30 & .10 \\ \theta = 0 & .20 & .10 & .20 & .05 & .10 & .15 & .20 \\ \hline \end{array}$$Suppose that a priori there is probability .6 that $\theta = 1$.
Identify a minimum error rate (minimum 0-1 loss Bayes risk)
classification rule for deciding between $\theta = 0$ and $\theta = 1$
based on $X$. (Give values of a decision rule $d(x)$ for $x = 1, \dots, 7$.)
Here's what I know:
- We've assigned a prior distribution to $\theta$, where $\pi(\theta) = \begin{cases} .6, & \theta = 1 \\ .4, & \theta = 0\text{.} \end{cases}$
- Zero-one loss is a loss function which is $1$ if you classify a point incorrectly, $0$ if you don't.
- An estimator $T = h(X)$ minimizes the Bayes risk if, at each fixed data point $x$ of the random variable $X$, the posterior expected loss $\mathbb{E}_{\theta \mid X = x}[L(h(x), \theta)]$, with $L$ being the loss function, is minimized (see the sketch after this list).
- The likelihood ratio $\Lambda(X) = \dfrac{f(X \mid 1)}{f(X \mid 0)}$ is a minimal sufficient statistic for $\theta$.
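To make the last two bullets concrete, here is a minimal Python sketch (my own, with the table values typed in by hand, not part of the original question) that computes the posterior pmf of $\theta$ for each $x$ from the prior $(.6, .4)$ and the table above:

```python
# Posterior pmf of theta given each x, from the prior and the likelihood table.
# List index 0 corresponds to x = 1, ..., index 6 to x = 7.
f1 = [.10, .20, .15, .10, .05, .30, .10]  # f(x | theta = 1)
f0 = [.20, .10, .20, .05, .10, .15, .20]  # f(x | theta = 0)
prior1, prior0 = .6, .4

for x, (p1, p0) in enumerate(zip(f1, f0), start=1):
    # Bayes' theorem: posterior is proportional to prior times likelihood.
    joint1, joint0 = prior1 * p1, prior0 * p0
    post1 = joint1 / (joint1 + joint0)
    print(f"x = {x}: P(theta = 1 | x) = {post1:.3f}, P(theta = 0 | x) = {1 - post1:.3f}")
```

Under 0-1 loss, the posterior expected loss of deciding $d(x) = t$ is $1 - f_{\theta \mid X}(t \mid x)$, so minimizing it at each $x$ means picking the value of $\theta$ with the larger posterior probability.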
I do have the solution to this problem: it starts off by saying
$$d(x) = \begin{cases}
1 & .6 \cdot f(x \mid 1) > .4 \cdot f(x \mid 0) \\
0 & .6 \cdot f(x \mid 1) < .4 \cdot f(x \mid 0)
\end{cases}$$
but I have no idea how the condition $.6 \cdot f(x \mid 1) > .4 \cdot f(x \mid 0)$ is derived. This appears to just be saying that $f_{\theta \mid X}(1 \mid x) > f_{\theta \mid X}(0 \mid x)$, where $f_{\theta \mid X}$ is the posterior pmf of $\theta$, but I don't see how this makes sense.
Best Answer
Any decision rule for the likelihood matrix $$\begin{array}{|c|c|c|c|c|c|c|c|} \hline & x = 1 & x = 2 & x = 3 & x = 4 & x = 5 & x = 6 & x = 7 \\ \hline H_1 & .10 & .20 & .15 & .10 & .05 & .30 & .10 \\ H_0 & .20 & .10 & .20 & .05 & .10 & .15 & .20 \\ \hline \end{array}$$ can be defined by marking (e.g. by making it boldface) one entry in each column of the likelihood matrix, say like $$\begin{array}{|c|c|c|c|c|c|c|c|} \hline & x = 1 & x = 2 & x = 3 & x = 4 & x = 5 & x = 6 & x = 7 \\ \hline H_1 & \mathbf{.10} & \mathbf{.20} & .15 & \mathbf{.10} & .05 & .30 & .10 \\ H_0 & .20 & .10 & \mathbf{.20} & .05 & \mathbf{.10} & \mathbf{.15} & \mathbf{.20} \\ \hline \end{array}$$ meaning that when $x$ equals $1, 2$, or $4$, we decide that $H_1$ is true, while if $x$ equals $3, 5, 6$, or $7$, we decide that $H_0$ is true. The false-alarm probability $P_{FA}$ (a.k.a. the probability of a Type I error), the probability of choosing $H_1$ when in fact $H_0$ is true, is the sum $(0.35)$ of the unbolded entries in the $H_0$ row, while the missed-detection probability $P_{MD}$ (a.k.a. the probability of a Type II error) is the sum $(0.6)$ of the unbolded entries in the $H_1$ row. (Hey, I never said that it was a good decision rule!)
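As a quick numerical check of those two sums, here is a minimal Python sketch (my own variable names, not part of the original answer):

```python
# Likelihood matrix rows; index 0 corresponds to x = 1.
f_H1 = [.10, .20, .15, .10, .05, .30, .10]
f_H0 = [.20, .10, .20, .05, .10, .15, .20]

# The (deliberately bad) example rule: decide H1 for x in {1, 2, 4}, else H0.
decide_H1 = {1, 2, 4}

# False alarm: deciding H1 when H0 is true (Type I error).
P_FA = sum(f_H0[x - 1] for x in decide_H1)
# Missed detection: deciding H0 when H1 is true (Type II error).
P_MD = sum(f_H1[x - 1] for x in range(1, 8) if x not in decide_H1)

print(P_FA, P_MD)  # 0.35 and 0.60, up to floating-point rounding
```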
The average error probability of this decision rule is $$P_e = P_{FA}P(H_0) + P_{MD}P(H_1) = 0.35 \times 0.4 + 0.6 \times 0.6 = 0.50,$$ and an easy way of visualizing this is to convert the likelihood matrix into the joint probability matrix by multiplying the entries in each row by the probability of the hypothesis, while retaining the boldfaces. This gives us $$\begin{array}{|c|c|c|c|c|c|c|c|} \hline & x = 1 & x = 2 & x = 3 & x = 4 & x = 5 & x = 6 & x = 7 \\ \hline H_1 & \mathbf{.06} & \mathbf{.12} & .09 & \mathbf{.06} & .03 & .18 & .06 \\ H_0 & .08 & .04 & \mathbf{.08} & .02 & \mathbf{.04} & \mathbf{.06} & \mathbf{.08} \\ \hline \end{array}$$ and $P_e$ is just the sum of all the unbolded entries in the joint probability matrix.
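The same bookkeeping in a self-contained sketch (again my own, with the likelihood rows re-entered so the snippet runs on its own):

```python
f_H1 = [.10, .20, .15, .10, .05, .30, .10]   # f(x | H1), x = 1..7
f_H0 = [.20, .10, .20, .05, .10, .15, .20]   # f(x | H0), x = 1..7
p_H1, p_H0 = 0.6, 0.4                        # prior probabilities of the hypotheses
decide_H1 = {1, 2, 4}                        # the example rule from above

# Joint probability matrix: scale each likelihood row by its hypothesis prior.
joint_H1 = [p_H1 * p for p in f_H1]
joint_H0 = [p_H0 * p for p in f_H0]

# P_e is the sum of the joint entries the rule does NOT pick: the H0 entries
# in the decide-H1 columns plus the H1 entries in the decide-H0 columns.
P_e = sum(joint_H0[x - 1] for x in decide_H1) + \
      sum(joint_H1[x - 1] for x in range(1, 8) if x not in decide_H1)

print(P_e)  # 0.50, up to rounding; the same as P_FA*P(H0) + P_MD*P(H1)
```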
So, which decision rule minimizes $P_e$? Well, we have to mark one entry in each column of the joint probability matrix and whichever one we don't mark contributes to $P_e$, and the answer is obvious:
mark the larger of the two entries in each column of the joint probability matrix!
The minimum-probability-of-error rule is thus $$\begin{array}{|c|c|c|c|c|c|c|c|} \hline & x = 1 & x = 2 & x = 3 & x = 4 & x = 5 & x = 6 & x = 7 \\ \hline H_1 & .06 & \mathbf{.12} & \mathbf{.09} & \mathbf{.06} & .03 & \mathbf{.18} & .06 \\ H_0 & \mathbf{.08} & .04 & .08 & .02 & \mathbf{.04} & .06 & \mathbf{.08} \\ \hline \end{array}$$ and it achieves a $P_e$ of $0.35$.
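Automating the column-by-column comparison gives the same rule and the same $P_e$; once more a minimal sketch rather than anything from the original answer:

```python
f_H1 = [.10, .20, .15, .10, .05, .30, .10]   # f(x | H1), x = 1..7
f_H0 = [.20, .10, .20, .05, .10, .15, .20]   # f(x | H0), x = 1..7
p_H1, p_H0 = 0.6, 0.4

d = {}      # d[x] = 1 means "decide H1", d[x] = 0 means "decide H0"
P_e = 0.0   # accumulated error probability

for x in range(1, 8):
    joint1 = p_H1 * f_H1[x - 1]   # P(x, H1)
    joint0 = p_H0 * f_H0[x - 1]   # P(x, H0)
    # Mark the larger joint entry; the smaller one contributes to P_e.
    d[x] = 1 if joint1 > joint0 else 0
    P_e += min(joint1, joint0)

print(d)    # d(x) = 1 for x = 2, 3, 4, 6 and d(x) = 0 for x = 1, 5, 7
print(P_e)  # 0.35, up to rounding
```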
We can gussy up this simple notion by saying that when we observe $x$, we decide in favor of $H_1$ precisely when $P(x, H_1) > P(x, H_0)$, that is, when $P(H_1 \mid x)P(x) > P(H_0 \mid x)P(x)$, which is the same as $P(H_1 \mid x) > P(H_0 \mid x)$:
MAP (maximum a posteriori probability) decision rule: Choose the hypothesis with the larger a posteriori probability
and arrive at the usual claim that the MAP decision rule minimizes the error probability, but why it does so is more intuitively obvious to me via the development above; ymmv.
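In terms of the notation in the question, the link to the condition that puzzled the asker is just Bayes' theorem:
$$P(H_1 \mid x) = \frac{.6\, f(x \mid 1)}{.6\, f(x \mid 1) + .4\, f(x \mid 0)}, \qquad P(H_0 \mid x) = \frac{.4\, f(x \mid 0)}{.6\, f(x \mid 1) + .4\, f(x \mid 0)},$$
so $P(H_1 \mid x) > P(H_0 \mid x)$ holds exactly when $.6 \cdot f(x \mid 1) > .4 \cdot f(x \mid 0)$, which is the condition in the quoted solution.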