I am in charge of presenting the results of A/B tests (run on website variations) at my company. We run each test for about a month, checking the p-values at regular intervals until we reach significance (or we abandon the test if significance is not reached after it has been running for a long time), which I am now finding out is a mistaken practice.
I want to stop this practice now, but to do that, I want to understand WHY this is wrong. I understand that the effect size, the sample size (N), the alpha significance criterion (α) and statistical power, or the chosen or implied beta (β) are mathematically related. But what exactly changes when we stop our test before we reach the required sample size?
I have read a few posts here (namely this, this and this), and they tell me that my estimates would be biased and the rate of my Type 1 error increases dramatically. But how does that happen? I am looking for a mathematical explanation, something that would clearly show the effects of sample size on the outcomes. I guess it has something to do with the relationships between the factors I mentioned above, but I have not been able to find out the exact formulas and work them out on my own.
For example, stopping the test prematurely increases the Type 1 error rate. Alright, but why? What exactly happens to increase the Type 1 error rate? I'm missing the intuition here.
Help please.
Best Answer
A/B tests that simply test repeatedly on the same data against a fixed Type 1 error level ($\alpha$) are fundamentally flawed, for at least two reasons. First, the repeated tests are correlated, yet each is conducted as if it were independent. Second, the fixed $\alpha$ does not account for the multiple tests being conducted, which inflates the Type 1 error.
To see the first, assume that you conduct a new test upon each new observation. Clearly any two subsequent p-values will be correlated, because $n-1$ of the $n$ cases have not changed between the two tests. Consequently we see a trend in @Bernhard's plot, which demonstrates this correlation of the p-values.
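To make this concrete, here is a small simulation sketch in the spirit of @Bernhard's code (the setup and variable names are mine, chosen purely for illustration): it reruns a binomial test after every additional coin flip under a true null and then measures how strongly consecutive p-values are correlated.

set.seed(2)
n <- 5000
toss <- sample(1:2, n, TRUE)                 # fair coin: the null hypothesis is true
p.values <- numeric(n)
for (i in 10:n) {                            # test again after every additional flip
  p.values[i] <- binom.test(sum(toss[1:i] == 1), i)$p.value
}
p.values <- p.values[-(1:9)]
cor(p.values[-length(p.values)], p.values[-1])   # close to 1: consecutive tests barely differ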
To see the second, note that even when the tests are independent, the probability of obtaining a p-value below $\alpha$ increases with the number of tests $t$: $$P(A) = 1-(1-\alpha)^t,$$ where $A$ is the event of falsely rejecting the null hypothesis at least once. So the probability of obtaining at least one positive test result approaches $1$ as you repeatedly A/B test. If you simply stop after the first positive result, you will merely have demonstrated the correctness of this formula. Put differently, even if the null hypothesis is true, you will ultimately reject it. Conducted in this way, the A/B test is thus the ultimate way of finding effects where there are none.
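As a sanity check on this formula, the following sketch (again my own illustrative code, not from the original post) simulates $t$ completely independent tests under the null and compares the empirical rate of seeing at least one $p < \alpha$ with $1-(1-\alpha)^t$.

alpha <- 0.05
t     <- 20                                  # number of independent tests
sims  <- 10000                               # simulated "studies", each with t tests
# under a true null, an independent p-value is uniform on (0, 1)
at.least.one <- replicate(sims, any(runif(t) < alpha))
mean(at.least.one)                           # empirical P(A), roughly 0.64
1 - (1 - alpha)^t                            # value given by the formula, about 0.64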
Since in this situation correlation and multiple testing occur at the same time, the p-value of test $t+1$ depends on the p-value of test $t$. So if you finally reach a $p < \alpha$, you are likely to stay in this region for a while. You can also see this in @Bernhard's plot in the regions of 2500 to 3500 and 4000 to 5000.
Multiple testing per se is legitimate, but testing against a fixed $\alpha$ is not. There are many procedures that deal with both multiple testing and correlated tests. One family of corrections is called family-wise error rate control. What these procedures do is ensure that $$P(A) \le \alpha.$$
The arguably most famous adjustment (owing to its simplicity) is Bonferroni. Here we set $$\alpha_{adj} = \alpha/t,$$ for which it can easily be shown that $P(A) \approx \alpha$ if the number of independent tests is large. If the tests are correlated, the adjustment is likely to be conservative, i.e. $P(A) < \alpha$. So the easiest adjustment you could make is to divide your $\alpha$ level of $0.05$ by the number of tests you have already made.
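To see the adjustment at work, here is a sketch (illustrative code of mine) that repeats the independent-tests simulation above, once testing each of the $t$ p-values against the fixed $\alpha$ and once against the Bonferroni level $\alpha/t$.

alpha <- 0.05
t     <- 20
sims  <- 10000
p <- matrix(runif(sims * t), nrow = sims)    # sims studies, t independent null p-values each
mean(apply(p, 1, function(x) any(x < alpha)))      # about 0.64: fixed alpha does not control P(A)
mean(apply(p, 1, function(x) any(x < alpha / t)))  # about 0.05: Bonferroni keeps P(A) <= alpha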
If we apply Bonferroni to @Bernhard's simulation and zoom in to the $(0, 0.1)$ interval on the y-axis, we obtain the plot below. For clarity I assumed that we do not test after each coin flip (trial) but only after every hundredth. The black dashed line is the standard $\alpha = 0.05$ cut-off and the red dashed line is the Bonferroni adjustment.
As we can see, the adjustment is very effective and demonstrates how radically we have to change the p-value cut-off to control the family-wise error rate. Specifically, we no longer find any significant test, as it should be, since @Bernhard's null hypothesis is true.
Having done this, we note that Bonferroni is very conservative in this situation because of the correlated tests. There are superior procedures that will be more useful here in the sense of achieving $P(A) \approx \alpha$, such as the permutation test. Also, there is much more to say about testing than simply referring to Bonferroni (e.g. look up the false discovery rate and related Bayesian techniques). Nevertheless, this answers your question with a minimum amount of math.
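If you want to experiment with such corrections yourself, base R already provides the p.adjust function; a minimal illustration with made-up p-values:

p.raw <- c(0.001, 0.012, 0.034, 0.045, 0.210)    # hypothetical raw p-values
p.adjust(p.raw, method = "bonferroni")           # family-wise error rate control (conservative)
p.adjust(p.raw, method = "BH")                   # Benjamini-Hochberg false discovery rate control

Note that "BH" controls the false discovery rate rather than the family-wise error rate, so it answers a slightly different question.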
Here is the code for the plot:
set.seed(1)
n <- 10000
toss <- sample(1:2, n, TRUE)                  # fair coin: the null hypothesis is true
p.values <- numeric(n)
for (i in 5:n) {                              # test again after every additional flip
  p.values[i] <- binom.test(table(toss[1:i]))$p.value
}
p.values <- p.values[-(1:6)]                  # drop the first entries (too few observations)
# plot only every 100th p-value, zoomed in to the (0, 0.1) interval
plot(p.values[seq(1, length(p.values), 100)], type = "l",
     ylim = c(0, 0.1), ylab = "p-values")
abline(h = 0.05, lty = "dashed")              # fixed alpha = 0.05 cut-off (black)
abline(v = 0)
abline(h = 0)
curve(0.05 / x, add = TRUE, col = "red", lty = "dashed")   # Bonferroni cut-off 0.05/t (red)