How do you calculate sample size for multiple treatments? By this, I mean two things:

- Two treatments and one control, where I want to compare T1 vs Control and T2 vs Control. Is the answer as simple as doing a power analysis for T1 vs Control and then using the same per-arm sample size for T2? I can't find a straight answer stating that anywhere (maybe it's too "obvious" for most people, but not for me)
- Three treatments, where I want to compare them all against each other (so there would be 3 comparisons in total). I imagine I'll have to change alpha to $.05/3$, right? The best solution I can find is a power-calculation formula for ANOVA, but ANOVA isn't exactly the type of analysis I will do. Instead, I will run 3 regressions, one for each pairwise comparison, not just a full ANOVA (ANOVA doesn't tell me WHICH mean is largest).

The programs I'm familiar with are GPower, PowerUp, and Stata's -power- command. I've seen some theoretical papers on this, but no simple practical guide. Why is that? Sociologically speaking, why are there so few practical guides or cookbooks on these extensions of power analysis? There are a million for the simple two-sample t-test, but none for multiple treatments or interactions. Is it because there is no consensus on how to do it well? Do academics in practice just ignore these nuances? I'm not necessarily trying to get the most rigorous statistical answer, just to understand what social scientists consider "good enough" in practice.

Just to show I did due diligence, I found these old posts that weren't detailed enough:

- ANCOVA vs multiple regression the same: so why different power analysis results?
- How to calculate power (or sample size) for a multiple comparison experiment?
- Gpower: calculate power of multiple regression analysis with between groups
- power analysis for factorial design


#### Best Answer

First, we have to think clearly about what tests we're going to conduct. I'm not a huge fan of using the linear probability model, although it can be done in this case, as you only have categorical explanatory variables. (Note that you cannot have a constant SD with binary data where the proportions differ.) Do you just want a test that the conditions differ? Do you have to follow that up with planned comparisons? What test for those? How do you want to account for multiple comparisons? You can't just 'do a power analysis' until a lot of decisions have been made. None of that is meant to be critical; I'm trying to point out why you can't just find a simple answer by Googling.

I'm not generally a fan of canned power analyses, unless the situation is very simple and directly maps onto a simple, canonical test. In general, I prefer to simulate the alternative hypothesis / data generating process that I am proposing and conduct the sequence of tests that I intend. This also helps me to think through the statistical analysis plan for the study, and helps me think about what the data could look like, what I might think about them, and what I'd conclude. There are a lot more nuances than people often realize. For a more detailed exposition, it may help you to read through my answer here: Simulation of logistic regression power analysis – designed experiments (the code is rather clunky, but hopefully easy to follow).
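To make that recipe concrete, here is a minimal sketch of simulation-based power for a single comparison (T1 vs C, using the .20 vs .15 probabilities that come up below). The per-arm $n = 1100$ and $\alpha = .025$ here are illustrative values I chose for the sketch, not part of the analysis plan; the full multi-arm simulation appears further down.

```r
# Minimal simulation-based power sketch for one two-proportion comparison.
# n = 1100 per arm and alpha = .025 are illustrative assumptions.
set.seed(1)
B <- 2000   # simulation iterations
n <- 1100   # patients per arm (assumed for illustration)
hits <- replicate(B, {
  yC  <- rbinom(n, 1, .20)                     # control: P(die) = .20
  yT1 <- rbinom(n, 1, .15)                     # treatment 1: P(die) = .15
  prop.test(c(sum(yC), sum(yT1)), c(n, n))$p.value < .025
})
mean(hits)  # estimated power = proportion of significant runs
```

The same skeleton generalizes: generate data under the alternative you care about, run the exact sequence of tests you plan to run, and count the proportion of runs in which your success criterion is met.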

Fortunately perhaps, your situation *does* correspond to simple analyses where it is easy to apply a canned power analysis. Specifically, if you just want to see if the three conditions differ, given that you only have three categorical conditions and the outcome data are binary (lived/died), this corresponds to a chi-squared test of a 2×3 contingency table. Alternatively, if you just want to test if $T_1$ differs from $C$, and if $T_2$ differs from $C$, you can conduct two $z$-tests of differences in proportions. Those won't be independent, so you might want to use a Bonferroni correction, in which case you simply use $\alpha = .025$ in your power analyses, and then use whichever $n$ is bigger. I can demonstrate these using the `pwr` library in `R`. (It might help you to work through the introductory vignette.)

First, I input the probabilities you have specified as the alternative hypothesis. Then I compute Cohen's measure of effect size, $w$, for a two-way contingency table. The contingency table will have $(r-1)(c-1)=2$ degrees of freedom, so we can simply get the required $N$ using the canned function `?pwr.chisq.test`:

```r
library(pwr)
#           C    T1   T2     # conditions
P = rbind(c(.20, .15, .10),  # prob die
          c(.80, .85, .90) ) # prob live
P = P/3;  P  # matrix of cell probabilities
#            [,1]      [,2]       [,3]
# [1,] 0.06666667 0.0500000 0.03333333
# [2,] 0.26666667 0.2833333 0.30000000
w = ES.w2(P=P);  w  # Cohen's measure of effect size w
# [1] 0.1143324
pwr.chisq.test(w=w, N=NULL, df=2, sig.level=.05, power=.80)
#      Chi squared power calculation 
# 
#               w = 0.1143324
#               N = 737.0537
#              df = 2
#       sig.level = 0.05
#           power = 0.8
# 
# NOTE: N is the number of observations
ceiling(737.0537/3)  # [1] 246
# you'll need n=246 participants in each condition
```

A different approach is simply to conduct two separate tests of the treatment conditions against the control. Since these aren't independent, we can test both against a lower alpha. Once again, first we stipulate the probabilities you want to be able to detect, then compute Cohen's measure of effect size, $h$. From there, it's easy to get the required $N$ from the canned function `?pwr.2p.test`:

```r
h1 = ES.h(.20, .15);  h1  # Cohen's measure of effect size h
# [1] 0.1318964
h2 = ES.h(.20, .10);  h2
# [1] 0.2837941
pwr.2p.test(h=h1, n=NULL, sig.level=0.025, power=.80)
#      Difference of proportion power calculation for binomial distribution
#      (arcsine transformation) 
# 
#               h = 0.1318964
#               n = 1092.743
#       sig.level = 0.025
#           power = 0.8
#     alternative = two.sided
# 
# NOTE: same sample sizes
pwr.2p.test(h=h2, n=NULL, sig.level=0.025, power=.80)
#      Difference of proportion power calculation for binomial distribution
#      (arcsine transformation) 
# 
#               h = 0.2837941
#               n = 236.0353
#       sig.level = 0.025
#           power = 0.8
#     alternative = two.sided
# 
# NOTE: same sample sizes
```

This route implies you'll need $1093$ participants in each condition. That's a lot of data! However, it may be closer to what you really want to be able to demonstrate. It's worth remembering at this point that there's very little information in a binary data point; there's even less the closer the probability gets to the upper or lower bound, and $.15$ is really close to $.20$ (although I acknowledge that every life is precious, so the small difference may nonetheless be clinically meaningful).

If you're really committed to using the linear probability model, and want to show that each condition differs from the others, we need to move to a simulation-based approach. How do you want to address the necessary heteroscedasticity? Among other options, you could use weighted least squares—I'll do that here. How do you want to conduct the multiple comparisons? There are lots of ways; in this case I'll use Tukey's test.

The power analyses above give me a ballpark estimate of where to start. This will require a lot of computation, so I take some steps to make it faster: I generate all the data and the weights ahead of time. I try to minimize the number of calculations that I'm asking R to perform, etc. Done this way, it only takes my old machine about 15 seconds. I'm assuming the analytical plan is to first determine if there is a significant global effect, and if so, you want to go further and show that all three conditions differ. Thus, you want four significant p-values for the study to be considered successful. That is, we are solving for *all-ways power* (see my linked answer at top).

```r
set.seed(906)  # this makes the example exactly reproducible
n     = 1093   # number of patients per arm
B     = 1000   # number of iterations in the simulation
p.mat = matrix(NA, nrow=4, ncol=B)       # matrix to store the p-values
cond  = rep(c("C", "T1", "T2"), each=n)  # condition variable
y.mat = matrix(c(rbinom(n*B, size=1, prob=.20),  # resulting data
                 rbinom(n*B, size=1, prob=.15),
                 rbinom(n*B, size=1, prob=.10) ),
               nrow=n*3, ncol=B, byrow=T)
w.mat = matrix(NA, nrow=n*3, ncol=B)     # matrix to store the weights
i2s = n+1;  i2e = 2*n;  i3s = (2*n)+1;  i3e = 3*n  # row indexes
for(j in 1:B){  # computing the weights
  w.mat[1:n,j]     = 1/(n*mean(y.mat[1:n,j]    )*(1-mean(y.mat[1:n,j]    )))
  w.mat[i2s:i2e,j] = 1/(n*mean(y.mat[i2s:i2e,j])*(1-mean(y.mat[i2s:i2e,j])))
  w.mat[i3s:i3e,j] = 1/(n*mean(y.mat[i3s:i3e,j])*(1-mean(y.mat[i3s:i3e,j])))
}
for(j in 1:B){  # fitting the models & storing the p-values
  m            = aov(y.mat[,j]~cond, weights=w.mat[,j])
  p.mat[1,j]   = summary(m)[[1]][1,5]   # global p-value
  p.mat[2:4,j] = TukeyHSD(m)$cond[,4]   # 3 p-values for the pairwise comparisons
}
## power: i.e., the proportion of runs where all p's were significant
mean(apply(p.mat, 2, function(j){ mean(j<.05)==1 }))  # [1] 0.676
```

With this analytical strategy, using $n = 1093$ in each arm ($N = 3279$), I estimate you have $\approx 68\%$ power to show that all three conditions differ from each other. If you want, you can search over larger $n$'s to find how many patients it would take to achieve $80\%$ power.
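To get a rough starting value for that search without re-running the slow simulation at every candidate $n$, one option (an approximation of my own, not part of the simulation above) is to treat the three pairwise $z$-tests as independent, give each a Bonferroni $\alpha = .05/3$, and approximate the all-ways power by the product of the per-comparison powers, using base R's `power.prop.test()`:

```r
# Rough analytic search for the per-arm n giving ~80% all-ways power.
# ASSUMPTION: the three pairwise tests are treated as independent, so
# all-ways power is approximated by the product of per-test powers;
# each test uses a Bonferroni-adjusted alpha of .05/3.
p <- c(C = .20, T1 = .15, T2 = .10)                 # P(die) per condition
pairs <- list(c("C","T1"), c("C","T2"), c("T1","T2"))

allways_power <- function(n) {
  pw <- sapply(pairs, function(pr)
    power.prop.test(n = n, p1 = p[pr[1]], p2 = p[pr[2]],
                    sig.level = .05/3)$power)
  prod(pw)  # independence approximation
}

n_star <- 1093                                      # ballpark from above
while (allways_power(n_star) < .80) n_star <- n_star + 10
n_star                # approximate per-arm n for ~80% all-ways power
allways_power(n_star)
```

Because Tukey's HSD and the dependence among the comparisons make the true all-ways power differ somewhat from this product, treat the $n$ this returns only as a starting point, and confirm it by re-running the simulation.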

### Similar Posts:

- Solved – Calculating sample size of a mixed ANOVA in R (2 between, 1 within variable)
- Solved – How does power analysis differ between paired sample and independent groups t-test
- Solved – Free internet or downloadable resources for sample size calculations
- Solved – R: Power analysis for a 2 by 2 within-within ANOVA interaction effect [pwr::pwr.f2.test()]