Recently, I have been reading about Propensity Score Matching.

If I have understood this correctly, Propensity Score Matching is used to construct control/treatment groups in scientific studies, in such a way that individuals in the control group are as similar as possible to individuals in the treatment group. In other words, an individual in one of these groups is "matched" to an equivalent individual in the other group. This is done to reduce the risk of "latent and unobserved" variables confounding the effects of the treatments, and to ensure that "apples are compared to apples" instead of "apples to oranges".

On the surface, this seems to be very important – after all, if we are testing the effects of some pharmaceutical drug on two similar groups of people, we would like to avoid the risk of one of these groups consisting primarily of Olympic athletes and the other of senior citizens (assuming the goal of the study is to compare the effects of the drug on similar groups of people).

**My Question:** Just to clarify – do most researchers attempt to implement some form of Propensity Score Matching when conducting these kinds of statistical studies? Is this a "**must**"?

If some form of Propensity Score Matching is not implemented properly relative to the objective of the study, does this pose a high risk of invalidating the statistical study? According to the Wikipedia article (https://en.wikipedia.org/wiki/Propensity_score_matching), Propensity Score Matching was popularized in the 1980s – does this suggest that statistical studies conducted prior to the 1980s were more likely to suffer from these kinds of undesired confounding effects?

#### Best Answer

Propensity score methods are one type of method used to adjust for confounding. There are several other methods that rely on different assumptions. Some of the most popular include difference-in-differences, which relies on an assumption about stability over time, and instrumental variable analysis, which relies on an assumption about randomization of some other variable. A third class of methods includes methods that rely on an assumption that all confounding variables have been measured. I highly recommend this 2020 article by Matthay et al. for a comparison of these methods.

Propensity score methods fall in the last of these classes, the one that assumes all confounding variables have been measured. Other methods also fall in this class, including regression adjustment, "g"-methods, and doubly-robust methods. These are all different ways of adjusting for confounding by measured covariates by conditioning on them in certain ways. They differ primarily in their statistical performance under various assumptions about the functional form of the treatment and outcome processes.

There are several ways to use propensity scores, including matching (which you described), weighting, subclassification, and regression adjustment, and there are ways to perform each of these methods without propensity scores. I mention all of this so that you see propensity scores as one particular implementation of methods that themselves are members of a broad class of methods that is one of several classes of methods one can use to adjust for confounding. Propensity score methods are not necessarily superior to any of them, and their ubiquity is likely a cultural artifact rather than truly justified by their statistical performance.
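To make two of these uses concrete, here is a minimal sketch of matching and weighting on an estimated propensity score. Everything in it is an assumption made for illustration: the data are simulated with a single measured confounder `x` and a known treatment effect of 2.0, the helper names (`fit_propensity`, `wmean`) are hypothetical, and a hand-rolled logistic regression stands in for whatever a real analysis would use. It is not how any particular package implements these methods, and it skips the balance checking and variance estimation a real analysis needs.

```python
import math
import random

random.seed(0)

# Simulated data (assumed for illustration): one measured confounder x drives
# both treatment assignment and the outcome. The true effect is 2.0.
TRUE_EFFECT = 2.0
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
t = [1 if random.random() < 1 / (1 + math.exp(-xi)) else 0 for xi in x]
y = [TRUE_EFFECT * ti + 3.0 * xi + random.gauss(0, 1) for ti, xi in zip(t, x)]


def mean(values):
    return sum(values) / len(values)


def fit_propensity(x, t, lr=1.0, iters=300):
    """Logistic regression of t on x by gradient ascent; returns P(t=1 | x)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = 0.0
        for xi, ti in zip(x, t):
            resid = ti - 1 / (1 + math.exp(-(b0 + b1 * xi)))
            g0 += resid
            g1 += resid * xi
        b0 += lr * g0 / len(x)
        b1 += lr * g1 / len(x)
    return [1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in x]


ps = fit_propensity(x, t)

# Unadjusted difference in means: confounded by x.
naive = (mean([yi for yi, ti in zip(y, t) if ti])
         - mean([yi for yi, ti in zip(y, t) if not ti]))

# 1) Matching: pair each treated unit with the control whose propensity score
#    is closest (1:1, with replacement). Targets the effect on the treated.
controls = [(p, yi) for p, yi, ti in zip(ps, y, t) if not ti]
matched = mean([
    yi - min(controls, key=lambda c: abs(c[0] - p))[1]
    for p, yi, ti in zip(ps, y, t) if ti
])

# 2) Weighting: inverse probability weights with normalized weighted means.
#    Targets the effect in the whole sample.
w = [1 / p if ti else 1 / (1 - p) for p, ti in zip(ps, t)]


def wmean(group):
    num = sum(wi * yi for wi, yi, ti in zip(w, y, t) if ti == group)
    den = sum(wi for wi, ti in zip(w, t) if ti == group)
    return num / den


ipw = wmean(1) - wmean(0)

print(f"naive={naive:.2f} matched={matched:.2f} ipw={ipw:.2f}")
```

In this simulation the effect is the same for everyone, so both adjusted estimates target the same number. In general, matching treated units to controls and weighting the full sample estimate different quantities, which is one reason the choice among these methods is not interchangeable.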

Here are a few reasons (and rebuttals) for why propensity scores may be popular:

- They are easy to implement (but only in their most basic, poorest performing way; to use them well requires extensive knowledge)
- They are easy to explain to lay audiences (but so are many methods that don't involve propensity scores, like other matching methods)
- They tend to be effective at removing bias due to confounding (but several methods are demonstrably better, especially better than propensity score methods as most commonly used)
- They separate the design and analysis phases, leading to more replicable research and decreasing model dependence (but when used poorly they can increase model dependence, and they are not immune to snooping and nefarious or misguided use)
- They are implemented in most statistical software (but so are many other methods, and they are implemented differently in each software package)
- They are a form of dimension reduction in high-dimensional datasets (but there are other ways to reduce dimensionality, and still propensity scores are used even to adjust for a few covariates)
- They rely less on modeling assumptions than regression-based methods (but there are many other methods that also allow for extreme flexibility with often improved performance)
- They sound fancy and make the analyst look sophisticated (but experienced statisticians can easily point out the errors amateur users constantly make)

(You might think I am biased against propensity scores, but check the propensity-scores tag and see my involvement. I'm also the author of several R packages to facilitate the use of propensity score methods.)

In my opinion, propensity scores are overused (or, at best, under-justified) in the medical literature. There are so many better performing and more sophisticated methods that rely on the same assumptions as propensity score methods do that are under-appreciated in medical research, often because the analysts and reviewers in medical research are not familiar with them. I hope to encourage people to consider propensity scores as *one* option in a vast sea of options, each of which has its own advantages and disadvantages that make it more or less suitable for a given problem. To decide which option is the best for a given problem requires the assistance of a statistician specially trained in the area of causal effect estimation.