I am trying to understand whether I should be more concerned about covariate measurement error in linear models that include dummy variables than in models with only continuous predictors. Say I have a simple linear model with one independent variable (IV) and one covariate:

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 Z_i + \epsilon_i$$

where $Y_i$ is the dependent variable, the error term is $\epsilon_i \sim N(0,\sigma_{\epsilon}^2)$, $X_i$ is my independent variable, and $Z_i$ is my covariate, measured with a stochastic measurement error $\eta_i \sim N(0,\sigma_{\eta}^2)$. Does the measurement error in the covariate affect the estimate of my IV's effect, $\beta_1$, any worse when the IV is dichotomous rather than continuous?

**Bonus**. How are the IV estimates affected by measurement error when there is 1) a systematic measurement error in the covariate, 2) more than one IV, 3) more than one covariate with independent errors, or 4) more than one covariate with correlated errors?

It seems to me there should not be a qualitative difference, but a colleague told me that ANCOVA results are very unstable with respect to measurement errors in the covariates, as opposed to regression. This worries me because I often use dummy variables in a regression where the covariates have a sizable measurement error.

*My attempt at a solution.*

Say $y$, $x_1, \dots, x_{n-1}$ and $z = z^* + u$ are particular realizations of the variables in the model, and $u$ is the measurement error.

Using the standard formula for the estimates, $$\hat\beta = (X^TX)^{-1}X^Ty,$$ so that (writing the inverse of the $3\times 3$ covariance matrix via its cofactors, with summation over repeated indices) $$\hat\beta_i = \frac{1}{2\,|\text{cov}(X,X)|}\sum_{j=1}^n \epsilon_{jpq}\,\epsilon_{ikl}\,\text{cov}_{pk}(X,X)\,\text{cov}_{ql}(X,X)\,\text{cov}_{j1}(X,y)$$

where the permutation symbol is $$\epsilon_{ijk} = \begin{cases} 1 & i,j,k = 1,2,3;\; 2,3,1\ \text{or}\ 3,1,2 \\ -1 & i,j,k = 3,2,1;\; 2,1,3\ \text{or}\ 1,3,2 \\ 0 & \text{otherwise.} \end{cases}$$

Now measurement error will only affect those $\hat\beta_i$ for which the covariance matrix element $\text{cov}_{jk}(X,\cdot)$ involves the covariate measured with error, and that does not seem to depend on whether the IV is dichotomous or continuous. The only other thing that could be affected is the determinant in the denominator, so we should probably be wary of the collinearity that can arise with many dummy variables; other than that, I don't see how using dichotomous variables can be any worse.
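A quick simulation can probe this intuition directly. The sketch below is my own illustration (all variable names and parameter values are arbitrary): it fits the same model once with a continuous IV and once with a dichotomised IV, in both cases using a covariate observed with error, so the two $\hat\beta_1$ estimates can each be compared against the true value of 1.2.

```
* sketch: does measurement error in the covariate hurt beta_1 more for a binary IV?
clear
set seed 123
set obs 5000

* continuous and dichotomised versions of the IV, plus a correlated covariate
gen Xc = rnormal(0,1)
gen Xb = (Xc > 0)                       // dichotomised version of the IV
gen Zstar = 0.5*Xc + rnormal(0,1)       // true covariate, correlated with the IV
gen Z = Zstar + rnormal(0,1)            // covariate observed with error

* outcomes generated from the true covariate, beta_1 = 1.2, beta_2 = 0.8
gen yc = 1 + 1.2*Xc + 0.8*Zstar + rnormal(0,1)
gen yb = 1 + 1.2*Xb + 0.8*Zstar + rnormal(0,1)

* regressions that (incorrectly) use the mismeasured covariate
reg yc Xc Z
reg yb Xb Z
```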


#### Best Answer

## How does measurement error in a binary predictor affect bias?

Starting from the single-variable case, we observe a binary variable $x$, which is the true predictor $X$ measured with error: $x = X + u$ (I drop the $i$ subscripts for convenience). For the sake of illustration, and because much of this work was done in that field, let's say $X$ is true disease status and $x$ is a doctor's diagnosis. Define the following quantities:

- $P$ is the proportion of people in the population who truly have the disease
- $\tilde{P}$ is the proportion of people diagnosed with the disease by our doctor, and $\tilde{Q} = 1-\tilde{P}$ is the proportion diagnosed as healthy
- $\eta$ is the probability that a person who truly has the disease is classified as not having it
- $\nu$ is the probability that a person who is truly healthy is classified as having the disease

The errors-in-variables framework then links the observed to the true prevalence via $$\tilde{P} = (1-\eta)P + \nu(1-P),$$ which allows misclassification in both directions. From this set-up the marginal distributions of $X$ and $x$ are Bernoulli with parameters $P$ and $\tilde{P}$, respectively.

Savoca (2000) derives the quantities needed for evaluating the bias of OLS: $$\begin{align} E(u) &= \nu - (\eta + \nu)P \\ Var(u) &= \nu + (\eta - \nu)P - \left[\nu - (\eta + \nu)P\right]^2 \\ Cov(X,u) &= -(\eta + \nu)P(1-P) \end{align}$$
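As a sanity check, these moments are easy to verify by simulation. The sketch below uses purely illustrative values ($P = 0.3$, $\eta = 0.1$, $\nu = 0.05$), for which the formulas give $E(u) = 0.005$ and $Cov(X,u) = -0.0315$:

```
* verify the moment formulas with illustrative values P = 0.3, eta = 0.1, nu = 0.05
clear
set seed 42
set obs 100000
gen X = rbinomial(1, 0.3)                     // true status
gen x = X
replace x = 0 if X==1 & runiform() < 0.10     // eta: diseased misclassified as healthy
replace x = 1 if X==0 & runiform() < 0.05     // nu: healthy misclassified as diseased
gen u = x - X                                 // misclassification error
sum u                                         // mean should be close to 0.005
corr X u, cov                                 // covariance should be close to -0.0315
```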

So, compared with classical measurement error in a continuous explanatory variable, the error here does not have zero mean unless $E(x) = E(X) = P$, which happens only when the two types of misclassification exactly cancel, i.e. $\nu(1-P) = \eta P$.

The probability limit of the OLS coefficient in the above regression with one mismeasured binary regressor is $$\widehat{\beta} = \beta \left[ \frac{P(1-P)(1-\nu-\eta)}{\tilde{P}(1-\tilde{P})} \right].$$

The resulting bias, as in any other measurement error case, is towards zero. This has been shown as early as Aigner (1973).
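To get a feel for the magnitude, one can plug illustrative numbers into this expression. With the hypothetical rates used in the check above ($P = 0.3$, $\eta = 0.1$, $\nu = 0.05$), we get $\tilde{P} = 0.9 \cdot 0.3 + 0.05 \cdot 0.7 = 0.305$ and hence $$\widehat{\beta} \approx \beta \left[\frac{0.3 \cdot 0.7 \cdot 0.85}{0.305 \cdot 0.695}\right] \approx 0.84\,\beta,$$ i.e. an attenuation of roughly 16% even with fairly modest misclassification rates.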

## Is measurement error worse for binary or continuous predictors?

In terms of whether measurement error in a binary variable is worse than that in a continuous variable it is not necessarily obvious which case has a larger bias. Consider the following simulation exercise (using Stata). First we try the errors-in-variables framework with a binary predictor:

```
set seed 777
set obs 1000

* suppose the true P = 0.44
* generate our true X
gen X = rbinomial(1, 0.44)

* generate some error to be used in constructing the observed x
gen e = rnormal(0,1)
gen error = (e>2) | (e<-1.5)

* generate the observed x (with error)
gen x = X
replace x = 0 if X==1 & error==1
replace x = 1 if X==0 & error==1

* generate the dependent variable with true beta = 1.2
gen eps = rnormal(0,1)
gen y = 1 + 1.2*X + eps

* regression with measurement error
reg y x
```

The result is

```
      Source |       SS       df       MS              Number of obs =    1000
-------------+------------------------------           F(  1,   998) =  180.45
       Model |  208.290565     1  208.290565           Prob > F      =  0.0000
    Residual |  1152.00081   998  1.15430943           R-squared     =  0.1531
-------------+------------------------------           Adj R-squared =  0.1523
       Total |  1360.29137   999  1.36165303           Root MSE      =  1.0744

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .9171922   .0682789    13.43   0.000     .7832055    1.051179
       _cons |   1.109043   .0458538    24.19   0.000     1.019062    1.199024
------------------------------------------------------------------------------
```

So that's pretty far off the true value. The correlation between $X$ and $x$ here is 0.8 and if we consider a similar correlation for their continuous versions,

```
* specify a correlation matrix
matrix C = (1, 0.8, 0 \ 0.8, 1, 0 \ 0, 0, 1)

* simulate the data (fortunately here we can use a Stata function rather than doing it all by hand)
corr2data X x e, n(10000) means(0.5 0.5 0) sds(0.5 0.5 1) corr(C)
gen y = 1 + 1.2*X + e
```

the result is

```
. reg y x

      Source |       SS       df       MS              Number of obs =   10000
-------------+------------------------------           F(  1,  9998) = 2039.25
       Model |  2303.76962     1  2303.76962           Prob > F      =  0.0000
    Residual |  11294.8704  9998  1.12971298           R-squared     =  0.1694
-------------+------------------------------           Adj R-squared =  0.1693
       Total |    13598.64  9999        1.36           Root MSE      =  1.0629

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |        .96   .0212587    45.16   0.000     .9183288    1.001671
       _cons |       1.12   .0150318    74.51   0.000     1.090535    1.149465
------------------------------------------------------------------------------
```

which is only marginally closer to the true value than what we had in the binary case. However, what happens in practice depends on several factors. In the binary case you can already see how many parameters affect the resulting bias: the true and observed prevalence of the disease and the error rates within each class.

In this sense you can probably come up with settings of $P$, $\tilde{P}$, $\nu$, and $\eta$ that bring the coefficient much closer to (or push it much further from) the true value than the coefficient of a continuous variable with measurement error. For the latter, only the noise-to-signal ratio matters for the size of the bias, i.e. how large the measurement error variance is relative to the variance of the true predictor.
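As a rough illustration of that point, the sketch below (my own construction, with hypothetical rates) tabulates the attenuation factor from the formula above for a few values of $\eta$, holding $P = 0.44$ and $\nu = 0.05$ fixed:

```
* how the attenuation factor varies with the false-negative rate eta
* (hypothetical values; P and nu held fixed)
local P = 0.44
local nu = 0.05
foreach eta in 0.02 0.05 0.10 0.20 {
    local Pt = (1-`eta')*`P' + `nu'*(1-`P')
    local factor = `P'*(1-`P')*(1-`nu'-`eta') / (`Pt'*(1-`Pt'))
    display "eta = `eta'  ->  attenuation factor = " %5.3f `factor'
}
```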

## What happens if the measurement error in the binary variable is systematic?

If you want to investigate systematic measurement error in the binary setting (which is already non-standard), simply set $\nu = 0$ or $\eta = 0$ to create a scenario in which the classification error goes in one direction only. I am not aware, though, of a paper that looks at exactly such a setting.
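For illustration, here is a minimal variant of the earlier simulation with $\nu = 0$ and $\eta > 0$, i.e. healthy people are never mislabelled but some of the truly diseased are recorded as healthy (all numbers are arbitrary):

```
* one-directional (systematic) misclassification: nu = 0, eta = 0.15
clear
set seed 99
set obs 10000
gen X = rbinomial(1, 0.44)                    // true status
gen x = X
replace x = 0 if X==1 & runiform() < 0.15     // only false negatives occur
gen y = 1 + 1.2*X + rnormal(0,1)
reg y x                                       // compare the coefficient with the true 1.2
```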

## What happens to other explanatory variables?

In the binary case with measurement error it can be shown that this bias affects all other explanatory variables unless they are uncorrelated with the mismeasured binary predictor. The relevant reference for this would again be Savoca (2000).
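A small sketch of this spillover (again with arbitrary numbers of my own choosing): a correctly measured covariate $w$ that is correlated with the mismeasured binary regressor also ends up with a biased coefficient.

```
* spillover of the bias to a correctly measured covariate
clear
set seed 2024
set obs 10000
gen X = rbinomial(1, 0.44)                    // true binary regressor
gen w = 0.7*X + rnormal(0,1)                  // correlated covariate, measured without error
gen x = X
replace x = 1 - X if runiform() < 0.10        // 10% symmetric misclassification
gen y = 1 + 1.2*X + 0.5*w + rnormal(0,1)
reg y x w                                     // both coefficients are biased
reg y X w                                     // benchmark without measurement error
```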

Regarding the final questions, "3) more than one covariate with independent errors, 4) more than one covariate but with correlated errors", I am not sure whether this means that two (or more) covariates are all measured with error, or that one covariate is measured with error that is correlated with the regression error. In the latter case we are back in the non-standard errors-in-variables framework. With multiple covariates measured with error, the bias will depend on each variable's measurement error and on the correlation between the covariates. If all covariates are uncorrelated and continuous, we can assess their measurement errors separately; if they are not continuous, we are again in the Savoca framework.

As concerns question 4), I am not aware of a paper that derives the exact bias for such a case. Several mismeasured covariates with correlated errors is presumably a very complex case in which it will be difficult to find a closed-form expression for the bias unless one is willing to make strong assumptions about the relationship between the errors and their distributions.
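Short of a closed-form result, a simulation sketch at least shows what such a scenario can do. The one below is entirely illustrative (two continuous covariates whose measurement errors are correlated, plus a binary IV correlated with one of them); the estimates can be compared against the true coefficients 1.2, 0.8 and 0.5:

```
* two covariates measured with correlated errors (purely illustrative numbers)
clear
set seed 7
set obs 10000
matrix C = (1, 0.5 \ 0.5, 1)
drawnorm u1 u2, corr(C)                       // correlated measurement errors
gen z1 = rnormal(0,1)                         // true covariates, themselves correlated
gen z2 = 0.4*z1 + rnormal(0,1)
gen x = (z1 + rnormal(0,1) > 0)               // binary IV, correlated with z1
gen z1obs = z1 + u1                           // observed (mismeasured) covariates
gen z2obs = z2 + u2
gen y = 1 + 1.2*x + 0.8*z1 + 0.5*z2 + rnormal(0,1)
reg y x z1obs z2obs                           // all three coefficients are affected
```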

**Caveat:** despite its length, I am sure this answer does not address all your questions to the extent that would have helped you most. It is not possible to consider all scenarios in equal detail, so I tried to focus on the aspects I thought were most important and to point to the relevant literature for the other parts.
