I am trying to estimate a selection model of the form:

$Z_i = 1[\alpha_0 + \alpha_1 X_{1,i} + \alpha_2 X_{2,i} + \delta_i > 0]$

$Y_i = \beta_0 + \beta_1 X_{1,i} + Z_i + \epsilon_i$

where $1[]$ denotes the indicator function.

The purpose of the model is to calculate the indirect effect of $X_1$ on $Y$ through $Z$, as well as the direct effect.

My first question is how to go about estimating this type of model, and how this estimation can be achieved in R. As far as I see it I have a few possible approaches:

(1) Use a standard Heckman selection model, using OLS for both the reduced form and structural equations, using ivreg() in R. This will obviously ignore the constraint that $Z$ is bounded between 0 and 1.

(2) Estimate the first stage with a probit model (i.e. $\delta_i \sim N(0,1)$), and the second stage using standard OLS. I understand that I could do this via manual 2SLS, but as far as I am aware the standard errors will be incorrect? Am I right in that this model is feasible, and if so, can you direct me to a method of achieving this in R?

(3) Build a switching regression model (tobit-5) using the selection() function from the sampleSelection package in R. I believe this model will estimate two equations for $Y$, one for where $Z_i=0$ and one for where $Z_i=1$, with a unique intercept and coefficients for each of the regressors in the outcome equations.

The question then is how to get an estimate of the indirect effect of $X_1$ for each of these methods.

If I use (1) or (2) then I imagine it might be possible to calculate the average marginal effect of $Z$ on $Y$, and the average marginal effect of $X_1$ on $Z$, then approximate the indirect effect by multiplying the two values?

If (3), then could I take the fitted values under the estimated model for $Y$ where $Z=0$, and compare their mean to the mean of the fitted values under the estimated model for $Y$ where $Z=1$? This would then give me an estimate of the marginal effect of $Z$? Then use the same method as above and multiply this effect by the marginal effect of $X_1$ on $Z$?

Many thanks in advance!

#### Best Answer

**First, some comments on estimation strategy:**

In the adult Wooldridge text (p. 623 in 1e, p. 939 in 2e), he suggests estimating a probit and calculating the predicted $\hat z$. You then use $\hat z$ as an instrument for the actual $z$ in ivreg(). He recommends doing this over manual two-stage least squares with a probit first stage, for improved efficiency.
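A minimal sketch of that procedure on simulated data may help make it concrete. Everything here is an assumption for illustration: the parameter values, the variable names (`x1`, `x2`, `z`, `y`), and the use of a hand-rolled just-identified IV formula in base R in place of ivreg(), so that the example runs without extra packages.

```r
# Simulated data; all parameter values are illustrative assumptions.
set.seed(1)
n  <- 5000
x1 <- rnorm(n)
x2 <- rnorm(n)                       # excluded from the outcome equation
u  <- rnorm(n)                       # shared shock: makes z endogenous
delta   <- (u + rnorm(n)) / sqrt(2)  # N(0, 1), as the probit assumes
epsilon <- u + rnorm(n)
z <- as.numeric(0.2 + 0.5 * x1 + 1.0 * x2 + delta > 0)
y <- 1 + 0.5 * x1 + 2 * z + epsilon  # true coefficient on z is 2

# Step 1: probit for z, then the fitted probabilities.
probit <- glm(z ~ x1 + x2, family = binomial(link = "probit"))
zhat   <- fitted(probit)

# Step 2: use zhat as the instrument for z (just-identified IV),
# b = (W'X)^{-1} W'y, rather than plugging zhat in as a regressor.
X <- cbind("(Intercept)" = 1, x1 = x1, z = z)
W <- cbind("(Intercept)" = 1, x1 = x1, zhat = zhat)
b_iv  <- setNames(drop(solve(crossprod(W, X), crossprod(W, y))), colnames(X))
b_ols <- coef(lm(y ~ x1 + z))        # naive OLS, for comparison

print(round(rbind(iv = b_iv, ols = b_ols), 3))
```

If the AER package is installed, `ivreg(y ~ x1 + z | x1 + zhat)` should reproduce the same point estimates and also report standard errors, which the matrix formula above does not.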

Also, ordinary IV (ignoring that the endogenous variable is binary) should still be consistent and have correct standard errors. The estimates won't be as efficient (precise) as those from an MLE (or two-step) two-equation system (a probit selection equation with an OLS outcome equation). The extra precision of the systems approach can be useful, especially if your sample is small. But if the binary selection equation is misspecified, the systems approach will no longer be consistent for the main-equation parameters that you care about. That's the cost of imposing all that extra structure. I think the Wooldridge procedure is a nice intermediate solution between the two, but I would do all three to check robustness.

**Now an answer to your questions:**

I think the marginal effect of Z on Y is just the coefficient on Z in the main equation.
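Under that reading, the product-of-effects calculation proposed in the question can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated with made-up parameter values, the outcome equation is linear in $Z$, and the coefficient on $Z$ comes from the same hand-rolled IV formula (fitted probit probabilities as the instrument) rather than from ivreg().

```r
# Simulated data; all parameter values are illustrative assumptions.
set.seed(2)
n  <- 5000
x1 <- rnorm(n)
x2 <- rnorm(n)
u  <- rnorm(n)
delta   <- (u + rnorm(n)) / sqrt(2)
epsilon <- u + rnorm(n)
z <- as.numeric(0.2 + 0.5 * x1 + 1.0 * x2 + delta > 0)
y <- 1 + 0.5 * x1 + 2 * z + epsilon

# Average marginal effect of x1 on P(z = 1) from the probit:
# mean over observations of phi(index_i) * alpha_1.
probit <- glm(z ~ x1 + x2, family = binomial(link = "probit"))
index  <- predict(probit, type = "link")
ame_x1 <- mean(dnorm(index)) * coef(probit)[["x1"]]

# Effect of z on y: coefficient on z, with zhat as the instrument for z.
zhat <- fitted(probit)
X <- cbind("(Intercept)" = 1, x1 = x1, z = z)
W <- cbind("(Intercept)" = 1, x1 = x1, zhat = zhat)
b_iv <- setNames(drop(solve(crossprod(W, X), crossprod(W, y))), colnames(X))

direct   <- b_iv[["x1"]]             # effect of x1 holding z fixed
indirect <- b_iv[["z"]] * ame_x1     # effect of x1 operating through z
print(round(c(direct = direct, indirect = indirect,
              total = direct + indirect), 3))
```

Because the indirect effect is a product of two estimates, its standard error does not come from either model's output; a bootstrap over both steps or the delta method would be the usual ways to get one.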

If you care about the effect of X on Y, why not just run that regression directly?

BTW, usually the endogenous variable is X and the instrument is Z, but I followed your notation in the answer.