I am confused about how to calculate $\beta$ in a probit/logit model.

Probit model:

$P(Y_i=1)=\Phi(X'\beta)$

Logit model:

$P(Y_i=1)=\dfrac{e^{X'\beta}}{1+e^{X'\beta}}$
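For a given $\beta$ I can evaluate these probabilities, e.g. in Python (the $X$ and $\beta$ below are made-up values, just to check I am reading the formulas right):

```python
import numpy as np
from scipy.stats import norm

# Made-up design matrix X (first column = intercept) and coefficient vector beta
X = np.array([[1.0, 0.5],
              [1.0, -1.2],
              [1.0, 2.0]])
beta = np.array([0.3, 0.8])

z = X @ beta                            # linear index X'beta per observation

p_probit = norm.cdf(z)                  # Phi(X'beta)
p_logit = np.exp(z) / (1 + np.exp(z))   # e^{X'beta} / (1 + e^{X'beta})

print(p_probit)
print(p_logit)
```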

These formulas are great, but how do I calculate the $\beta$'s in the model? What is a proper estimator $\hat{\beta}$?


#### Best Answer

It is hard to answer your question, as you do not say what background you have. Here is a general answer.

You want to find $\mathbb{P}[Y_i = 1 \mid X_i]$ when you have a bunch of data that comes from the same distribution, that is, the result ($Y_i$) is influenced in the same way by the inputs ($X_i$) for each sample you have. Read more: I.I.D. (Wikipedia).

You know $X_i$ and $Y_i$ for your samples, and you want to find a function that maps the inputs to the outputs as closely as possible. This is what the probit/logit models do, and they have a parameter $\beta$ that dictates the influence of the different input variables. To find the best $\beta$, we first have to define what *the best* means.

In linear regression models, the *best* model is often defined as the one with the highest $R^2$ against the output, or the one with the smallest Mean Squared Error (Wikipedia). In classification tasks, a common measure is the Logistic loss (Wikipedia).
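As a small illustration, here is a sketch of the logistic loss in Python (NumPy); the labels and predicted probabilities are made up, just to show that predictions close to the labels incur a smaller loss:

```python
import numpy as np

def log_loss(y, p, eps=1e-12):
    """Logistic (cross-entropy) loss, averaged over the sample."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy example: true labels and two sets of predicted probabilities
y = np.array([1, 0, 1, 1])
good = np.array([0.9, 0.1, 0.8, 0.7])  # close to the labels -> small loss
bad = np.array([0.4, 0.6, 0.3, 0.5])   # far from the labels -> larger loss

print(log_loss(y, good))  # ~0.20
print(log_loss(y, bad))   # ~0.93
```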

The value of the loss function is $L(y, f(x,\beta))$, where $L$ is the loss function, $x, y$ are the input and output variables, and $\beta$ holds the parameters of your model. It compares your model's predictions with the real results $y$ and gives you a metric to evaluate your model. You want the lowest error with respect to $\beta$, $$\hat{\beta} = \arg\min_{\beta} L(y, f(x,\beta))$$

Depending on your model and cost function, you can set the derivative to zero and solve for the $\beta$ that achieves it based on the data. For the transformed models you are interested in, this is not possible: as mentioned in the comments, no *closed-form solution* exists. However, you can come very close to the optimal solution by using gradient descent (Wikipedia). The idea is that the derivative of the cost function at a given $\beta$ points in the direction of steepest increase with respect to $\beta$; if you move in the opposite direction, your $\beta$ improves and the error decreases. As long as your cost function is *convex*, you will find the global minimum, the optimal $\beta$.
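To make this concrete, here is a minimal gradient-descent sketch for the logit model in Python (NumPy only; the data are simulated, and the step size and iteration count are arbitrary choices, not tuned values). It uses the fact that the gradient of the average logistic loss is $\frac{1}{n}X'(\sigma(X\beta) - y)$, where $\sigma$ is the logistic function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 500 samples, intercept + one regressor, true beta = (0.3, 0.8)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.3, 0.8])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

beta = np.zeros(2)            # starting point
lr = 0.5                      # step size (arbitrary)
for _ in range(2000):         # iteration count (arbitrary)
    p = sigmoid(X @ beta)
    grad = X.T @ (p - y) / n  # gradient of the average logistic loss
    beta -= lr * grad         # step against the gradient

print(beta)  # should land close to beta_true
```

In practice, statistical packages solve the same maximum-likelihood problem with faster second-order methods (Newton-Raphson / iteratively reweighted least squares); `statsmodels.api.Logit(y, X).fit()` is one such routine.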

How to find the parameters of logistic regression is covered in the introduction of many books and classes on machine learning. For a more specific answer, I'd advise looking at

- Elements of Statistical Learning, by Hastie, Tibshirani and Friedman (Book website, free download)
- Pattern Recognition and Machine Learning, by Christopher Bishop.
- Machine Learning: A Probabilistic Perspective, by Kevin Murphy.
- Andrew Ng's MOOC
- Any other book/MOOC/class on statistical/machine learning will introduce this concept.

You can also take a look at questions on a similar topic on CrossValidated.