I have X and Y variables, as well as a cluster variable (State). X and State are derived from Database A, while Y and State are derived from Database B.
X is a sentiment score ranging between -1 and 1, while Y is a yes or no (0 or 1) response.
In Database A, I aggregate X into average-X by state, while in Database B, I aggregate Y into percentage-Y by state. Then I combine the two datasets as follows:
In the combined data structure, my new outcome is percentage-Y, while I do have the numerator and denominator that give rise to percentage-Y.
I have read the following advice: "The most natural way fractional responses arise is from averaged 0/1 outcomes. In such cases, if you know the denominator, you want to estimate such models using standard probit or logistic regression".
Since I do have the denominator, it seems I can avoid fractional outcome regression and use standard logistic regression instead.
However, how exactly can I model a logistic regression based on the denominator information?
First note that if you know the percentage and the denominator, then you also know the numerator. So, for example, if you know for a specific class (in your example, the class is state) that the proportion of positive cases is $0.6$, and the denominator of that proportion is $10$, then you immediately know that there are
- 6 positive (y = 1) cases in that class.
- 4 negative (y = 0) cases in that class.
With this information you can, in principle, create a new dataset expanding your grouped data. In this example you would end up with
- 6 rows for the class with $y = 1$.
- 4 rows for the class with $y = 0$.
Now you can use this new dataset to fit a logistic regression.
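The expansion described above can be sketched in a few lines of Python. This is only an illustration: the function name `expand_grouped`, the record layout `(state, avg_x, proportion, denominator)`, and the example values are assumptions, not part of the question's data.

```python
def expand_grouped(groups):
    """Turn (state, avg_x, proportion, denominator) records into 0/1 rows."""
    rows = []
    for state, avg_x, proportion, denominator in groups:
        # Recover the numerator; round() guards against floating-point error.
        positives = round(proportion * denominator)
        negatives = denominator - positives
        rows.extend((state, avg_x, 1) for _ in range(positives))
        rows.extend((state, avg_x, 0) for _ in range(negatives))
    return rows


# Hypothetical state "A": average sentiment 0.4, 60% "yes" out of 10 responses.
groups = [("A", 0.4, 0.6, 10)]
rows = expand_grouped(groups)
```

Each resulting row carries the state-level average of X together with a single 0/1 outcome, so it can be passed to any ordinary logistic regression routine.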
In practice, you simply observe that each row in this imaginary expanded data set contributes one term to the log-likelihood (the negative of the log-loss that logistic regression minimizes)
$$ L = \sum_i \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] $$
and each of the expanded rows in a class where $y = 1$ contributes the same amount, with the same thing holding for the rows where $y = 0$. So, instead of actually physically creating the expanded data set, we can just apply integer weights to the terms in our loss function
$$ L = \sum_i \left[ w_i y_i \log(p_i) + w_i' (1 - y_i) \log(1 - p_i) \right] $$
where the sum now runs over one positive ($y_i = 1$) and one negative ($y_i = 0$) row per class, and $w_i$ and $w_i'$ are the numbers of positive and negative cases in that class.
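As a rough check of this equivalence, here is a minimal pure-Python sketch that fits a one-predictor logistic regression by Newton's method, once on weighted grouped rows and once on the physically expanded rows. The function `fit_logistic` and the four groups of `(avg_x, positives, negatives)` are hypothetical, chosen only for illustration; in practice you would more likely pass the counts as weights to a standard statistics package.

```python
import math


def fit_logistic(rows, n_iter=25):
    """Maximize the weighted log-likelihood with Newton's method.

    rows: list of (x, y, weight) with y in {0, 1}.
    Returns (intercept, slope) for p = 1 / (1 + exp(-(a + b * x))).
    """
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        # Gradient (g) and negated Hessian (h) of the log-likelihood.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, w in rows:
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            g0 += w * (y - p)
            g1 += w * (y - p) * x
            v = w * p * (1.0 - p)
            h00 += v
            h01 += v * x
            h11 += v * x * x
        # Newton step: solve the 2x2 system h * delta = g.
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Hypothetical grouped data: (average sentiment, positives, negatives).
grouped = [(-0.5, 2, 8), (0.0, 5, 5), (0.4, 6, 4), (0.8, 9, 1)]

# Weighted form: two rows per class, weighted by the counts w and w'.
weighted = []
for x, pos, neg in grouped:
    weighted.append((x, 1, pos))
    weighted.append((x, 0, neg))

# Expanded form: one unit-weight row per original 0/1 response.
expanded = []
for x, pos, neg in grouped:
    expanded.extend([(x, 1, 1)] * pos)
    expanded.extend([(x, 0, 1)] * neg)

fit_weighted = fit_logistic(weighted)
fit_expanded = fit_logistic(expanded)
```

Both fits maximize the same function, so `fit_weighted` and `fit_expanded` should agree up to floating-point rounding, while the weighted version only ever touches two rows per class.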