Solved – How to fit a standard logistic regression to a fractional response variable when the denominator is available

I have X and Y variables, as well as a cluster variable (State). X and State are derived from Database A, while Y and State are derived from Database B.

X is a sentiment score ranging between -1 and 1, while Y is a yes or no (0 or 1) response.

In Database A, I aggregate X into average-X by state, while in Database B, I aggregate Y into percentage-Y by state. Then I merge the two datasets on State, so the combined data has one row per state.

In the combined data, my new outcome is percentage-Y, and I also have the numerator and denominator that produce it.

I have read the following advice: "The most natural way fractional responses arise is from averaged 0/1 outcomes. In such cases, if you know the denominator, you want to estimate such models using standard probit or logistic regression."

Since I do have the denominator, it seems I can avoid fractional outcome regression and stick with standard logistic regression.

However, how exactly do I fit a logistic regression using the denominator information?

First note that if you know the percentage and the denominator, then you also know the numerator. So, for example, if you know that in a specific class (in your example, the class is state) the proportion of positive cases is $0.6$, and the denominator of that proportion is $10$, then you immediately know that there are

  • 6 positive (y = 1) cases in that class.
  • 4 negative (y = 0) cases in that class.

With this information you can, in principle, create a new dataset by expanding your grouped data into one row per case. In this example you would end up with

  • 6 rows for the class with $y = 1$.
  • 4 rows for the class with $y = 0$.

Now you can use this new dataset to fit a logistic regression.
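The expansion step can be sketched in a few lines of Python. This is only an illustration: the function name `expand_grouped`, the tuple layout `(x, n_positive, n_total)`, and the numbers in `groups` are all made up for the example and do not come from the question's data.

```python
def expand_grouped(groups):
    """Turn (x, n_positive, n_total) records into one (x, y) row per case."""
    rows = []
    for x, n_pos, n_total in groups:
        rows.extend([(x, 1)] * n_pos)              # one row per positive case
        rows.extend([(x, 0)] * (n_total - n_pos))  # one row per negative case
    return rows

# Hypothetical per-state data: (average sentiment, count of y = 1, denominator)
groups = [(-0.3, 2, 10), (0.1, 6, 10), (0.5, 9, 12)]
expanded = expand_grouped(groups)
```

Every row produced for a given class carries the same predictor value, which is why the expanded rows can later be collapsed back into counts.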

In practice, you simply observe that each row in this imaginary expanded data set contributes one term to the log-likelihood (the negative of the usual log loss)

$$ L = \sum_i y_i \log(p_i) + (1 - y_i) \log(1 - p_i) $$

and, because all rows in a class share the same predictor values and therefore the same $p_i$, each of the expanded rows in a class where $y = 1$ contributes the same amount, with the same thing holding for the rows where $y = 0$. So, instead of physically creating the expanded data set, we can just apply integer weights to the terms in this sum

$$ L = \sum_i w_i y_i \log(p_i) + w_i' (1 - y_i) \log(1 - p_i) $$

where the $w$s and $w'$s are the numbers of positive and negative cases in each class.
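To make the equivalence concrete, here is a minimal pure-Python sketch. It assumes a single predictor plus an intercept, and the helper names (`loglik_expanded`, `loglik_weighted`, `fit_weighted`) and the data in `groups` are hypothetical. The fit uses plain Newton iterations, one reasonable way to maximise the weighted sum, not necessarily what any given package does internally.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loglik_expanded(b0, b1, rows):
    """Sum of y*log(p) + (1 - y)*log(1 - p) over individual (x, y) rows."""
    total = 0.0
    for x, y in rows:
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def loglik_weighted(b0, b1, groups):
    """Same quantity from (x, n_pos, n_total) records, using counts as weights."""
    total = 0.0
    for x, n_pos, n_total in groups:
        p = sigmoid(b0 + b1 * x)
        total += n_pos * math.log(p) + (n_total - n_pos) * math.log(1 - p)
    return total

def fit_weighted(groups, n_iter=25):
    """Maximise the weighted log-likelihood with Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, n_pos, n_total in groups:
            p = sigmoid(b0 + b1 * x)
            r = n_pos - n_total * p      # observed minus expected positives
            v = n_total * p * (1 - p)    # binomial variance of the count
            g0 += r
            g1 += r * x
            h00 += v
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g explicitly
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical per-state data: (average sentiment, count of y = 1, denominator)
groups = [(-0.3, 2, 10), (0.1, 6, 10), (0.5, 9, 12)]
# The equivalent expanded data set, one (x, y) row per case
rows = [(x, y) for x, n_pos, n_total in groups
        for y in [1] * n_pos + [0] * (n_total - n_pos)]

b0, b1 = fit_weighted(groups)
```

For any coefficients, `loglik_expanded(b0, b1, rows)` and `loglik_weighted(b0, b1, groups)` agree up to floating-point rounding, so fitting on the counts gives the same estimates as fitting on the expanded rows. In practice many GLM routines accept grouped binomial data directly, for example R's `glm` with a `cbind(successes, failures)` response and `family = binomial`.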
