I have X and Y variables, as well as a cluster variable (State). X and State are derived from Database A, while Y and State are derived from Database B.

X is a sentiment score ranging between -1 and 1, while Y is a yes or no (0 or 1) response.

In Database A, I aggregate X into average-X by state, while in Database B, I aggregate Y into percentage-Y by state. Then I combine the two datasets by State.

In the combined data structure, my new outcome is percentage-Y, and I also have the numerator and denominator that give rise to percentage-Y.

I read the following here: "The most natural way fractional responses arise is from averaged 0/1 outcomes. In such cases, if you **know the denominator**, you want to estimate such models using **standard** probit or logistic regression".

Since I do have the denominator information, it seems I can avoid fractional outcome regression and stick with standard logistic regression.

However, how exactly do I fit a logistic regression using the denominator information?


#### Best Answer

First note that if you know the percentage and the denominator, then you also know the numerator. So, for example, if you know for a specific class (in your example, the class is state) that the proportion of positive cases is $0.6$, and the denominator of that proportion is $10$, then you immediately know that there are

- 6 positive (y = 1) cases in that class.
- 4 negative (y = 0) cases in that class.
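The arithmetic above can be sketched in a few lines of Python. The function name `counts_from_proportion` is hypothetical, and the rounding step is an assumption to guard against floating-point error in a stored proportion.

```python
def counts_from_proportion(proportion, denominator):
    """Recover (positives, negatives) from a proportion and its denominator."""
    # round() protects against tiny floating-point errors in the stored proportion.
    positives = round(proportion * denominator)
    negatives = denominator - positives
    return positives, negatives


positives, negatives = counts_from_proportion(0.6, 10)
print(positives, negatives)  # 6 4
```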

With this information you can, in principle, create a new dataset expanding your grouped data. In this example you would end up with

- 6 rows for the class with $y = 1$.
- 4 rows for the class with $y = 0$.

Now you can use this new dataset to fit a logistic regression.
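As a rough illustration of the expand-then-fit idea, here is a minimal sketch in plain Python. The grouped data are made up, the model has a single predictor (the state-level average X), and the Newton-Raphson fitter `fit_logistic` is a hand-rolled stand-in for whatever logistic regression routine you normally use.

```python
import math

# Hypothetical grouped data: (average X for the state, positives, negatives).
grouped = [
    (-0.5, 2, 8),
    (0.0, 5, 5),
    (0.4, 6, 4),
    (0.8, 9, 3),
]


def expand(grouped):
    """Turn each group into one (x, y) row per underlying 0/1 observation."""
    rows = []
    for x, positives, negatives in grouped:
        rows.extend([(x, 1)] * positives)
        rows.extend([(x, 0)] * negatives)
    return rows


def fit_logistic(rows, iterations=25):
    """Fit logit(p) = b0 + b1 * x by Newton-Raphson on (x, y) rows."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in rows:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            # Gradient of the log-likelihood.
            g0 += y - p
            g1 += (y - p) * x
            # Negative Hessian of the log-likelihood.
            v = p * (1.0 - p)
            h00 += v
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g in closed form.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


rows = expand(grouped)
b0, b1 = fit_logistic(rows)
print(len(rows), b0, b1)
```

The expanded dataset has one row per original 0/1 response, so its size equals the sum of the denominators.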

In practice, you simply observe that each row in this imaginary expanded dataset contributes one term to the log-likelihood (whose negative is the loss function being minimized)

$$ L = \sum_i y_i \log(p_i) + (1 - y_i) \log(1 - p_i) $$

and each of the expanded rows in a class where $y = 1$ contributes the *same amount*, because all rows in a class share the same predictors and therefore the same $p_i$. The same thing holds for the rows where $y = 0$. So, instead of physically creating the expanded dataset, we can just apply integer weights to the terms in our loss function

$$ L = \sum_i w_i y_i \log(p_i) + w_i' (1 - y_i) \log(1 - p_i) $$

where the sum now needs only one $y = 1$ row and one $y = 0$ row per class, and the $w$s and $w'$s are the number of positive and negative cases in each class.
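To check that the weighted form really matches the expanded one, the sketch below evaluates both at an arbitrary pair of coefficients. The data, the coefficient values, and the function names are all made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loglik_expanded(rows, b0, b1):
    """Sum one term per individual (x, y) row."""
    total = 0.0
    for x, y in rows:
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def loglik_weighted(grouped, b0, b1):
    """Sum one pair of terms per class, weighted by its positive/negative counts."""
    total = 0.0
    for x, w_pos, w_neg in grouped:
        p = sigmoid(b0 + b1 * x)
        total += w_pos * math.log(p) + w_neg * math.log(1 - p)
    return total


# Hypothetical grouped data: (average X, positives, negatives).
grouped = [(-0.5, 2, 8), (0.4, 6, 4), (0.8, 9, 3)]
rows = [(x, y) for x, w_pos, w_neg in grouped for y in [1] * w_pos + [0] * w_neg]

expanded_value = loglik_expanded(rows, 0.1, 1.5)
weighted_value = loglik_weighted(grouped, 0.1, 1.5)
print(math.isclose(expanded_value, weighted_value))  # True
```

Because the two objectives agree at every coefficient value, they share the same maximizer. Many packages accept the counts directly, for example R's `glm(cbind(successes, failures) ~ x, family = binomial)`, and statsmodels' binomial GLM is documented to accept a two-column (successes, failures) response, so you rarely need to write the weighted sum yourself.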
