# Solved – Adding weights for highly skewed data sets in logistic regression

I am using a standard version of logistic regression to fit my input variables to binary output variables.

However, in my problem the negative outputs (0s) far outnumber the positive outputs (1s), at a ratio of 20:1. When I train a classifier, even features that strongly suggest a positive output end up with very low (highly negative) values for their corresponding parameters. It seems to me that this happens because there are simply too many negative examples pulling the parameters in their direction.

So I am wondering whether I can add weights (say, 20 instead of 1) for the positive examples. Is this likely to help at all? And if so, how should I add the weights to the equations below?

The cost function looks like the following:
\$\$J = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log\big(h(x_i \cdot \theta)\big) + (1 - y_i) \log\big(1 - h(x_i \cdot \theta)\big) \right]\$\$

The gradient of this cost function (with respect to \$\theta\$) is:

\$\$\mathrm{grad} = \frac{1}{m} X^{\top} \big(h(X\theta) - y\big)\$\$

Here \$m\$ = number of training examples, \$X\$ = feature matrix (one row \$x\$ per example), \$y\$ = output vector, \$h\$ = sigmoid function, and \$\theta\$ = the parameters we are trying to learn.

Finally, I run gradient descent to find the lowest possible \$J\$. The implementation seems to run correctly.
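The setup described above can be sketched in NumPy roughly as follows. This is an illustration under assumptions, not the actual implementation: the function names, the fixed learning rate `lr`, the iteration count `n_iter`, and the clipping constant are all hypothetical. The optional `sample_weight` argument shows one possible place where per-example weights could enter the cost and gradient; with the default of all ones it reduces to the unweighted equations.

```python
import numpy as np

def sigmoid(z):
    """Logistic function h(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y, sample_weight=None):
    """Cross-entropy cost J and its gradient with respect to theta.

    sample_weight is a hypothetical per-example weight vector
    (an assumption for illustration); None means all weights are 1.
    """
    m = X.shape[0]
    if sample_weight is None:
        sample_weight = np.ones(m)
    h = sigmoid(X @ theta)
    # Clip predictions so log() never receives exactly 0 or 1.
    h_safe = np.clip(h, 1e-12, 1 - 1e-12)
    per_example = y * np.log(h_safe) + (1 - y) * np.log(1 - h_safe)
    J = -np.sum(sample_weight * per_example) / m
    grad = X.T @ (sample_weight * (h - y)) / m
    return J, grad

def gradient_descent(X, y, sample_weight=None, lr=0.1, n_iter=2000):
    """Plain batch gradient descent starting from theta = 0."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        _, grad = cost_and_grad(theta, X, y, sample_weight)
        theta -= lr * grad
    return theta
```

For example, weighting each positive example by 20 would amount to calling `gradient_descent(X, y, sample_weight=np.where(y == 1, 20.0, 1.0))`. Whether this actually improves the classifier is exactly the open question here.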
