Using grouped variables in regression

In a dataset, income is recorded in groups, e.g. person 1 is in income group 5, person 2 in income group 11, etc. The groups are of unequal width (e.g. group 1: 0 < x < 2500, group 11: 7000 < x < 100000).

I would like to model the education level of a kindergarten child (assuming it depends on parental income). Do I have to use dummy variables for each category, or can I, under the assumption of a uniform distribution within each group, use the group midpoints as if they were actual income values?

The usual approaches I tend to see are either

(i) to ignore the ordering and treat the groups as nominal categories (thus throwing away a lot of potential information); or

(ii) to use scores of some kind, often chosen in a fairly arbitrary fashion (your midpoints would count), in effect imposing a lot of information you don't really have.

Neither is necessarily bad; both involve compromises.
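To make the two codings concrete, here is a minimal sketch in plain Python. The bracket edges, the respondent list and the helper names `midpoint_scores` and `dummy_columns` are all made up for illustration; they are not from the question's data.

```python
# Hypothetical bracket edges (lower, upper) for three income groups;
# these numbers are illustrative only.
brackets = {1: (0, 2500), 2: (2500, 4000), 3: (4000, 7000)}

# Group membership for five hypothetical respondents.
groups = [1, 3, 2, 3, 1]

def midpoint_scores(groups, brackets):
    """Approach (ii): replace each group label by its bracket midpoint."""
    return [(brackets[g][0] + brackets[g][1]) / 2 for g in groups]

def dummy_columns(groups, levels, baseline):
    """Approach (i): one 0/1 indicator column per non-baseline level."""
    kept = [lv for lv in levels if lv != baseline]
    return [[1 if g == lv else 0 for lv in kept] for g in groups]

scores = midpoint_scores(groups, brackets)
dummies = dummy_columns(groups, sorted(brackets), baseline=1)

print(scores)   # [1250.0, 5500.0, 3250.0, 5500.0, 1250.0]
print(dummies)  # [[0, 0], [0, 1], [1, 0], [0, 1], [0, 0]]
```

The midpoint coding uses a single regressor and so forces one slope across all groups, while the dummy coding spends one parameter per non-baseline group and makes no use of the ordering.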

R treats ordered factors differently from nominal factors: by default it fits orthogonal polynomial contrasts to the numbered levels (i.e. it treats the levels as equally spaced scores 1, 2, 3, ...), which potentially sidesteps the scoring issue. The first few such terms would tend to account for smooth functions of the level, while including all possible terms gives the same fit as a nominal category (just set up in a somewhat more interpretable way).
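As a rough illustration of what such contrasts look like, the sketch below builds orthonormal polynomial columns for equally spaced levels by Gram-Schmidt orthogonalization. This is the same idea as R's `contr.poly`, not its actual implementation, and the function name `poly_contrasts` is an assumption made for this example.

```python
import math

def poly_contrasts(k):
    """Orthonormal polynomial contrast columns for k equally spaced levels."""
    x = list(range(1, k + 1))
    # Raw columns: x^0 (constant), x^1, ..., x^(k-1).
    raw = [[xi ** d for xi in x] for d in range(k)]
    basis = []
    for col in raw:
        v = [float(c) for c in col]
        # Remove the projection onto every column already in the basis.
        for b in basis:
            dot = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - dot * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        basis.append([vi / norm for vi in v])
    # Drop the constant column; what remains is linear, quadratic, ...
    return basis[1:]

linear, quadratic, cubic = poly_contrasts(4)
print([round(c, 4) for c in linear])
print([round(c, 4) for c in quadratic])
print([round(c, 4) for c in cubic])
```

With k levels there are k - 1 such columns. Together with the intercept they span the same space as a full set of dummy variables, which is why keeping every term reproduces the nominal fit, while keeping only the linear and perhaps quadratic terms gives a smoother, lower-dimensional description.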

Another alternative is to fit some kind of smooth monotonic function (a monotonic spline, say) to whatever scores you have.
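A monotonic spline needs a dedicated library, so the sketch below swaps in a simpler relative: isotonic regression via the pool-adjacent-violators algorithm. It yields a monotone step function rather than a smooth curve, so it only illustrates the monotonicity constraint, not the smoothness. The per-group means and the name `isotonic_fit` are hypothetical.

```python
def isotonic_fit(y, w=None):
    """Least-squares non-decreasing fit to y (pool adjacent violators)."""
    if w is None:
        w = [1.0] * len(y)
    # Each block holds [weighted mean, total weight, number of points].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, n1 + n2])
    # Expand each block back to one fitted value per original point.
    fitted = []
    for mean, _, n in blocks:
        fitted.extend([mean] * n)
    return fitted

# Hypothetical mean outcome per income group, listed in group order.
group_means = [2.0, 3.0, 2.5, 4.0, 3.5, 5.0]
fitted = isotonic_fit(group_means)
print(fitted)  # [2.0, 2.75, 2.75, 3.75, 3.75, 5.0]
```

Passing the group sizes as `w` would weight each group mean by the number of observations behind it, which matters when the groups are very unequal in size.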
