In a dataset, income is recorded in groups rather than as exact values, e.g. person 1 is in income group 5, person 2 in income group 11, etc. The groups are of unequal width (e.g. group 1: 0 < x < 2500, group 11: 7000 < x < 100000).

I would like to model the education level of a kindergarten child (assuming it depends on parental income). Do I have to use dummy variables for each category, or can I, under the assumption of a uniform distribution within each group, use the group midpoints as if they were actual income values?

#### Best Answer

The usual approaches I tend to see are either

(i) to ignore the ordering and treat the groups as nominal categories (thus throwing away a lot of potential information); or

(ii) to use scores of some kind, often in a fairly arbitrary fashion (your midpoints would count), in effect *imposing* a lot of information you don't really have.

Neither is necessarily bad; both make compromises.
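To make the trade-off concrete, here is a minimal sketch comparing the two codings on simulated data. The bin edges, the four-group setup, and the continuous outcome `y` are all assumptions made up for illustration; they are not the asker's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bin edges for 4 income groups (illustration only).
edges = np.array([0.0, 2500.0, 5000.0, 7000.0, 100000.0])
midpoints = (edges[:-1] + edges[1:]) / 2

n = 200
group = rng.integers(0, 4, size=n)            # observed income group, 0..3
y = 1.0 + 0.5 * group + rng.normal(0, 1, n)   # simulated outcome

# (i) Nominal coding: intercept plus one dummy per non-reference group.
X_dummy = np.column_stack(
    [np.ones(n)] + [(group == k).astype(float) for k in (1, 2, 3)]
)

# (ii) Score coding: intercept plus the midpoint of each person's group.
X_score = np.column_stack([np.ones(n), midpoints[group]])

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

rss_dummy = rss(X_dummy, y)   # 4 parameters, no ordering assumed
rss_score = rss(X_score, y)   # 2 parameters, linear in the midpoint
```

Because the midpoint column is a linear combination of the intercept and the dummies, the score model is nested in the dummy model, so `rss_dummy` can never exceed `rss_score`. The score model pays for its two-parameter simplicity by assuming the effect is linear in the midpoint, which the very wide top bin makes a strong assumption.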

R treats ordered factors differently from nominal factors: it fits orthogonal polynomials to the numbered levels (i.e. it treats their values as equally spaced, 0, 1, 2, ...), which potentially overcomes the scoring issue. The first few such terms would tend to account for smooth functions of the level, but the full set of terms would correspond to the fit of a nominal category (if set up in a somewhat more interpretable way).
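As a rough illustration of the idea (not R's own code), orthogonal polynomial contrasts for equally spaced levels can be built by orthogonalizing the columns 1, x, x², ... evaluated at the level scores. The function name `poly_contrasts` is hypothetical, and column signs may differ from what R reports.

```python
import numpy as np

def poly_contrasts(k):
    """Orthonormal polynomial contrasts for k equally spaced levels."""
    scores = np.arange(k, dtype=float)
    # Columns 1, x, x^2, ..., x^(k-1) at the centered level scores.
    V = np.vander(scores - scores.mean(), N=k, increasing=True)
    # QR orthogonalizes the columns in order of increasing degree.
    Q, _ = np.linalg.qr(V)
    # Drop the constant column; the rest are linear, quadratic, ...
    return Q[:, 1:]

C = poly_contrasts(4)   # 4 levels -> linear, quadratic, cubic columns
```

Using only the first column or two of `C` gives a smooth, low-order trend across the levels. Using all `k - 1` columns together with an intercept spans the same space as a full set of dummy variables, so the fit is then identical to the nominal one.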

Another alternative is to fit some kind of smooth monotonic function (perhaps a monotonic spline, say) to whatever scores you have.
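A monotonic spline needs more machinery than fits in a short example, so the sketch below swaps in a simpler relative: isotonic regression via the pool-adjacent-violators algorithm. It gives a monotone step function rather than a smooth curve. The group means and sizes are invented for illustration, and the function name `pava` is hypothetical.

```python
import numpy as np

def pava(y, w):
    """Weighted least-squares fit of a non-decreasing sequence to y."""
    blocks = []  # each block holds [mean, total weight, number of levels]
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Merge backwards while adjacent blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, c1 + c2])
    return np.concatenate([np.full(c, m) for m, _, c in blocks])

# Hypothetical mean outcome per ordered income group, with group sizes.
group_means = np.array([1.0, 3.0, 2.0, 4.0])
group_sizes = np.array([10.0, 10.0, 30.0, 10.0])

fitted = pava(group_means, group_sizes)
```

Here groups 2 and 3 are out of order, so they are pooled to their weighted mean of 2.25, giving fitted values 1, 2.25, 2.25, 4. The fit uses only the ordering of the groups, not the spacing of any scores, which sits between the nominal and the midpoint approaches.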
