I have an ordinal categorical variable with five levels, the last two of which have only one observation each. Should I leave them alone, omit them, merge them into another category, or do something else?

More generally, what is the strategy for handling responses that are too heavily skewed? I'd also appreciate any references you can recommend.


#### Best Answer

Where there's the smoke of an ordinal variable, there's the fire of a latent continuous variable smoldering beneath it. If you can conceive of such a latent variable in this case, NonSleeper, then you have the opportunity to view your problem as one of *interval censoring*. To make this idea concrete, imagine you ran a clinical trial on a very tight budget, where you could only afford a broken digital scale to weigh the subjects. The display is damaged such that only the hundreds-place digit can be read. Your subjects' weights were thus coded 0 (<100 lbs), 1 (100-199 lbs), 2 (200-299 lbs) and 3 (300+ lbs). Each code then tells you only the interval in which the true weight lies.

CRAN lists several R packages that can estimate certain types of models in the presence of interval censoring. My own approach has been simply to use Bayesian methods, which let me flexibly express prior knowledge about the latent variable (say, a known, age-dependent distribution of weight in the population from which study subjects were drawn) and incorporate other study measures (say, waist circumference) in a theoretically grounded way, so as to make the most efficient use of all the study data.
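To show what treating the coded weights as interval-censored might look like in practice, here is a minimal Python sketch. It is not the Bayesian approach described above, nor any CRAN package's method. It assumes, purely for illustration, that the latent weight is normally distributed, uses invented counts per code (`COUNTS` is hypothetical), and finds the maximum-likelihood mean and standard deviation by a crude grid search. Each code contributes the probability that the latent weight falls in its interval, so even a category with very few observations enters the likelihood without being merged or dropped.

```python
import math

# Intervals implied by the broken-scale coding:
# 0 (<100 lbs), 1 (100-199 lbs), 2 (200-299 lbs), 3 (300+ lbs).
EDGES = [(-math.inf, 100.0), (100.0, 200.0), (200.0, 300.0), (300.0, math.inf)]

# Hypothetical number of subjects observed with each code (invented data).
COUNTS = [55, 601, 336, 8]


def normal_cdf(x, mu, sigma):
    """CDF of a Normal(mu, sigma) latent weight; handles +/- infinity."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def log_likelihood(mu, sigma, edges=EDGES, counts=COUNTS):
    """Interval-censored log-likelihood: each observation contributes
    log P(lower <= weight < upper) for the interval its code implies."""
    total = 0.0
    for (lower, upper), n in zip(edges, counts):
        p = normal_cdf(upper, mu, sigma) - normal_cdf(lower, mu, sigma)
        if p <= 0.0:
            # Parameters that make an interval impossible are ruled out.
            return -math.inf
        total += n * math.log(p)
    return total


def fit_by_grid(mu_grid, sigma_grid):
    """Crude grid search for the maximum-likelihood (mu, sigma)."""
    best = (-math.inf, None, None)
    for mu in mu_grid:
        for sigma in sigma_grid:
            ll = log_likelihood(mu, sigma)
            if ll > best[0]:
                best = (ll, mu, sigma)
    return best


# Search mu over 100..300 lbs and sigma over 10..150 lbs in 1 lb steps.
best_ll, mu_hat, sigma_hat = fit_by_grid(range(100, 301), range(10, 151))
print(f"mu_hat={mu_hat}, sigma_hat={sigma_hat}, loglik={best_ll:.2f}")
```

The grid search is only there to keep the sketch dependency-free; a real analysis would more likely hand the same likelihood to a proper optimizer or, as suggested above, place priors on the latent distribution and fit it with Bayesian software.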