Solved – Is an ID column in a regression bad?

Is it always a bad idea to have something like user_id or store_id in a regression or tree ensemble?

Edit: I have a dataset of cyclists and tracks. I'm trying to use a tree-based regressor to predict finishing times. I'm using qualities of the track (distance, shape) and of the cyclist (age, years active), plus some window functions like last_finishing_time and lifetime_avg_finish_at_this_track.

I was wondering if it's worth throwing in track_id or cyclist_id as a categorical variable. My fear is that it will create a ton of one-hot variables (e.g. track_id_5, track_id_6, cyclist_10, cyclist_11) that will cause issues in my analysis.

My answer concerns your problem in an ordinary least squares regression setting.

First and foremost: if you have only one observation per individual, then including an indicator variable for the user ID means you have more parameters to estimate than observations, so the model is unidentifiable. If by including it in the model you mean using it as a continuous variable, then it is very unclear what the corresponding regression coefficient would even mean. Sure, there might be corner cases where it is reasonable to do so. For instance, if IDs are assigned sequentially (the first individual gets ID 1, the second gets ID 2, and so on), then ID could perhaps be included as a continuous variable if you want to adjust for order of inclusion. But as a general rule of thumb: no.
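To make the identifiability point concrete, here is a minimal sketch using NumPy. The data are made up for illustration (six hypothetical users, one observation each, and a single covariate `x`); the names `user_id`, `x`, and `X` are assumptions and not from the original question.

```python
import numpy as np

rng = np.random.default_rng(0)

n_obs = 6
user_id = np.arange(n_obs)      # hypothetical: every user appears exactly once
x = rng.normal(size=n_obs)      # one ordinary covariate

# Design matrix: intercept, x, and one indicator per user
# (the first user is dropped as the reference level).
intercept = np.ones((n_obs, 1))
dummies = np.eye(n_obs)[user_id][:, 1:]
X = np.hstack([intercept, x[:, None], dummies])

n_params = X.shape[1]
rank = np.linalg.matrix_rank(X)

print(f"observations: {n_obs}, parameters: {n_params}, rank of X: {rank}")
```

Here the design matrix has 7 columns but only 6 rows, so its rank cannot exceed 6. The least squares problem then has infinitely many solutions that fit the data perfectly, which is what "unidentifiable" means in practice.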

Now, if you have more than one observation per individual, your observations are not independent of each other, and that needs to be accounted for in the analysis. One way of doing that would be to include the ID in the regression model. However, if you consider the individuals as random representatives of some population, then this is not advisable, since prediction for new individuals would be difficult. (They would not have the same ID as any individual in the study.) In those cases, I'd instead look into using a linear mixed model and include a random intercept for each individual.
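As a rough sketch of the random-intercept idea, the example below fits a linear mixed model with statsmodels on simulated data. Everything about the data is an assumption made for illustration: 30 hypothetical cyclists with 8 races each, a single covariate `distance`, and invented effect sizes. Only the column names echo the question.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n_cyclists, n_races = 30, 8

# Simulated data (hypothetical): each cyclist has a personal offset
# that shifts all of their finishing times up or down.
cyclist_id = np.repeat(np.arange(n_cyclists), n_races)
cyclist_offset = rng.normal(0.0, 5.0, size=n_cyclists)
distance = rng.uniform(10.0, 50.0, size=n_cyclists * n_races)
noise = rng.normal(0.0, 2.0, size=distance.size)
finish_time = 10.0 + 2.0 * distance + cyclist_offset[cyclist_id] + noise

df = pd.DataFrame(
    {"cyclist_id": cyclist_id, "distance": distance, "finish_time": finish_time}
)

# Fixed effect for distance, random intercept per cyclist.
model = smf.mixedlm("finish_time ~ distance", data=df, groups=df["cyclist_id"])
result = model.fit()

# Fixed-effect estimates (intercept and slope shared by all cyclists).
fe = result.fe_params
print(fe)

# Prediction for a cyclist not seen in training: the random intercept
# is unknown, so use its expected value of zero, i.e. fixed effects only.
new_distance = 30.0
pred_new = fe["Intercept"] + fe["distance"] * new_distance
print(pred_new)
```

Unlike the indicator-variable approach, the model estimates only the variance of the per-cyclist intercepts rather than one free coefficient per ID, so a new cyclist can still be scored from the population-level fixed effects.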
