I am learning xgboost and am planning on running a tree model. My dataset includes repeated measures. In a GLMM I would include the ID to account for repeated measures and I'm curious if I should do this with xgboost.

Another approach would be to transform my dataset from a long dataset into a wide (e.g. create different columns for each time the item was measured).

Would I violate any assumptions of an xgboost tree model if I were to drop the ID column and include only the meaningful predictors?


#### Best Answer

You are correct to worry about using clustered data while ignoring its inherent clustering. This can lead to information leakage: cluster- or subject-specific variance patterns might dictate patterns that do not generalise to the underlying population, i.e. lead us to over-fit our sample data. Note also that dropping the subject information altogether does not protect us from over-fitting; our learner might still detect subject-specific patterns by itself.

A partial work-around for this issue is relatively straightforward. We do *not* segment our available data completely at random; instead, we design our training and test sets so that measurements from the same subject exist either exclusively in the training set or exclusively in the test set. This is easy to implement, as we simply need to sample subjects instead of raw measurements. We might still over-fit subject-specific patterns during training, but in theory these will be penalised during testing and thus lead us to a more universal representation of our learning task. To paraphrase Karpievitch et al. (2009), An Introspective Comparison of Random Forest-Based Classifiers for the Analysis of Cluster-Correlated Data by Way of RF++: this is effectively the idea of growing "*each tree on a bootstrap sample (a random sample selected with replacement) at the subject-level rather than at the replicate-level of the training data*".
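As a minimal sketch of that subject-level split (the function name `split_by_subject` and the toy data are hypothetical, used only for illustration), we can shuffle the unique subject IDs and assign whole subjects to one side or the other:

```python
import random

def split_by_subject(subject_ids, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) such that no subject appears in both sets."""
    unique_subjects = sorted(set(subject_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_subjects)
    # Hold out whole subjects, not individual rows.
    n_test = max(1, round(test_fraction * len(unique_subjects)))
    held_out = set(unique_subjects[:n_test])
    train_idx = [i for i, s in enumerate(subject_ids) if s not in held_out]
    test_idx = [i for i, s in enumerate(subject_ids) if s in held_out]
    return train_idx, test_idx

# Hypothetical long-format data: 6 subjects with 3 repeated measurements each.
subject_ids = [s for s in range(6) for _ in range(3)]
train_idx, test_idx = split_by_subject(subject_ids, test_fraction=1 / 3, seed=42)

train_subjects = {subject_ids[i] for i in train_idx}
test_subjects = {subject_ids[i] for i in test_idx}
```

The row indices can then be used to subset the predictors (with the ID column dropped) before fitting the booster. If scikit-learn is available, `GroupShuffleSplit` and `GroupKFold` implement the same idea through the `groups` argument of their `split` method.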

More theoretically, there has been some work specifically on the use of GBMs for clustered data (e.g. Groll & Tutz (2012) Regularization for Generalized Additive Mixed Models by Likelihood-Based Boosting or Miller et al. (2017) Gradient Boosting Machine for Hierarchically Clustered Data) that I think can be insightful for what you want. The basic idea in these works is that, given some initial estimates for our fixed effects, random effects and variance components (e.g. through `lme4::lmer`), we compute the estimates for the new fixed and random effects through gradient boosting of the penalized likelihood function. Then, holding those fixed, we re-estimate the variance components. We repeat this until satisfactory convergence, in an E-M like approach.
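To make the alternating scheme concrete, here is a heavily simplified sketch; it is *not* the algorithm of Groll & Tutz or Miller et al. It assumes a random intercept per subject only, and a plain least-squares line stands in for the boosted fixed-effects learner. All names (`fit_line`, `fit_random_intercept`) and the simulated data are hypothetical.

```python
import random
from collections import defaultdict

def fit_line(x, y):
    """Least squares for y = a + b*x; a stand-in for the boosted learner."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def fit_random_intercept(x, y, subject, n_iter=50):
    rows = defaultdict(list)
    for i, s in enumerate(subject):
        rows[s].append(i)
    u = {s: 0.0 for s in rows}   # subject-specific intercepts
    var_u, var_e = 1.0, 1.0      # initial variance components (assumed)
    for _ in range(n_iter):
        # 1. Fit the fixed part with the current subject effects removed.
        a, b = fit_line(x, [yi - u[s] for yi, s in zip(y, subject)])
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # 2. Update each subject intercept as a shrunken mean residual.
        cond_var = {}
        for s, idx in rows.items():
            denom = len(idx) * var_u + var_e
            u[s] = var_u * sum(resid[i] for i in idx) / denom
            cond_var[s] = var_u * var_e / denom
        # 3. Re-estimate the variance components, holding the effects fixed.
        sse = sum((resid[i] - u[s]) ** 2 for s, idx in rows.items() for i in idx)
        var_e = (sse + sum(len(idx) * cond_var[s] for s, idx in rows.items())) / len(y)
        var_u = sum(u[s] ** 2 + cond_var[s] for s in rows) / len(rows)
    return (a, b), u, var_u, var_e

# Simulated data (assumed): 30 subjects, 5 measurements each, true slope 2.
rng = random.Random(0)
subject, x, y = [], [], []
for s in range(30):
    u_true = rng.gauss(0, 1.5)
    for _ in range(5):
        xi = rng.uniform(0, 10)
        subject.append(s)
        x.append(xi)
        y.append(1.0 + 2.0 * xi + u_true + rng.gauss(0, 0.5))

(a_hat, b_hat), u_hat, var_u_hat, var_e_hat = fit_random_intercept(x, y, subject)
```

In the cited approaches, step 1 would be a boosting update of the penalized likelihood rather than a closed-form fit, but the alternation between updating the effects and updating the variance components has the same shape.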

A general point is that the difference between fixed and random effects is often a matter of convenience and/or existing nomenclature (see the thread: What is the difference between fixed effect, random effect and mixed effect models? for more details). Depending on the particular task, certain factors can be seen as random or as fixed. I believe that the most important thing is to ensure that we do not make unreasonable assumptions.

*Some final blue-sky thoughts*:

1. There is *one* core ML problem that is solved via GBMs and is concerned with clustered data: learning to rank. In that scenario, a *query* is the unit of analysis and the subsequent metrics (e.g. Mean reciprocal rank or (Normalised) Discounted cumulative gain) are all relevant per unit of analysis. You might be able to get some ideas from there too.
2. There are implementations for regression trees and random forests that are specifically developed for clustered data (at first instance, see Hajjem et al. (2011) Mixed Effects Regression Trees for Clustered Data and Hajjem et al. (2014) Mixed-effects random forest for clustered data respectively). Somewhat simplistically, I assume that if these procedures are used as base-learners in a boosting framework, then the boosting procedure should behave coherently when used with clustered data.