I'm trying to find a simple way to split some longitudinal data into a training and test set. I'm familiar with using the caret package in R to make stratified splits, but only with wide-form data. It looks like somebody has written a function to do this in Python, but I haven't learned that language yet.
In my case, I'd like to make a stratified split on an outcome classification (which, incidentally, does not change over time) in a data set where each individual has more than one observation, such that if an individual is assigned to the training or test set, all of their observations end up in that same set.
I'd like to avoid having to transpose, then split, then transpose both training and test sets back to longitudinal format.
The only way I can think to do this (so far) is the following code, which I put together from various sources on this site (1, 2), but I'm not sure whether (a) it is 100% accurate or (b) there isn't a better solution.
```r
library(dplyr)
set.seed(1)

train <- data %>%
  select(ID, outcome) %>%
  distinct() %>%
  group_by(outcome) %>%
  sample_frac(0.8) %>%
  left_join(data)

test <- data[!(data$ID %in% train$ID), ]
```
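For what it's worth, a couple of quick checks can confirm that a split like this behaves as intended: no individual should appear in both sets, every observation should land in exactly one set, and the outcome proportions should be similar across sets. A minimal sketch, assuming the `data`, `train`, and `test` objects from the code above and that `outcome` is constant within each `ID` (as stated in the question):

```r
library(dplyr)

# No individual should contribute observations to both sets
stopifnot(length(intersect(train$ID, test$ID)) == 0)

# Every observation should end up in exactly one of the two sets
stopifnot(nrow(train) + nrow(test) == nrow(data))

# Compare the per-individual class balance of the outcome across the two sets
bind_rows(train = train, test = test, .id = "set") %>%
  distinct(set, ID, outcome) %>%
  count(set, outcome) %>%
  group_by(set) %>%
  mutate(prop = n / sum(n))
```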
Best Answer
Just use sample() to choose some number of groups, after converting your ID to a factor. For example:
```r
smp_size <- floor(0.80 * length(unique(iris$Species)))
iris %>% filter(Species %in% sample(levels(Species), smp_size))
```
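Applied to the question's longitudinal setting, the same idea can be used at the level of IDs rather than rows: sample individuals within each outcome stratum, then keep every observation belonging to the sampled individuals. A rough sketch, assuming a data frame `data` with `ID` and `outcome` columns as in the question, and using dplyr's slice_sample() (a newer equivalent of the sample_frac() call used above):

```r
library(dplyr)
set.seed(1)

# One row per individual, so sampling operates on IDs, not observations
ids <- data %>% distinct(ID, outcome)

# Sample 80% of IDs within each outcome class (the stratification step)
train_ids <- ids %>%
  group_by(outcome) %>%
  slice_sample(prop = 0.8) %>%
  pull(ID)

# Keep all observations of a sampled individual together
train <- data %>% filter(ID %in% train_ids)
test  <- data %>% filter(!(ID %in% train_ids))
```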
Similar Posts:
- Solved – Splitting an imbalanced dataset for training and testing
- Solved – Splitting data into test/train set vs. using k-fold cross validation
- Solved – Intuitive explanation of stratified cross validation and nested cross validation
- Solved – Why does randomForest have higher test AUC than train AUC? Is this possible?