I would like to generate some artificial data to evaluate an algorithm for classification (the algorithm induces a model that predicts posterior probabilities).
These are some basic properties of the dataset:
- Features have to be continuous
- Response variable is dichotomous (either 0 or 1)
I would like to test whether the algorithm can cope with:
- Many features / high-dimensional problems
- Noise (the algorithm can drop features)
- Multi-modality
- ??? (how do I simulate correlation etc.)
I intend to implement the algorithm in R or Matlab. I can sample from multivariate normal distributions and specify a covariance matrix.
I would appreciate any feedback.
Best Answer
One idea is to generate something like the Madelon dataset from the NIPS 2003 challenge; it fits your requirements quite well.
You can generate a set like this by starting with mlbench.xor (or mlbench.hypercube, which might be easier) from the mlbench package. Then combine the classes it generates into two groups to make the dichotomous response, and add new attributes to increase the dimensionality: some being random linear combinations of the original ones, some being just random noise.
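The recipe above can be sketched roughly as follows. This is a minimal illustration in plain Python (standard library only), not the actual Madelon generator and not the mlbench API; the function name `make_madelon_like` and all of its parameters are made up for this example. It places Gaussian clusters on the corners of a hypercube, labels each corner by parity (XOR-style) so that every class is a mixture of several clusters, then appends random linear combinations of the informative features and pure noise features.

```python
import random

def make_madelon_like(n_samples=200, n_informative=3, n_redundant=5,
                      n_noise=10, cluster_sd=0.3, seed=0):
    """Hypothetical helper: Madelon-style two-class data with continuous features."""
    rng = random.Random(seed)
    # Fixed random weights: each redundant feature is one linear
    # combination of the informative features (this creates correlation).
    weights = [[rng.gauss(0, 1) for _ in range(n_informative)]
               for _ in range(n_redundant)]
    X, y = [], []
    for _ in range(n_samples):
        # Pick a corner of the hypercube {-1, +1}^n_informative.
        corner = [rng.choice((-1, 1)) for _ in range(n_informative)]
        # XOR-style label: parity of the number of negative coordinates,
        # so each class is spread over several clusters (multi-modal).
        label = sum(c < 0 for c in corner) % 2
        # Informative features: the corner plus Gaussian jitter.
        informative = [c + rng.gauss(0, cluster_sd) for c in corner]
        redundant = [sum(w * v for w, v in zip(row, informative))
                     for row in weights]
        # Noise features carry no information about the label.
        noise = [rng.gauss(0, 1) for _ in range(n_noise)]
        X.append(informative + redundant + noise)
        y.append(label)
    return X, y

X, y = make_madelon_like()
```

Raising `n_noise` and `n_redundant` tests high dimensionality and feature dropping, while a larger `cluster_sd` makes the clusters overlap more. The same steps should carry over to R by taking the points from mlbench.hypercube and building the extra columns with matrix operations.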