I would like to generate some artificial data to evaluate an algorithm for classification (the algorithm induces a model that predicts posterior probabilities).

These are some basic properties of the dataset:

- Features have to be continuous
- Response variable is dichotomous (either 0 or 1)

I would like to test whether the algorithm can cope with:

- Many features / high-dimensional problems
- Noise (the algorithm can drop features)
- Multi-modality
- ??? (how do I simulate correlation etc.)

I intend to implement the algorithm in R or Matlab. I can sample from multivariate normal distributions and specify a covariance matrix.

I would appreciate any feedback.


#### Best Answer

One idea is to generate something like the Madelon set from the NIPS 2003 challenge; it fits your requirements pretty well.

You can generate a set like this starting with `mlbench.xor` (or `mlbench.hypercube`, which might be easier) from the mlbench package. Then combine the classes it generates into two groups to make the dichotomous response, and add new attributes to increase dimensionality: some being random linear combinations of the original ones, some being just random noise.
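The recipe above can be sketched without any packages. What follows is a rough, language-neutral illustration in plain Python rather than the mlbench implementation or the actual Madelon generator. The function name `make_madelon_like` and all of its parameters and defaults are made up for this example. It places one Gaussian cluster on each vertex of a hypercube, merges the clusters at random into two classes, then appends random linear combinations and pure-noise columns.

```python
import itertools
import random


def make_madelon_like(n_samples=200, n_informative=3, n_redundant=5,
                      n_noise=10, cluster_sd=0.3, seed=0):
    """Return (X, y): rows of continuous features and 0/1 labels.

    Hypothetical helper for illustration; all defaults are arbitrary.
    """
    rng = random.Random(seed)

    # One Gaussian cluster per hypercube vertex, coordinates in {-1, +1}.
    vertices = list(itertools.product((-1.0, 1.0), repeat=n_informative))

    # Merge the clusters into two groups: shuffle, then alternate labels.
    # Each class becomes a mixture of several clusters (multi-modal).
    shuffled = vertices[:]
    rng.shuffle(shuffled)
    vertex_label = {v: i % 2 for i, v in enumerate(shuffled)}

    # Fixed random weights: each redundant feature is a linear combination
    # of the informative ones, so it is correlated with them.
    weights = [[rng.gauss(0.0, 1.0) for _ in range(n_informative)]
               for _ in range(n_redundant)]

    X, y = [], []
    for _ in range(n_samples):
        v = rng.choice(vertices)
        # Informative features: cluster centre plus Gaussian jitter.
        informative = [c + rng.gauss(0.0, cluster_sd) for c in v]
        redundant = [sum(w * x for w, x in zip(row, informative))
                     for row in weights]
        # Noise features: independent of the class label.
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        X.append(informative + redundant + noise)
        y.append(vertex_label[v])
    return X, y


X, y = make_madelon_like()
print(len(X), len(X[0]), sorted(set(y)))
```

Columns are ordered informative, redundant, noise, so you can check afterwards which features the algorithm kept. Raising `n_noise` stresses the high-dimensional case, and raising `cluster_sd` makes the classes overlap more. In R, the same steps map onto `mlbench.hypercube` for the clusters and a matrix product for the linear combinations.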