Solved – Learning to create samples from an unknown distribution

I am interested in generating new samples to approximate some unknown distribution X, where each sample is a real-valued vector.

The purpose is to be able to create a new (arbitrarily large) stream of samples from this approximate distribution that will be distributed in the same way, or as close as possible, to the original sampled data.

Some additional points:

  • I will have a large number of samples from X, e.g. in the millions, and possibly too large to fit in memory.
  • The probability distribution X could be either discrete or continuous. It is likely to be multi-modal. Very extreme values are unlikely.
  • If needed I can normalise the data or scale it to fit some bounds.
  • Dimensionality of each sample is reasonably large (say 1000).
  • Samples can be assumed to be independent
  • Samples can be assumed to be almost identically distributed, although they represent a time series, so it is possible that the underlying distribution may be changing very slowly. This change is unlikely to be large enough to matter much.
  • Ideally I'd like the algorithm to be online, i.e. the model distribution can be updated incrementally as new real samples become available.

What is the best algorithm to "learn" how to generate new samples with a probability distribution that approximates X as closely as possible?

Basically, it sounds like you want to bootstrap your data:

A good (and relatively cheap) reference is: "Bootstrap Methods and Their Applications" by A. C. Davison and D. V. Hinkley (1997, CUP).

which has an associated R package, "boot".
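To make the suggestion concrete, here is a minimal sketch of the nonparametric bootstrap in Python (rather than R's "boot"): each synthetic sample is simply a row of the original data drawn uniformly at random with replacement. The array shapes and the `bootstrap_stream` helper are illustrative assumptions, not part of any library API.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy stand-in for the real data: n observations, each a 1000-dim vector.
data = rng.normal(size=(10_000, 1000))

def bootstrap_stream(data, n_new, rng):
    """Draw n_new synthetic samples by resampling rows with replacement.

    This is the plain nonparametric bootstrap: the "learned" distribution
    is just the empirical distribution of the observed data.
    """
    idx = rng.integers(0, len(data), size=n_new)
    return data[idx]

new_samples = bootstrap_stream(data, n_new=5, rng=rng)
print(new_samples.shape)  # (5, 1000)
```

Note that this only ever reproduces vectors already present in the data; for a continuous X one would typically smooth it (e.g. add small noise to each resampled row), and for data too large for memory the resampling can index into a memory-mapped array instead.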

BUT… there's a lot that can go wrong in bootstrapping and it's very easy to get misleading results if you don't know what you're doing (which, to be blunt, sounds likely). It would help a lot if you explained exactly what the problem is that you're trying to solve.
