Kernel density estimation vs. machine learning for forecasting in large samples

This is a hypothetical and pretty general question. Apologies if it is too vague. Suggestions on how to better focus it are welcome.

Suppose you are interested in the relationship between one endogenous variable $y$ and a few exogenous variables $x_1,…,x_k$. The ultimate goal is forecasting new realizations of $y$ given new realizations of $x$'s. You have little clue what functional form the relationship could take.

Suppose you have a sufficiently large sample, so that you may obtain a reasonably accurate estimate of the joint probability density (by kernel density estimation or similar) of $y$ and the $x$'s.

Then you could use
(A) kernel density estimation (or some similar alternative);
(B) machine learning techniques (penalized regression like LASSO, ridge, elastic net; random forests; other)

(There are certainly other alternatives, but including those would make the question way too wide.)


  1. When would you prefer A over B and when B over A?
  2. What would be the key determinants of the choice?
  3. What main trade-offs do we face?

Feel free to comment on special cases and add your own assumptions.

(First off, I'd consider kernel density estimation a form of machine learning, so this is a strange dichotomy to draw. But anyway.)

If you really do have enough samples to do good density estimation, then the Bayes classifier formed via KDE, or its regression analogue the Nadaraya-Watson estimator, converges to the optimal model. Any drawbacks of this approach are then purely computational. (Naive KDE requires comparing each test point with every single training point, though you can do much better than that with clever data structures.) The other problem is bandwidth selection, which is notoriously hard in general, but with a good enough training set this too becomes only a computational issue.
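As a minimal numpy sketch of the Nadaraya-Watson idea (the data, bandwidth, and function names here are all illustrative, not from the original post): the prediction at a query point is just a kernel-weighted average of the training targets.

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, bandwidth):
    """Kernel regression: a Gaussian-weighted average of the training
    targets, with weights centred on each query point."""
    # Pairwise squared distances between query and training points.
    d2 = (x_query[:, None] - x_train[None, :]) ** 2
    w = np.exp(-d2 / (2 * bandwidth ** 2))
    return (w @ y_train) / w.sum(axis=1)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 2000)
y = np.sin(x) + rng.normal(0, 0.1, x.size)

x_new = np.array([-1.0, 0.0, 2.0])
pred = nadaraya_watson(x, y, x_new, bandwidth=0.2)
# With this much data the predictions track sin(x) closely.
```

With 2000 points in one dimension, the smoother recovers the target function well; the quadratic cost of the pairwise-distance matrix is exactly the computational drawback mentioned above.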

In practice, however, you rarely actually have a good enough sample to perform highly accurate density estimation. Some issues:

  • As the dimension increases, KDE rapidly needs many more samples; vanilla KDE is rarely useful beyond the order of 10 dimensions.
  • Even in low dimensions, a density estimation-based model has essentially no ability to generalize; if your test set has any examples outside the support of your training distribution, you're likely screwed.
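The second bullet is easy to demonstrate with a toy 1-D KDE (a hand-rolled sketch; the helper name and numbers are mine): far from the training support, every kernel weight underflows and the estimated density is exactly zero, so the model simply has nothing to say there.

```python
import numpy as np

def gaussian_kde_1d(x_train, x_query, bandwidth):
    """Plain 1-D Gaussian kernel density estimate at the query points."""
    d2 = (x_query[:, None] - x_train[None, :]) ** 2
    k = np.exp(-d2 / (2 * bandwidth ** 2))
    return k.mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 5000)          # training support is roughly [-4, 4]

inside = gaussian_kde_1d(x, np.array([0.0]), 0.1)    # well inside the support
outside = gaussian_kde_1d(x, np.array([10.0]), 0.1)  # far outside it
# `inside` is close to the true N(0,1) density at 0 (about 0.40);
# `outside` underflows to exactly 0.
```

Any classifier or regressor built on such densities inherits the same blind spot outside the training distribution.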

The reason for these drawbacks is that density estimation-type models assume only that the function being learned is fairly smooth (with respect to the kernel). Other models, by making stronger assumptions, can learn from many fewer training points when those assumptions are reasonably well met. If you think it's likely that the function you're trying to learn is more or less a sparse linear function of its inputs, then LASSO will be much better at learning that model from a given number of samples than KDE. But if it turns out to be $f(x) = \begin{cases} 1 & \lVert x \rVert > 1 \\ 0 & \text{otherwise,} \end{cases}$ then LASSO will do essentially nothing and KDE will learn more or less the right model pretty quickly.
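A quick numerical illustration of that last point (my own toy setup; I use plain least squares as a stand-in for the penalized linear family, since LASSO with a small penalty behaves the same way on this target, and a Nadaraya-Watson smoother as the KDE-style model):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, (4000, 2))
y = (np.linalg.norm(X, axis=1) > 1).astype(float)   # the radial target above
X_tr, y_tr, X_te, y_te = X[:2000], y[:2000], X[2000:], y[2000:]

# Linear model: no linear function of the inputs can represent a
# circular decision boundary, so it collapses to the majority class.
A_tr = np.column_stack([X_tr, np.ones(len(X_tr))])
beta, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
lin_pred = np.column_stack([X_te, np.ones(len(X_te))]) @ beta

# Nadaraya-Watson smoother: kernel-weighted average of training labels.
d2 = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
w = np.exp(-d2 / (2 * 0.2 ** 2))
nw_pred = (w @ y_tr) / w.sum(axis=1)

acc_lin = np.mean((lin_pred > 0.5) == y_te)
acc_nw = np.mean((nw_pred > 0.5) == y_te)
# The kernel model recovers the circle; the linear model cannot beat
# always predicting the more common class.
```

Swap the target for a sparse linear function of many inputs and the ranking reverses, which is the trade-off the paragraph above describes.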
