Should dimensionality reduction for visualization be considered a "closed" problem, solved by t-SNE?

I've been reading a lot about the $t$-SNE algorithm for dimensionality reduction. I'm very impressed with its performance on "classic" datasets, like MNIST, where it achieves a clear separation of the digits (see the original article):


I've also used it to visualize the features learnt by a neural network I'm training, and I was very pleased with the results.

So, as I understand it:

$t$-SNE gives good results on most datasets and has a fairly efficient implementation: $O(n \log n)$ with the Barnes-Hut approximation. Could we then say that the "dimensionality reduction" problem, at least for the purpose of creating good 2D/3D visualizations, is now a "closed" problem?
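For reference, the Barnes-Hut variant is what scikit-learn runs by default; a minimal sketch (the toy data via `make_blobs` is my own illustration, not from any particular dataset):

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Toy data standing in for a real high-dimensional dataset
X, y = make_blobs(n_samples=200, n_features=50, centers=5, random_state=0)

# method="barnes_hut" is the O(n log n) approximation;
# method="exact" would be the O(n^2) vanilla algorithm
tsne = TSNE(n_components=2, perplexity=30, method="barnes_hut", random_state=0)
embedding = tsne.fit_transform(X)
print(embedding.shape)  # (200, 2)
```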

I'm aware that this is a pretty bold statement. I'm interested in understanding what the potential "pitfalls" of this method are. That is, are there any cases in which we know that it is not useful? Moreover, what are the "open" problems in this field?

Definitely not.

I agree that t-SNE is an amazing algorithm that works extremely well and that was a real breakthrough at the time. However:

  • it does have serious shortcomings;
  • some of the shortcomings must be solvable;
  • there already are algorithms that perform noticeably better in some cases;
  • many of t-SNE's properties are still poorly understood.

Somebody linked to a very popular account of some shortcomings of t-SNE (+1), but it only discusses very simple toy datasets, and I find that it does not correspond very well to the problems one faces in practice when working with t-SNE and related algorithms on real-world data. For example:

  1. t-SNE often fails to preserve global structure of the dataset;
  2. t-SNE tends to suffer from "overcrowding" when $N$ grows above ~100k;
  3. Barnes-Hut runtime is too slow for large $N$.

I will briefly discuss all three below.

  1. t-SNE often fails to preserve global structure of the dataset.

    Consider this single-cell RNA-seq dataset from the Allen Institute (mouse cortical cells): it has ~23k cells. We know a priori that this dataset has a lot of meaningful hierarchical structure, and this is confirmed by hierarchical clustering. There are neurons and non-neural cells (glia, astrocytes, etc.). Among neurons, there are excitatory and inhibitory neurons — two very different groups. Among, e.g., the inhibitory neurons, there are several major groups: Pvalb-expressing, Sst-expressing, and Vip-expressing. Within each of these groups, there seem to be multiple further clusters. This is reflected in the hierarchical clustering tree. But here is t-SNE, taken from the link above:

    [figure: t-SNE embedding of the Allen Institute dataset, colored by cell class]

    Non-neural cells are in grey/brown/black. Excitatory neurons are in blue/teal/green. Inhibitory neurons are in orange/red/purple. One would want these major groups to stick together, but this does not happen: once t-SNE separates a group into several clusters, they can end up being positioned arbitrarily. The hierarchical structure of the dataset is lost.

    I think this should be a solvable problem, but I am not aware of any good principled developments, despite some recent work in this direction (including my own).
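One simple heuristic (a mitigation, not a principled solution) is to initialize the embedding with PCA rather than at random, so that the large-scale arrangement of clusters inherits some global structure from PCA before t-SNE starts pulling things apart. A sketch with scikit-learn (the toy data is my own illustration):

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

X, _ = make_blobs(n_samples=300, n_features=50, centers=6, random_state=1)

# init="pca" seeds t-SNE with the first two principal components, which
# tends to keep related clusters near each other in the final embedding;
# init="random" gives t-SNE no global information to start from.
emb_pca = TSNE(n_components=2, init="pca", random_state=1).fit_transform(X)
emb_rnd = TSNE(n_components=2, init="random", random_state=1).fit_transform(X)
```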

  2. t-SNE tends to suffer from "overcrowding" when $N$ grows above ~100k.

    t-SNE works very well on the MNIST data. But consider this (taken from this paper):

    [figure: t-SNE embedding of ~1 mln data points, from the paper cited above]

    With 1 mln data points, all clusters get clumped together (the exact reason for this is not very clear), and the only known way to counteract this is with some dirty hacks, as shown above. I know from experience that this happens with other similarly large datasets as well.

    One can arguably see this with MNIST itself ($N$ = 70k). Take a look:

    [figure: MNIST ($N$ = 70k) embedded with UMAP (left) and t-SNE (right)]

    On the right is t-SNE. On the left is UMAP, an exciting new method under active development that is very similar to the older LargeVis. UMAP/LargeVis pull clusters much further apart. The exact reason for this is, IMHO, unclear; I would say there is still a lot to understand here, and possibly a lot to improve.

  3. Barnes-Hut runtime is too slow for large $N$.

    Vanilla t-SNE is unusable for $N$ over ~10k. Until recently, the standard solution was Barnes-Hut t-SNE; however, for $N$ closer to ~1 mln it becomes painfully slow. This is one of the big selling points of UMAP, but actually a recent paper suggested FFT-accelerated t-SNE (FIt-SNE), which works much faster than Barnes-Hut t-SNE and is at least as fast as UMAP. I recommend that everybody use this implementation from now on.

    [figure: runtime comparison]

    So this might not exactly be an open problem anymore, but it was until very recently, and I guess there is room for further improvements in runtime, so work can certainly continue in this direction.
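If you want to see the scaling behaviour on your own machine, a rough timing sketch (tiny sizes so it finishes quickly; extrapolating the trend to $N$ ~ 1 mln is what makes the cost painful in practice):

```python
import time

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Time Barnes-Hut t-SNE on growing random datasets; the absolute numbers
# are machine-dependent, but the growth with n is what matters here.
times = {}
for n in (100, 200, 400):
    X = rng.normal(size=(n, 50))
    t0 = time.perf_counter()
    TSNE(method="barnes_hut", perplexity=30, random_state=0).fit_transform(X)
    times[n] = time.perf_counter() - t0

print({n: round(t, 2) for n, t in times.items()})
```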
