# Can any dataset be clustered, or does there need to be some sort of pattern in the data?

If a clustering algorithm (e.g., Ward's hierarchical clustering, based on how various stimuli were rated on several continuous scales) succeeds in fulfilling its mathematical objective function on a set of data, does that suggest that there is indeed a meaningful set of clusters in the data? Or is any set of data "clusterable"? If it's the latter, how can we distinguish meaningful from non-meaningful clusterings?

It seems to me there are two different primary goals one might have in clustering a dataset:

1. Identify latent groupings
2. Data reduction

Your question implies you have #1 in mind. As other answers have pointed out, determining if the clustering represents 'real' latent groups is a very difficult task. There are a large number of different metrics that have been developed (see: How to validate a cluster solution?, and the section on evaluating clusterings in Wikipedia's clustering entry). None of the methods are perfect, however. It is generally accepted that the assessment of a clustering is subjective and based on expert judgment. Furthermore, it is worth considering that there may be no 'right answer' in reality. Consider the set {whale, monkey, banana}; whales and monkeys are both mammals whereas bananas are fruits, but monkeys and bananas are colocated geographically and monkeys eat bananas. Thus, either grouping could be 'right' depending on what you want to do with the clustering you've found.
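To make the difficulty concrete, here is a minimal sketch (assuming scikit-learn and NumPy are available) showing that a clustering algorithm will "succeed" even on data drawn uniformly at random, i.e., data with no latent groups at all, and that an internal validity metric such as the silhouette score can still come out positive:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Homogeneous data: 300 points uniform on the unit square, no latent groups.
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 2))

# k-means dutifully optimizes its objective and returns 3 "clusters".
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# The silhouette score is positive, because any spatially coherent
# partition separates points somewhat -- not because real clusters exist.
score = silhouette_score(X, labels)
print(f"silhouette on structureless data: {score:.3f}")
```

The point of the sketch is that optimizing the objective and passing an internal metric are not, by themselves, evidence of real latent groupings; that judgment requires external validation or domain expertise.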

But let me focus on #2. There may be no actual groupings, and you may not care. A traditionally common use of clustering in computer science is data reduction. A classic example is color quantization for image compression: the linked Python documentation demonstrates compressing "96,615 unique colors to 64, while preserving the overall appearance quality".
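The idea can be sketched in a few lines (assuming scikit-learn; a random pixel array stands in for a real image here): cluster the pixels in RGB space, then replace each pixel with its cluster's centroid color, so the whole image uses at most `k` colors.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in "image": 2000 random RGB pixels (a real image would be
# reshaped from (height, width, 3) to (n_pixels, 3)).
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(2000, 3)).astype(float)

# Quantize to a 64-color palette: the cluster centroids are the palette.
km = KMeans(n_clusters=64, n_init=4, random_state=0).fit(pixels)
quantized = km.cluster_centers_[km.labels_]  # each pixel -> nearest palette color

n_colors = np.unique(quantized, axis=0).shape[0]
print(f"palette size after quantization: {n_colors}")
```

Note that nothing here assumes the pixels form "real" clusters; the partition is useful purely because it compresses the data.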

Another classic application of clustering in computer science is to enhance the efficiency of searching a database and retrieving information.

The idea of reducing data is very counter-intuitive in a scientific context, though, because usually we want more data and richer information about what we're trying to study. But pure data reduction can occur in scientific contexts as well. Simply partitioning a homogeneous dataset (i.e., one with no actual clusters) can be useful in several settings. One example is blocking for experimental design. Another is identifying a small number of study units (e.g., patients) that are representative of the whole set in that they span the data space. In this way, you can get a subsample that could be studied in much greater detail (say, via structured interviews), which wouldn't be logistically possible with the full sample.

The same idea can be applied to make it possible to visualize large, complex, and high-dimensional datasets. For instance, when trying to plot longitudinal data on many patients with many measurement occasions, you will typically end up with what's called a 'spaghetti plot' (due to the resulting inability to see anything of value), but it may be possible to plot a smaller number of representative patients, yielding lines that can be individually discerned but that collectively represent the data reasonably well.

Other examples are possible, but the point is that a clustering can be successful without there being any actual cluster structure at all. You simply partition the space and find a smaller and more manageable dataset that can represent the total dataset by effectively spanning the space of the full data.
