Solved – the best tool for customer segmentation

I have a customer data set with the following data: the number of purchases that each customer made; the date of each purchase; the date they signed up; and the amount they spent on each purchase. I want to segment my users into three groups: Great Customers, OK Customers, and Bad Customers. Is there … Read more

Solved – Is it OK to use correlated variables for cluster analysis

I know there is a series of regression diagnostic procedures (correlation, beta, residual, etc.) to run before, during, and after regression analysis. But is there a common procedure to follow for cluster analysis (e.g., Ward's method)? What are the R commands? Thanks! Best Answer: Correlation can cause problems with many clustering algorithms by giving extra weight to these … Read more

Solved – Best BIC value for K-means clusters

I am using code from "Using BIC to estimate the number of k in KMEANS" (answer by Prabhath Nanisetty) to find BIC values for K-means using different numbers of components. However, using the iris dataset, I get the following results:

N_clusters  BIC
1           -863.896405
2           -674.133038
3           -616.557809
4           -603.357368
5           -582.428798
6           -596.073710
7           -590.086212
8           … Read more

Solved – How to interpret these indices/metrics for comparing partitions intuitively out of these images

Two sets of comparisons were performed between the original clustering and the new clustering, using several indices and performance metrics. Below are the two initial clusterings or partitions (these should be treated as the true, original partitions). There are two because samples were taken from two different locations, which is why we always have two sets … Read more

Solved – Multivariate time series clustering

I am collecting a group of multivariate time series. For example, there are 2000 time series, each with 12 dimensions. Are there any systematic models or algorithms that can cluster multivariate time series? For instance, I would like to identify time series that are very different from the others. Moreover, for online monitoring, … Read more

Solved – What are some clustering algorithms that work with cosine similarity as a distance measure

I am trying to find clustering algorithms that can work with cosine similarity for tweet classification. Best Answer: Spherical k-means is the classical example. But really, any clustering algorithm that can take an arbitrary distance measure should be applicable, including DBSCAN. For example, scikit-learn's implementation lets you choose the distance metric for DBSCAN, where you can … Read more

Solved – Correcting standard errors when the independent variables are autocorrelated

I have a question about how to correct standard errors when the independent variable is autocorrelated. In a simple time series setting, we can use the Newey-West covariance matrix with a number of lags, which takes care of the problem of correlation in the residuals. What does one do in a panel data setting? … Read more

Solved – Inconsistency in calculating the Calinski-Harabasz index for a given clustering in R

I am interested in determining the optimal number of clusters for the PAM clustering algorithm using the Calinski-Harabasz (CH) index. To that end, I found two R functions that calculate CH values for a given clustering, but they return different results: ?cluster.stats (in the fpc package) and ?index.G1 (in the clusterSim package). The first one … Read more

Solved – What are the “hot algorithms” for machine learning

This is a naive question from someone starting to learn machine learning. These days I'm reading the book "Machine Learning: An Algorithmic Perspective" by Marsland. I find it useful as an introductory book, but now I would like to move on to advanced algorithms, those that are currently giving the best results. I'm mostly interested in … Read more