I have a dataset of customer orders. Example:
Customer_1 07.06.2017 Order_1 Product_1
Customer_1 15.06.2017 Order_2 Product_2
Customer_1 01.09.2017 Order_2 Product_1
Customer_2 07.05.2017 Order_3 Product_3
Customer_2 07.06.2017 Order_4 Product_2
Customer_2 25.09.2017 Order_5 Product_3
Customer_2 05.12.2017 Order_5 Product_1
....
Customer_N
How can I cluster the behavior of these customers? The dataset looks like a time series, but I am struggling to find the right approach. Each customer's history has a different length, so I can't use simple clustering algorithms.
My main aim is to distinguish different customer behaviors: to find people who have started buying more frequently, people who have changed their product preferences (started buying other products), and people who tried products that were new to them but then went back to their previous products. How can I cluster such patterns of behavior?
Best Answer
Your data are timestamped event sequences. One way to cluster your customers is to compute the pairwise dissimilarities between the sequences and then feed the resulting matrix into any clustering procedure that accepts a dissimilarity matrix as input.
You can compute the pairwise dissimilarities with the optimal matching method for event sequences, OME (see Ritschard et al., 2013), which is implemented in the TraMineRextras R package, a companion of the TraMineR package.
I illustrate below how to get the dissimilarity matrix for your two example sequences. We first need to create a TraMineR event sequence object. This requires numeric ids and dates expressed as integers, so we make these transformations first. Also, I use Product as the event and ignore Order (I do not understand what it represents).
    library(TraMineRextras)

    d <- c(
      "Customer_1", "07.06.2017", "Order_1", "Product_1",
      "Customer_1", "15.06.2017", "Order_2", "Product_2",
      "Customer_1", "01.09.2017", "Order_2", "Product_1",
      "Customer_2", "07.05.2017", "Order_3", "Product_3",
      "Customer_2", "07.06.2017", "Order_4", "Product_2",
      "Customer_2", "25.09.2017", "Order_5", "Product_3",
      "Customer_2", "05.12.2017", "Order_5", "Product_1"
    )
    md <- matrix(d, nrow = 7, ncol = 4, byrow = TRUE)
    md <- as.data.frame(md)
    ## numeric ids and dates as integers (days since 1970-01-01)
    md[,1] <- as.integer(gsub("Customer_", md[,1], replacement = ""))
    md[,2] <- as.integer(as.Date(md[,2], format = "%d.%m.%Y"))
    names(md) <- c("Id", "Timestamp", "Order", "Product")
    md
    ##   Id Timestamp   Order   Product
    ## 1  1     17324 Order_1 Product_1
    ## 2  1     17332 Order_2 Product_2
    ## 3  1     17410 Order_2 Product_1
    ## 4  2     17293 Order_3 Product_3
    ## 5  2     17324 Order_4 Product_2
    ## 6  2     17434 Order_5 Product_3
    ## 7  2     17505 Order_5 Product_1

    ## Creating the event sequence object
    eseq <- seqecreate(id = md$Id, timestamp = md$Timestamp, event = md$Product)

    ## event sequences with number indicating time intervals in days
    eseq
    ## [1] 17324-(Product_1)-8-(Product_2)-78-(Product_1)
    ## [2] 17293-(Product_3)-31-(Product_2)-110-(Product_3)-71-(Product_1)
We now compute the dissimilarities between the sequences with OME:
    ## you may have to play with the parameters idcost and vparam
    idcost <- rep(1, 3)
    diss <- seqedist(eseq, idcost = idcost, vparam = .01)
    diss
    ##           [,1]      [,2]
    ## [1,] 0.0000000 0.7307344
    ## [2,] 0.7307344 0.0000000
You can then cluster your sequences by passing the diss matrix to a hierarchical clustering method (e.g., the hclust function) or to a partitioning around medoids method (see, e.g., the WeightedCluster package, which is specifically designed for sequences). Note that you may have to pass diss as a distance object, as.dist(diss).
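As a minimal sketch of that last step, here is hierarchical clustering on a dissimilarity matrix using only base R. Two sequences are too few to cluster meaningfully, so the 4 x 4 matrix below is made up purely for illustration (it is not seqedist output), and the choices of average linkage and k = 2 are assumptions you would need to tune on real data.

```r
## Hypothetical dissimilarities between four customers.
## These values are invented for illustration; in practice use
## the matrix returned by seqedist() on your full data.
ids <- paste0("Customer_", 1:4)
diss <- matrix(c(0.00, 0.20, 0.90, 0.80,
                 0.20, 0.00, 0.85, 0.95,
                 0.90, 0.85, 0.00, 0.15,
                 0.80, 0.95, 0.15, 0.00),
               nrow = 4, byrow = TRUE,
               dimnames = list(ids, ids))

## hclust() expects a "dist" object, hence as.dist()
hc <- hclust(as.dist(diss), method = "average")

## Cut the dendrogram into k = 2 groups (k is an assumption here)
groups <- cutree(hc, k = 2)
print(groups)
```

With these made-up values, Customer_1 and Customer_2 fall in one group and Customer_3 and Customer_4 in the other. On real data, plot(hc) helps when choosing the number of groups.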