Solved – Passing Time Series Data to an Unsupervised model

I have a particular segment of temporal data for 3 days.
It looks like this:

Day_1_Hour_14 = [d1x1,d1x2,...,d1xn] and [d1y1,d1y2,...,d1yn]
Day_2_Hour_14 = [d2x1,d2x2,...,d2xn] and [d2y1,d2y2,...,d2yn]
Day_3_Hour_14 = [d3x1,d3x2,...,d3xn] and [d3y1,d3y2,...,d3yn]

When I run an unsupervised algorithm such as DBSCAN on this data (one hour from each of the 3 days), is it better to use an increasing index as the x_axis, paired with the respective y_value?


x_axis_total = [1,2,3,4,...,n] from d1x1 to d3xn
y_axis_total = [d1y1,d1y2,...,d1yn,d2y1,d2y2,...,d3y1,d3y2,...,d3yn]

And then I pass x_axis_total and y_axis_total to DBSCAN.

I am quite confused about how to pass the data to the algorithm.
Or is it better to use the actual timestamp of each observation as the x_value? When I use the timestamp, the data points line up in straight lines from 1 to n in 4 rows in the graph, and DBSCAN does not work well on data laid out like this.

Any suggestions will be appreciated

If I make the x_axis in an increasing order I get this result [Day 1 – 3 of Hour 14]:
[image not available]

If I use the timestamp as x_axis I get this result [Day 1 – 3 of Hour 14]:

[image not available]

My aim is to detect outliers in my data. The data is vehicular traffic data, so it is contextual, which is why I am selecting 4 similar days of the same hour for this example.

My main question is: am I losing any information if I do it the first way I mentioned, that is, with the x_axis in increasing order from 1 to n? I ask because using the timestamp gives me bizarre results.

Is there a better way to do it?

Your question would benefit from more explanation of the data as well as the target. I am not certain that I fully understand your question, but I will summarize my thoughts on your problem.

It seems you wish to work on the task of time-series clustering.

1) Using DBSCAN:

  • You did not specify the dissimilarity measure that you are using. The quality of the clustering results depends strongly on the measure you choose to compare the time-series. A standard measure would be Euclidean Distance, yet there are quite a few reasons not to use Euclidean Distance on time-series, best explained here by Eamonn Keogh (this is a link to one of his tutorials on time-series analysis).
  • Although DBSCAN is quite a wonderful algorithm, it is highly sensitive to its parameters. Therefore I would suggest trying simpler algorithms first. You can find alternatives here.
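
To make the parameter sensitivity concrete, here is a minimal, dependency-free sketch of DBSCAN on (index, value) points. The readings, the `eps` values and `min_samples=3` are made up for illustration and are not taken from your data. In practice you would more likely use a library implementation such as scikit-learn's `sklearn.cluster.DBSCAN`; this hand-written version only exists to show the effect.

```python
import math


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN with Euclidean distance; label -1 means noise."""
    n = len(points)
    # eps-neighbourhood of every point (a point counts as its own neighbour).
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: keep growing
    return labels


# Hypothetical readings: steady traffic with one spike at index 5.
y_values = [10, 11, 10, 12, 11, 30, 10, 11, 12, 10]
points = [(float(i), float(y)) for i, y in enumerate(y_values)]

# Number of points flagged as noise for three choices of eps.
noise_by_eps = {eps: dbscan(points, eps, 3).count(-1) for eps in (1.0, 3.0, 25.0)}
print(noise_by_eps)  # {1.0: 10, 3.0: 1, 25.0: 0}

# Same readings, but x measured in seconds (one reading per minute, an
# assumption) instead of an index: every point is now farther than eps
# from all others, so everything is noise.
stretched = [(60.0 * x, y) for x, y in points]
print(dbscan(stretched, 3.0, 3).count(-1))  # 10
```

With the same data, `eps` alone decides whether everything, only the spike, or nothing is reported as an outlier. Changing the unit of the x axis has a similar effect, because Euclidean distance mixes the x and y scales, which may be part of why the timestamp version behaves so differently from the index version.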

2) Time-series clustering: You can cluster time-series either directly on the time-series data using dissimilarity measures such as Dynamic Time Warping (DTW) or you can transform your time-series into a feature space (such as mean, max, min, kurtosis, skewness per dimension) and use Euclidean Distance in the feature space. The choice of relevant/appropriate features depends on the nature of the data – which you did not specify.
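
As a rough sketch of the two options, the snippet below compares Euclidean Distance with a textbook DTW on two made-up series whose peak is shifted by one step, and then reduces each series to a few summary features. The series `day_1` and `day_2`, the function names and the choice of features are assumptions for illustration only.

```python
import math


def euclidean(a, b):
    """Point-by-point distance; both series must have the same length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b):
    """Textbook dynamic time warping with |x - y| as the local cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Cheapest way to reach (i, j): insertion, deletion or match.
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[len(a)][len(b)]


def summary_features(series):
    """Mean, max, min, skewness and excess kurtosis of one series."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series) / n
    if var == 0:
        skew, kurt = 0.0, 0.0  # constant series: higher moments undefined
    else:
        skew = sum((v - mean) ** 3 for v in series) / n / var ** 1.5
        kurt = sum((v - mean) ** 4 for v in series) / n / var ** 2 - 3.0
    return {"mean": mean, "max": max(series), "min": min(series),
            "skewness": skew, "kurtosis": kurt}


# Two hypothetical days with the same peak, shifted by one time step.
day_1 = [0, 0, 1, 5, 1, 0, 0]
day_2 = [0, 0, 0, 1, 5, 1, 0]

print(euclidean(day_1, day_2))  # about 5.83: the shift is penalised
print(dtw(day_1, day_2))        # 0.0: warping aligns the two peaks
print(summary_features(day_1))
print(summary_features(day_2))
```

Euclidean Distance treats the shifted peak as a large difference, while DTW aligns the peaks and reports none. The summary features are identical for both days because they ignore ordering entirely, which is exactly the kind of trade-off that depends on the nature of your data.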

If you wish to get more specific answers, please plot some time-series, explain what sort of clusters you expect and describe the origin of your data.
