I have a particular segment of temporal data for 3 days.
It looks like this:
Day_1_Hour_14 = [d1x1,d1x2,...,d1xn] and [d1y1,d1y2,...,d1yn]
Day_2_Hour_14 = [d2x1,d2x2,...,d2xn] and [d2y1,d2y2,...,d2yn]
Day_3_Hour_14 = [d3x1,d3x2,...,d3xn] and [d3y1,d3y2,...,d3yn]
When I feed these 3 days of data (1 hour each) to an unsupervised algorithm like DBSCAN, is it better to use an increasing index as the x_axis, paired with the respective y_values?
Meaning,
x_axis_total = [1,2,3,4,...,n] from d1x1 to d3xn
y_axis_total = [d1y1,d1y2,...,d1yn,d2y1,d2y2,...,d2yn,d3y1,d3y2,...,d3yn]
And then I pass x_axis_total and y_axis_total to DBSCAN.
I am quite confused about how to pass the data to the algorithm.
Or is it better to include the actual timestamp of each observation as the x_value? When I use the timestamp, the data points are aligned in straight lines from 1 to n in 4 rows in the graph, and DBSCAN does not work well when applied like this.
Any suggestions will be appreciated.
If I make the x_axis in an increasing order I get this result [Day 1 – 3 of Hour 14]:
If I use the timestamp as x_axis I get this result [Day 1 – 3 of Hour 14]:
My aim is to detect outliers in my data, which is vehicular traffic data. It is contextual; that is the reason I am selecting 4 similar days of the same hour for this example.
My main question is: am I losing any information if I do it the first way I mentioned, that is, with the x_axis in increasing order from 1 to n? I ask because using the timestamp gives me bizarre results.
Is there a better way to do it?
Best Answer
Your question would benefit from more explanation regarding the data as well as the target. I am not certain that I fully understand your question, but I will summarize my thoughts on your problem.
It seems you wish to work on the task of time-series clustering.
1) Using DBSCAN:
- You did not specify the dissimilarity measure that you are using. The quality of the clustering results depends strongly on the measure you choose to compare the time-series. A standard measure would be Euclidean Distance, yet there are quite a few reasons not to use Euclidean Distance on time-series, best explained here by Eamonn Keogh (this is a link to one of his tutorials on time-series analysis).
- Although DBSCAN is quite a wonderful algorithm, it is highly sensitive to its parameters. Therefore I would suggest starting with simpler algorithms. You can find alternatives here.
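To illustrate one of the usual objections to Euclidean Distance on time-series, here is a minimal sketch in plain Python. The three toy series are made up for illustration and are not your traffic data. Because Euclidean Distance compares the series point by point, a peak that is merely shifted by one step ends up farther from the original than a series with no peak at all.

```python
import math

def euclidean(a, b):
    """Lock-step Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy series (assumptions for illustration only).
peak = [0, 0, 5, 0, 0, 0]          # a single peak at position 2
shifted_peak = [0, 0, 0, 5, 0, 0]  # the same peak, one step later
flat = [0, 0, 0, 0, 0, 0]          # no peak at all

d_shifted = euclidean(peak, shifted_peak)
d_flat = euclidean(peak, flat)

# The shifted copy is judged less similar than the flat series.
print(d_shifted > d_flat)
```

Whether this matters for you depends on whether a small time shift should count as "the same behaviour" in your traffic data.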
2) Time-series clustering: You can cluster time-series either directly on the time-series data using dissimilarity measures such as Dynamic Time Warping (DTW) or you can transform your time-series into a feature space (such as mean, max, min, kurtosis, skewness per dimension) and use Euclidean Distance in the feature space. The choice of relevant/appropriate features depends on the nature of the data – which you did not specify.
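As a rough sketch of the two options above, the following plain-Python code computes a textbook DTW distance (absolute difference as local cost, no warping window) and a small summary-feature vector per series. The function names `dtw_distance` and `summary_features`, the toy series, and the exact feature list are my own assumptions for illustration; a dedicated library would normally be preferable for real data.

```python
import math
import statistics

def dtw_distance(a, b):
    """Classic dynamic-programming DTW with |x - y| as the local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = step + min(cost[i - 1][j],
                                    cost[i][j - 1],
                                    cost[i - 1][j - 1])
    return cost[n][m]

def summary_features(series):
    """Return [mean, max, min, skewness, excess kurtosis] of one series."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    if std == 0:
        skew = kurt = 0.0  # convention chosen here for constant series
    else:
        z = [(v - mean) / std for v in series]
        skew = statistics.fmean(v ** 3 for v in z)
        kurt = statistics.fmean(v ** 4 for v in z) - 3.0
    return [mean, max(series), min(series), skew, kurt]

# Hypothetical toy series (assumptions for illustration only).
peak = [0, 0, 5, 0, 0, 0]
shifted_peak = [0, 0, 0, 5, 0, 0]
flat = [0, 0, 0, 0, 0, 0]

# Option 1: compare the raw series with DTW.
dtw_shifted = dtw_distance(peak, shifted_peak)
dtw_flat = dtw_distance(peak, flat)

# Option 2: compare feature vectors with Euclidean Distance.
feat_shifted = math.dist(summary_features(peak), summary_features(shifted_peak))
feat_flat = math.dist(summary_features(peak), summary_features(flat))
```

In this toy case both approaches treat the shifted peak as essentially identical to the original and the flat series as different. In practice the features usually need to be scaled before computing distances, since they live on different ranges.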
If you wish to get more specific answers, please plot some time-series, explain what sort of clusters you expect and describe the origin of your data.