Solved – How to find correlation between text data

I have data set similar to this:

data

I want to know whether the columns subtype and item are correlated. They contain different text, so the usual similarity methods cannot be used. I want to use some kind of hypothesis test, but on text data. How can I find the relationship using statistical methods?

You should convert "item" and "subtype" into numbers (frequencies). Each frequency is the number of times an item occurs together with a given subtype. Have a look at topics such as TF-IDF, which can help you derive these frequencies; the correlation can then be assessed statistically.

More importantly, finding a correlation is not an ML problem. Generally speaking, ML helps build predictive models of what will happen, whereas in your question you are discovering correlations. Such data should therefore be represented as numbers (counts, distances, frequencies of occurrence), and you can then look for a correlation with each category.
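As a rough illustration of the "count, then test" idea, here is a minimal sketch that builds the item-by-subtype frequency table and computes a chi-square statistic of independence plus Cramér's V. Note that the chi-square test is my assumption about which statistical method fits; the answer above does not name one. The function name `chi_square_association` and the sample rows are hypothetical, not taken from the original data set.

```python
from collections import Counter
from math import sqrt


def chi_square_association(pairs):
    """Return (chi2, dof, cramers_v) for a list of (subtype, item) pairs."""
    pairs = list(pairs)
    n = len(pairs)
    observed = Counter(pairs)                    # joint counts (contingency table)
    row_totals = Counter(s for s, _ in pairs)    # occurrences of each subtype
    col_totals = Counter(i for _, i in pairs)    # occurrences of each item

    chi2 = 0.0
    for s, row_n in row_totals.items():
        for i, col_n in col_totals.items():
            # Count expected in this cell if subtype and item were independent
            expected = row_n * col_n / n
            chi2 += (observed[(s, i)] - expected) ** 2 / expected

    r, c = len(row_totals), len(col_totals)
    dof = (r - 1) * (c - 1)
    k = min(r, c) - 1
    # Cramér's V rescales chi2 to the range 0 (no association) .. 1 (perfect)
    cramers_v = sqrt(chi2 / (n * k)) if k > 0 else 0.0
    return chi2, dof, cramers_v


# Hypothetical rows standing in for the (subtype, item) columns
rows = [
    ("fruit", "apple"), ("fruit", "banana"), ("fruit", "apple"),
    ("tool", "hammer"), ("tool", "wrench"), ("tool", "hammer"),
]
chi2, dof, v = chi_square_association(rows)
print(chi2, dof, v)
```

To turn the statistic into a p-value, compare it against a chi-square distribution with `dof` degrees of freedom; `scipy.stats.chi2_contingency` does this for you if SciPy is available. Keep in mind that the chi-square approximation is unreliable when many cells have small expected counts, which is common with text columns that have many distinct values.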
