I have a data set similar to this:

I want to know if the columns `subtype` and `item` are correlated. They contain different text, so the usual similarity methods cannot be used. I want to use some kind of hypothesis testing, but on text data. How can I find the relationship using statistical methods?

#### Best Answer

You should convert `item` and `subtype` into numbers (frequencies): count how many times each **item** occurs together with each **subtype**. Topics such as TF-IDF show how such frequencies are computed; once you have the counts, the correlation can be tested statistically.
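As a minimal sketch of the counting step, the snippet below builds a table of co-occurrence counts and computes a Pearson chi-square statistic from it. The answer does not name a specific test, so the chi-square test of independence is an assumption here, chosen because it is a common choice for two categorical columns. The rows in `data` and the helper names `contingency_table` and `chi_square` are hypothetical, since the original data set is not shown.

```python
from collections import Counter

def contingency_table(pairs):
    """Count how often each (subtype, item) pair occurs."""
    counts = Counter(pairs)
    rows = sorted({s for s, _ in pairs})
    cols = sorted({i for _, i in pairs})
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two columns were independent
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical (subtype, item) rows, made up for illustration only
data = [
    ("fruit", "apple"), ("fruit", "apple"), ("fruit", "banana"),
    ("tool", "hammer"), ("tool", "hammer"), ("tool", "apple"),
]
rows, cols, table = contingency_table(data)
stat, dof = chi_square(table)
print(rows, cols, table)
print(f"chi2={stat:.3f}, dof={dof}")
```

A large statistic relative to the degrees of freedom suggests the two columns are not independent. If SciPy is available, `scipy.stats.chi2_contingency` runs this test on the same table and also returns a p-value. With very small counts like the ones above, the chi-square approximation is unreliable, so treat the numbers as illustration only.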

More importantly, finding a correlation is not an ML problem. Generally speaking, ML helps build predictive models of what will happen, whereas your question is about discovering correlations. The data should therefore be represented as numbers (counts, distances, frequencies of occurrence) so that you can look for a correlation within each category.
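Once the text labels are treated as categories, one possible way to put a number on the association and test it is sketched below. It uses Cramér's V as the strength measure and a permutation test for the p-value. Neither technique is named in the answer, so both are assumptions, and the labels in `subtype` and `item` are made up for illustration.

```python
import math
import random
from collections import Counter

def chi_square(xs, ys):
    """Pearson chi-square statistic for two equal-length lists of labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    stat = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n  # count expected under independence
            stat += (joint[(x, y)] - expected) ** 2 / expected
    return stat

def cramers_v(xs, ys):
    """Cramér's V: 0 means no association, 1 means perfect association."""
    k = min(len(set(xs)), len(set(ys))) - 1
    if k == 0:
        return 0.0  # one column is constant, so no association can be measured
    return math.sqrt(chi_square(xs, ys) / (len(xs) * k))

def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Share of random shuffles whose chi-square is at least the observed one."""
    rng = random.Random(seed)
    observed = chi_square(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # shuffling breaks any real link between columns
        if chi_square(xs, shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical columns in which every subtype maps to exactly one item
subtype = ["a"] * 10 + ["b"] * 10
item = ["x"] * 10 + ["y"] * 10

v = cramers_v(subtype, item)
p = permutation_p_value(subtype, item)
print(f"Cramér's V = {v:.2f}, permutation p-value = {p:.4f}")
```

The permutation test avoids relying on the chi-square distribution, which can matter when many subtype and item combinations occur only a few times.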
