Solved – Can a Decision Tree handle a column which is an array of strings

I have a dataset in which one of the columns (features) is an array of delay codes. Sometimes the array holds a single code and sometimes up to 5 codes. A code can appear just once in the array or multiple times.

Is there any way for a decision tree to handle a column like this?

I want to add that I am actually planning to use xgboost, but I ask about decision trees since xgboost is based on decision trees and I believe the answer can be extrapolated.

The larger part of machine learning in an applied setting is feature engineering, which describes the task of transforming something that exists in the real world (airline passengers, books, images of objects) into a "format" that a machine learning algorithm can understand. I use "format" in an extremely broad sense, not merely the idea of a "file format" like .png or .tsv.

Feature engineering rarely has an unambiguously "correct" answer. Usually, there are several alternatives which could be successful or better under particular conditions which are peculiar to whatever phenomenon you're studying. Stated another way, the person best suited to answer questions about how to represent your problem to a machine learning algorithm is the person studying the problem, i.e. you.

It sounds like your data is tabular (because you talk about "columns"), and that one "column" can contain one or more categorical variables.

This isn't a problem on its own; it's only a problem when you seek to present this data to a machine learning algorithm which anticipates that each column will contain a float.

The standard way to treat categorical data is to encode each category as a binary feature, taking 1 when the category is present and 0 otherwise.
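
As a minimal sketch of that binary encoding applied to an array-valued column, the snippet below builds one 0/1 feature per code. The delay codes ("A1", "B2", "C3"), the rows, and the helper name encode_binary are all made up for illustration; they are not taken from your data.

```python
# Hypothetical delay codes; each row holds between 1 and 5 codes.
rows = [
    ["A1"],
    ["B2", "A1", "B2"],
    ["C3", "C3", "C3", "A1", "B2"],
]

# Fix a column order from the codes seen in the data.
vocabulary = sorted({code for row in rows for code in row})


def encode_binary(row, vocabulary):
    """Return 1 for each code that appears in the row at least once, else 0."""
    present = set(row)
    return [1 if code in present else 0 for code in vocabulary]


binary_features = [encode_binary(row, vocabulary) for row in rows]

print(vocabulary)       # ['A1', 'B2', 'C3']
print(binary_features)  # [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
```

Each row becomes a fixed-length list of numbers, which is the shape a tree-based learner expects, regardless of how many codes the original array held.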

You say that some codes can appear more than once. This is where the ambiguity enters the picture. Is it sufficient to define your binary feature to indicate that the category was present 1 or more times (encoded as 1) or should you count the number of occurrences of a code (encoded as 0 or 1 or 2, etc.)? I don't know. It depends on your problem and the choice of algorithm.
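
The count alternative can be sketched the same way. Again, the codes, the rows, and the helper name encode_counts are hypothetical; the only difference from a binary encoding is that repeated codes are tallied instead of collapsed to 1.

```python
from collections import Counter

# Hypothetical delay codes; each row holds between 1 and 5 codes.
rows = [
    ["A1"],
    ["B2", "A1", "B2"],
    ["C3", "C3", "C3", "A1", "B2"],
]

# Fix a column order from the codes seen in the data.
vocabulary = sorted({code for row in rows for code in row})


def encode_counts(row, vocabulary):
    """Return how many times each code occurs in the row (0 if absent)."""
    counts = Counter(row)
    return [counts[code] for code in vocabulary]


count_features = [encode_counts(row, vocabulary) for row in rows]

print(count_features)  # [[1, 0, 0], [1, 2, 0], [1, 1, 3]]
```

Thresholding these counts at "greater than 0" recovers the binary encoding, so the count version carries strictly more information.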

You've written that you are specifically interested in using a tree-based model. Happily, if there is no benefit to encoding the data as counts, then it won't hurt the model to encode the data as counts. To understand what I mean, consider how a binary decision tree works: it constructs splits based on some threshold. If the binary encoding works for your data, the tree will split at some number between 0 and 1, and nowhere else. But if the count encoding is important, then the tree might also choose to split at another point that is not between 0 and 1 because that split improves the model. (All of the foregoing applies to the training data; out-of-sample performance could be improved or harmed by either strategy.)
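
To make the threshold argument concrete, here is a toy single-split search over one count feature. It is a simplification that assumes Gini impurity as the split criterion and midpoints between observed values as candidate thresholds; xgboost uses its own gain computation, so treat this only as an illustration of the idea. The data and the function names are invented for the example.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_threshold(values, labels):
    """Return the midpoint threshold that minimizes weighted Gini impurity."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

    def weighted_impurity(threshold):
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        return (len(left) * gini(left) + len(right) * gini(right)) / len(labels)

    return min(candidates, key=weighted_impurity)


# One count-encoded feature: how often a hypothetical code appears per row.
counts = [0, 0, 1, 1, 2, 2, 3, 3]

# Case 1: the target depends only on whether the code is present.
presence_matters = [0, 0, 1, 1, 1, 1, 1, 1]

# Case 2: the target depends on the code appearing at least twice.
repeats_matter = [0, 0, 0, 0, 1, 1, 1, 1]

print(best_threshold(counts, presence_matters))  # 0.5
print(best_threshold(counts, repeats_matter))    # 1.5
```

In the first case the chosen split sits between 0 and 1, exactly where a binary feature would be split, so nothing is lost by supplying counts. In the second case the split lands at 1.5, which a binary feature could not express.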
