Solved – Random Forests for predictor importance (Matlab)

I'm working with a dataset of approximately 150,000 observations and 50 features, using SVM for the final model. To trim down the feature count, I decided to look into using RF so SVM optimization doesn't take too long. I'm currently using the TreeBagger implementation in Matlab and had a few questions. When investigating feature importances, … Read more

Solved – How to handle high dimensional feature vector in probability graph model

I was doing some NLP-related work that involves training a hidden Markov model and using the model to segment sentences. For every sentence, I translate the tokens into feature vectors. The features are hand-picked, and so far I can only think of 20. All of the features are binary. So an … Read more

Solved – ROC as feature selection

It is apparently a common practice to use ROC as a feature selection method in my new job. They test the variables one by one against the response and anything with ROC<=54 is tossed aside. Would you say that this is a good practice? I'm quite skeptical as I would have rather used ensemble learning … Read more
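A minimal sketch of the one-variable-at-a-time screening described in this question may make the procedure concrete. It assumes that "ROC<=54" means an area under the ROC curve of at most 0.54; that reading, along with the function names `auc` and `screen_by_auc` and the toy data, is hypothetical and not taken from the question.

```python
from typing import Dict, List, Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve, computed by comparing every
    positive/negative pair (a tie counts as half a win)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def screen_by_auc(
    features: Dict[str, Sequence[float]],
    labels: Sequence[int],
    threshold: float = 0.54,  # assumed reading of "ROC<=54"
) -> List[str]:
    """Keep only the features whose univariate AUC exceeds the threshold."""
    return [
        name
        for name, values in features.items()
        if auc(values, labels) > threshold
    ]


# Toy data (hypothetical): one separating feature, one uninformative feature.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "strong": [0.1, 0.2, 0.3, 0.7, 0.8, 0.9],
    "noise": [0.5, 0.1, 0.9, 0.5, 0.1, 0.9],
}
kept = screen_by_auc(features, labels)
print(kept)
```

The pairwise loop is quadratic in the number of observations, so it only suits small examples. Because each feature is scored in isolation, a variable that is useful only in combination with others would be discarded by this rule.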

Solved – Variable selection in time series data

I have an econometric dataset, 50 observations of 350 variables. They include things like GDP, unemployment, interest rates and their transformations, such as YoY change, log transform, first differences etc. I need to build an ARIMAX model, and first I need to select variables. 350 univariate regressions against the response were run, and the 20 … Read more

Solved – What happens if I train a model on a data set that includes a duplicated feature

The Question: Suppose I train a predictive model on a set of features $x_1, \dots, x_n$, but for some $i \neq j$ we have $x_i = x_j$ for every data point in the training set; i.e. one of these features is a totally redundant copy of the other. What are the consequences for learning? Does … Read more

Solved – What can be the reason to do feature selection based on variance before doing PCA

I have noticed that when applying PCA to large datasets, people often will first subset the data considerably. Sometimes people just randomly take a subset of the features/variables, but often they have a reason, largely related to removing variables they consider to be likely to be noise. A prototypical example is in the data analysis … Read more

Solved – Feature selection using cross validation

I am dealing with a typical $p > n$ problem in the medical field (typically $p \approx 3700$ and $n \approx 100$). The dependent variable is binary (healthy/sick) and features are continuous variables representing intensities of a large set of bio-markers. The bio-markers (i.e. features) are extracted from samples using a feature selection algorithm … Read more

Solved – Effect of features that are highly correlated with each other on a decision tree

I have a dataset of roughly 500 features and am training a binary classifier using GBM – gradient boosted machines, an ensemble of decision trees. Of these 500 variables, I am sure some are highly correlated with each other, though probably not to the extent where they are linearly dependent. For example, one variable might … Read more