Solved – Increasing the sample size does not help the classification performance

I am training an SVM classifier on a given document collection. I started by using 500 documents for training, then added another 500, and so on. In other words, I have three training sets of 500, 1000, and 1500 documents, and each smaller training set is a subset of the next larger one. I … Read more

Solved – Classifier for unbalanced dataset

Is there any classifier that natively supports unbalanced datasets? Or what best practices can you suggest for handling such datasets? For example, I want to solve a task called "pedestrian detection". The classical approach uses a linear SVM, but it can't handle an unbalanced dataset (lots of background examples, small number of positive examples – people). Maybe there is … Read more

Solved – KNN classifier + cross validation

How can I find the mean and standard deviation of the error rate or accuracy when performing k-fold cross validation of a K-nearest-neighbour classification model for each fold? Best Answer The mean and standard deviation of your metrics are calculated across the results of all cross validation (CV) partitions. So, if you have 10 CV partitions with … Read more
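
As a rough illustration of the answer above, the sketch below computes one accuracy value per fold and then takes the mean and standard deviation across folds. It is a minimal standard-library sketch, not the original poster's setup: the toy data, the helper names `knn_predict` and `cross_val_accuracies`, and the choice of 5 folds and 3 neighbours are all assumptions made for the example.

```python
import math
import random
import statistics
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Predict the majority label among the k nearest training points (Euclidean)."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cross_val_accuracies(X, y, k_neighbors=3, n_folds=5, seed=0):
    """Return one accuracy value per cross-validation fold."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    # Every index lands in exactly one fold.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(
            knn_predict(train_X, train_y, X[i], k_neighbors) == y[i] for i in test_idx
        )
        accuracies.append(correct / len(test_idx))
    return accuracies


# Hypothetical toy data: two well-separated 2-D clusters.
rng = random.Random(42)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
X += [(rng.gauss(4, 1), rng.gauss(4, 1)) for _ in range(50)]
y = [0] * 50 + [1] * 50

fold_acc = cross_val_accuracies(X, y)
mean_acc = statistics.mean(fold_acc)
std_acc = statistics.stdev(fold_acc)  # sample standard deviation across folds
print(f"accuracy per fold: {fold_acc}")
print(f"mean = {mean_acc:.3f}, std = {std_acc:.3f}")
```

The error rate per fold is simply one minus the accuracy, so its mean is `1 - mean_acc` and its standard deviation equals `std_acc`.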

Solved – Equal Covariance in Linear Discriminant Analysis

In an online course, we are working through some linear discriminant analysis and I've been given an example. I am having trouble with the language used by the professor as it seems I am misunderstanding the assumptions regarding covariance. The exact language for the example is as follows: "Your data is from two classes and … Read more

Solved – Mahalanobis distance on non-normal data

Mahalanobis distance, when used for classification purposes, typically assumes a multivariate normal distribution, and the squared distances from the centroid should then follow a $\chi^2$ distribution (with $d$ degrees of freedom equal to the number of dimensions/features). We can calculate the probability that a new data point belongs to the set using its Mahalanobis distance. I … Read more
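
To make the normal-theory calculation described above concrete, here is a minimal sketch restricted to two features, where the $\chi^2$ tail probability has the closed form $e^{-D^2/2}$ and no statistics library is needed. The function names, the example mean, and the example covariance are hypothetical, and the resulting probability is only meaningful if the data really are multivariate normal, which is exactly the assumption the question calls into doubt.

```python
import math


def mahalanobis_sq_2d(point, mean, cov):
    """Squared Mahalanobis distance for two features (closed-form 2x2 inverse)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if a <= 0 or det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # (dx, dy) @ inv(cov) @ (dx, dy)^T, with inv(cov) = [[d, -b], [-c, a]] / det
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def chi2_tail_2dof(d2):
    """P(chi-squared with 2 degrees of freedom >= d2), which equals exp(-d2 / 2)."""
    return math.exp(-d2 / 2.0)


# Hypothetical class centroid and covariance, for illustration only.
mean = (0.0, 0.0)
cov = ((2.0, 0.5), (0.5, 1.0))
new_point = (1.5, -1.0)

d2 = mahalanobis_sq_2d(new_point, mean, cov)
p_value = chi2_tail_2dof(d2)  # valid only under the multivariate normal assumption
print(f"squared distance = {d2:.3f}, tail probability = {p_value:.3f}")
```

A small tail probability suggests the point is unusually far from the centroid under the normal model. With more than two features the tail no longer has this simple form and a general $\chi^2$ survival function would be needed.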

Solved – Logistic regression on categorical data

I have a large dataset (around 2 million records and 300 features) with a lot of missing data. Most of the independent variables are categorical (some of these variables have more than 40 valid values). The outcome is either Y or N. The Y outcome is a rare event: around 98% of outcomes are N. I'm … Read more

Solved – Classification with partially “unknown” data

Suppose I want to learn a classifier that takes a vector of numbers as input, and gives a class label as output. My training data consists of a large number of input-output pairs. However, when I come to testing on some new data, this data is typically only partially complete. For example if the input … Read more

Solved – How to explain poor classification performance of recall when using SVM

I applied an SVM to classify several data sets. Recall turns out to be quite poor for one data set: it is around 50%, while the other data sets have recall around 80%. In this scenario, what approaches are available to improve … Read more