Which of the following machine learning algorithms will be affected if we apply feature scaling?

- Naïve-Bayes
- k-Nearest Neighbor (KNN)
- Support Vector Machine (SVM)
- Decision Trees
- Neural Network (NN)

#### Best Answer

**KNN** is strongly affected because predictions come from the $K$ closest samples. If one feature has large values (e.g. $\approx 1000$) and another has small values (e.g. $\approx 1$), the computed distance will be dominated by the large feature, so your predictions will effectively be decided by that feature alone.
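To make the distance argument concrete, here is a minimal sketch in plain Python. The two training points, the query, and the choice of min-max scaling are all made up for illustration; any scaler that puts the features on a comparable range would show the same effect.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_scale(rows):
    """Rescale every column to [0, 1] (assumes no column is constant)."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

def nearest(query, points):
    """Index of the point closest to the query (1-nearest neighbor)."""
    return min(range(len(points)), key=lambda i: euclidean(query, points[i]))

# Hypothetical data: feature 0 is around 1000, feature 1 is around 1.
rows = [(1000.0, 0.1), (1050.0, 0.9)]
query = (1030.0, 0.2)

# Raw features: feature 0 dominates, so the query looks closest to rows[1].
raw_nearest = nearest(query, rows)

# Scaled features: both features contribute, and rows[0] becomes the nearest.
scaled = min_max_scale(rows + [query])
scaled_nearest = nearest(scaled[-1], scaled[:-1])

print(raw_nearest, scaled_nearest)
```

On the raw data the query is matched to the second point even though its small-valued feature is almost identical to the first point's; after scaling the neighbor flips.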

**SVM** is affected because in the end you're trying to find a max-margin hyperplane separating the classes (or fitting a regression). Roughly speaking, if $\mathbf{x}_1$ and $\mathbf{x}_2$ are support vectors, we are interested in maximizing the distance between them, i.e. $\|\mathbf{x}_1-\mathbf{x}_2\|$. The elements of these vectors are the features, so if we don't want a few large-valued features dominating the distance, scaling is necessary.
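Written out per feature, the distance makes the dominance explicit. In the sketch below, $x_{1j}$ is the $j$-th feature of $\mathbf{x}_1$ and $s_j$ is an assumed per-feature scale factor (for instance a standard deviation); neither symbol appears in the original answer.

$$
\|\mathbf{x}_1-\mathbf{x}_2\|^2=\sum_j (x_{1j}-x_{2j})^2
\qquad\longrightarrow\qquad
\|\mathbf{x}_1'-\mathbf{x}_2'\|^2=\sum_j \frac{(x_{1j}-x_{2j})^2}{s_j^2}
$$

A feature whose differences are around $1000$ contributes around $10^6$ to the left-hand sum, while one whose differences are around $1$ contributes around $1$. Dividing by $s_j$ puts the terms on a comparable footing.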

**Decision Trees** don't actually need it. A tree just tries to find a threshold value for a given feature that best splits the samples. Whether you scale the feature or not, an equivalent threshold will be chosen, since the *ordering* of the values doesn't change under scaling.
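The threshold argument can be checked with a one-feature decision stump. This is only a sketch: the Gini criterion, the midpoint candidate thresholds, and the toy data are assumptions chosen for illustration, not a description of any particular tree library.

```python
def gini(labels):
    """Gini impurity for binary 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Return (threshold, left_mask) minimizing weighted Gini on one feature."""
    candidates = sorted(set(values))
    best_score, best_t = None, None
    # Try the midpoint between each pair of consecutive distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best_score is None or score < best_score:
            best_score, best_t = score, t
    return best_t, [v <= best_t for v in values]

# Hypothetical feature with large values, and binary labels.
values = [1200.0, 1500.0, 1800.0, 2600.0, 3100.0, 3500.0]
labels = [0, 0, 0, 1, 1, 1]

# Min-max scale the same feature to [0, 1].
low, span = min(values), max(values) - min(values)
scaled_values = [(v - low) / span for v in values]

raw_t, raw_mask = best_split(values, labels)
scaled_t, scaled_mask = best_split(scaled_values, labels)

# The threshold value differs, but the samples are partitioned identically.
print(raw_t, scaled_t, raw_mask == scaled_mask)
```

The chosen threshold is just the scaled image of the raw one, so the resulting split (and everything built below it) is the same.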

**Neural Networks** are certainly affected. One obvious reason is the activation functions: sigmoid and tanh, for example, have very small derivatives for large-magnitude inputs, which slows gradient-based training and can also cause numerical difficulties. By the way, the simplest NN is logistic regression, where you again deal with linear boundaries as in SVM.
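A quick numerical check of the saturation point, as a sketch: the weight `w` and the two input values are hypothetical, picked only so that one pre-activation is large and the other is small.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

w = 0.01            # hypothetical initial weight
raw_x = 1000.0      # unscaled feature value
scaled_x = 1.0      # the same feature after scaling to roughly unit size

# Pre-activation z = w * x is 10 for the raw input and 0.01 for the scaled one.
grad_raw = sigmoid_grad(w * raw_x)
grad_scaled = sigmoid_grad(w * scaled_x)

print(grad_raw, grad_scaled)
print(tanh_grad(w * raw_x), tanh_grad(w * scaled_x))
```

With the raw input the sigmoid derivative is several orders of magnitude smaller than its maximum of 0.25, while the scaled input keeps the unit in the responsive part of the curve.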

In **Naive Bayes**, the critical formula affected by the features is the (naive) likelihood $P(x|C_i)=\prod_j p(x_j|C_i)$. The probability distribution of the features is not affected by scaling: since scaling is a one-to-one map, we have $p(X_j=x_j|C_i)=p(X_j'=x_j'|C_i)$, where the prime indicates the scaled version of the variable. (For continuous features the density picks up a constant factor, but it is the same for every class, so it doesn't change which class wins.) For example, in a typical Bag of Words representation you don't scale the features; on the contrary, they have a special meaning as the counts of each word.
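As a sanity check, here is a small single-feature Gaussian Naive Bayes written from scratch. The Gaussian class-conditional model, the toy samples, and the scale factor of 1/1000 are assumptions for illustration; the answer above does not commit to a particular likelihood.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit(samples_by_class):
    """Estimate (mean, variance, prior) of one feature for each class."""
    total = sum(len(xs) for xs in samples_by_class.values())
    params = {}
    for label, xs in samples_by_class.items():
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        params[label] = (mean, var, len(xs) / total)
    return params

def posterior(x, params):
    """Class posteriors P(C_i | x) via Bayes' rule."""
    joint = {c: prior * gaussian_pdf(x, m, v) for c, (m, v, prior) in params.items()}
    norm = sum(joint.values())
    return {c: j / norm for c, j in joint.items()}

# Hypothetical training data for one large-valued feature.
raw = {0: [1000.0, 1100.0, 1200.0], 1: [1800.0, 2000.0, 2200.0]}
query = 1500.0

# Rescale the feature (and the query) by the same constant factor.
factor = 1.0 / 1000.0
scaled = {c: [x * factor for x in xs] for c, xs in raw.items()}

post_raw = posterior(query, fit(raw))
post_scaled = posterior(query * factor, fit(scaled))

# Each density changes by the same constant, which cancels in the posterior.
print(post_raw, post_scaled)
```

The two posteriors agree up to floating-point error, so the predicted class is the same with or without scaling.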
