I have a breast cancer dataset with 13 features. I instantiate PCA, fit it on the training data, and transform the data down to 2 dimensions.
The result is data that looks linearly separable, which is interesting to me since I'm doing binary logistic regression.
I'm writing an article, and showing the data with a decision boundary would be a good image to demonstrate that the model worked.
I train the logistic regression model on the 2-d data from the PCA. I plot the decision boundary using the intercept and coefficients, and it does linearly separate the data.
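For reference, this is roughly what the pipeline looks like (a minimal sketch assuming scikit-learn; I'm using the built-in breast cancer dataset as a stand-in for my data, and all variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Stand-in data: my real dataset has 13 features, this built-in one has 30
X, y = load_breast_cancer(return_X_y=True)

# Project the features down to 2 dimensions
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Fit a binary logistic regression on the 2 PCA components
clf = LogisticRegression(max_iter=1000)
clf.fit(X_pca, y)

# Decision boundary: the line where w1*z1 + w2*z2 + b = 0,
# i.e. z2 = -(w1*z1 + b) / w2, built from the coefficients and intercept
w1, w2 = clf.coef_[0]
b = clf.intercept_[0]
xs = np.linspace(X_pca[:, 0].min(), X_pca[:, 0].max(), 200)
ys = -(w1 * xs + b) / w2
```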
My question is: what does the plot of the 2-d PCA data and the decision boundary mean? What can I deduce from it?
Jupyter notebook of my work (it's the very last plot)
Edit: plot with the target class
Plot with predict on x_pca
Best Answer
What does the 2-d PCA data/plot mean?
The 2-d PCA data/plot represents two "compound features" that PCA created to capture as much of the variance in your original 13 features as possible.
Assuming your 13 features are linearly independent (i.e. no feature is an exact linear combination of the others, e.g. one feature being another feature times 2 for every row in your data), it would take 13 dimensions to capture 100% of the variation in your 13 raw features. However, PCA can often capture, say, 98% of the variation in your data in just a few PCA dimensions. To see how much of the variance each PCA dimension explains for your problem, print the explained_variance_ratio_ attribute of your PCA object after you fit() it.
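For example, using the sketch from the question (note that explained_variance_ratio_ lives on the fitted PCA estimator, not on the transformed array):

```python
# pca is the fitted PCA estimator from the sketch above, not the transformed array X_pca
print(pca.explained_variance_ratio_)        # fraction of total variance captured by each component
print(pca.explained_variance_ratio_.sum())  # total fraction captured by the 2 components
```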
When PCA can capture a large amount of the variance of your features in just 2 dimensions, that's especially convenient because then you can plot those 2 PCA dimensions as you have, and know that any groupings which show up on the 2-d plot correspond to natural groupings in your 13-dimensional data.
What does the decision boundary mean?
The decision boundary in your code comes from a model that predicts your target variable, using as features (independent variables) the first two PCA dimensions of your original 13-dimensional feature set.
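In other words, the plotted line is exactly where the fitted model switches its prediction, i.e. where coef_ @ z + intercept_ changes sign. A small illustrative check, reusing the names from the sketch in the question:

```python
import numpy as np

z = np.array([[0.0, 0.0]])          # an arbitrary point in the 2-d PCA space
print(clf.decision_function(z))     # > 0 predicts one class, < 0 the other; == 0 is the boundary
print(clf.predict(z))               # the corresponding hard prediction
```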
Why is your decision boundary not in the obvious gap?
Remember the PCA dimensions were formed based only on your 13 independent variables, without looking at your target. The decision boundary is not a decision boundary between PCA clusters; it's a decision boundary that uses the PCA dimensions to predict the target.
So the fact that the decision boundary does not fall cleanly between the clusters means that the first two PCA dimensions of your 13 features do a good job of separating your target classes, but not a perfect one.
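You can quantify "good but not perfect" directly (again reusing the names from the sketch above; the exact numbers will depend on your data):

```python
from sklearn.metrics import accuracy_score

y_pred = clf.predict(X_pca)                        # class predicted from the 2 PCA coordinates alone
print(accuracy_score(y, y_pred))                   # fraction of points on the "right" side of the boundary
print((y_pred != y).sum(), "points misclassified")
```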
How to improve your plot
What you really care about is the target, right? So in your plot, don't draw all the points in red; color them by target class. Then you will have a plot that shows how well PCA and the information in your original features (represented by the distance/space between clusters on your plot) distinguish between the target classes (shown as colors on the new plot).
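A minimal version of that plot, reusing the names from the sketch in the question (matplotlib assumed; colors and labels are just illustrative):

```python
import matplotlib.pyplot as plt

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap="coolwarm", s=15)  # color each point by its target class
plt.plot(xs, ys, "k--", label="decision boundary")                 # the line built from coef_ and intercept_
plt.xlabel("PCA component 1")
plt.ylabel("PCA component 2")
plt.legend()
plt.show()
```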