This is a question about classification. I am a neuroscience student with little experience of classification methods and I'd be grateful for any advice about the best way to implement a linear classifier (LDA) on this data.

I have a magnetoencephalography dataset, recorded from people as they perform a cognitive task. This has the following properties:

306 channels of data, but a preprocessing step reduced the dimensionality to 64 components and then reprojected the data onto the sensors.

The data is fine-grained in the time domain (1000Hz)

The data is chopped up into short segments termed 'trials'. These correspond to the cognitive task the subjects were performing. The trials can differ: for example, in trial type A (of which I will have ~50) the subject may have been asked to pay attention to stimuli on the left-hand side of space, and in trial type B (of which I also have 50) to pay attention to stimuli on the right-hand side of space.

I want to classify my data at particular timepoints within trials (e.g. 0.5 seconds after subjects are told to attend left/right), training the classifier to discriminate between trials of type A and type B. The feature vector for each observation (i.e. trial) is the vector of instantaneous activity at each sensor at that timepoint. I have more features than trials (306 sensors) so I either need to do feature selection or use regularized LDA (or use something like Hastie's sparse discriminant analysis which seems to me to do both).
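As a rough illustration of the regularized-LDA option, here is a minimal sketch using scikit-learn's shrinkage LDA. The data are synthetic stand-ins (100 trials by 306 sensors, with an artificial class difference injected on 10 sensors); the shapes, effect size, and variable names are assumptions for the example, not properties of the real recording.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for one timepoint: 100 trials x 306 sensors (assumed shapes).
n_per_class, n_sensors = 50, 306
X = rng.standard_normal((2 * n_per_class, n_sensors))
y = np.repeat([0, 1], n_per_class)      # 0 = attend left, 1 = attend right
X[y == 1, :10] += 1.0                   # artificial class difference on 10 sensors

# solver="lsqr" with shrinkage="auto" uses a Ledoit-Wolf shrinkage estimate of
# the covariance, which stays invertible even with more features than trials.
clf = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
scores = cross_val_score(clf, X, y, cv=5)
mean_accuracy = scores.mean()
print(f"5-fold accuracy: {mean_accuracy:.2f}")
```

The shrinkage strength is chosen automatically here; a fixed value between 0 and 1 can be passed instead if it should be tuned by cross-validation.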

This I could do, BUT it seems like it throws away the information about the statistical structure of the data contained in the datapoints NOT from the exact time I am trying to classify. Also, I know the data has a lower dimensionality than 306 – less than 50 components will likely capture the vast majority of the variance in the data.

I was thinking therefore about using a dimensionality reduction step, probably PCA, before passing the reduced-dimensionality data to an unregularized LDA or naive bayes classifier. **The idea is that the dimensionality reduction step exploits the fact that I have a large amount of data sampled over time.**
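The idea of estimating the PCA from all time samples and then classifying at a single timepoint might look roughly like the sketch below. It assumes the data are held in an array `epochs` of shape (trials, sensors, time samples); that layout, the synthetic low-rank data, the 99% variance cutoff, and the index `t_idx` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical shapes: 100 trials, 306 sensors, 50 time samples per trial.
n_trials, n_sensors, n_times, n_sources = 100, 306, 50, 20
mixing = rng.standard_normal((n_sensors, n_sources))        # low-rank structure
sources = rng.standard_normal((n_trials, n_sources, n_times))
epochs = np.einsum("sk,nkt->nst", mixing, sources)
epochs += 0.1 * rng.standard_normal(epochs.shape)            # sensor noise

# Pool every time sample from every trial: (n_trials * n_times, n_sensors).
pooled = epochs.transpose(0, 2, 1).reshape(-1, n_sensors)
mean = pooled.mean(axis=0)

# PCA via SVD of the centred pooled data; rows of vt are the principal axes.
_, s, vt = np.linalg.svd(pooled - mean, full_matrices=False)
explained = s**2 / np.sum(s**2)
n_keep = int(np.searchsorted(np.cumsum(explained), 0.99) + 1)
components = vt[:n_keep]

# Project only the timepoint of interest: one reduced vector per trial.
t_idx = 25
X_t = (epochs[:, :, t_idx] - mean) @ components.T
print(n_keep, X_t.shape)
```

`X_t` would then be the input to the classifier. PCA uses no labels, but when estimating accuracy the cautious choice is still to fit it only on the training trials of each cross-validation fold.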

And that's where I got confused. PCA projects the data onto orthogonal dimensions, so does it make sense to then do LDA (which estimates a covariance matrix in order to use information about the correlation between the features)? Or should I do naive Bayes following the dimensionality reduction? In which case, I could probably have just done naive Bayes to start with, given that it's not sensitive to the number of features in the way that LDA is.
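One way to probe this worry numerically: PCA decorrelates the scores with respect to the total covariance of the data it was fit on, whereas LDA works with the pooled within-class covariance. The toy two-feature example below (all values are illustrative assumptions) shows that the within-class covariance of the PCA scores can still have a substantial off-diagonal term, so there can be correlation left for LDA to exploit.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two classes sharing a correlated within-class covariance (illustrative values).
cov_within = np.array([[1.0, 0.8],
                       [0.8, 1.0]])
n_per_class = 5000
class_a = rng.multivariate_normal([0.0, 0.0], cov_within, size=n_per_class)
class_b = rng.multivariate_normal([3.0, 0.0], cov_within, size=n_per_class)
X = np.vstack([class_a, class_b])
y = np.repeat([0, 1], n_per_class)

# PCA on all data, ignoring labels: eigenvectors of the total covariance.
Xc = X - X.mean(axis=0)
_, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
scores = Xc @ eigvecs

# The total covariance of the scores is diagonal by construction...
total_offdiag = np.cov(scores, rowvar=False)[0, 1]

# ...but the pooled within-class covariance, which LDA uses, need not be.
within = 0.5 * (np.cov(scores[y == 0], rowvar=False)
                + np.cov(scores[y == 1], rowvar=False))
within_offdiag = within[0, 1]

print(f"total off-diagonal:  {total_offdiag:+.3f}")
print(f"within off-diagonal: {within_offdiag:+.3f}")
```

The same reasoning applies when the PCA is fit on samples pooled over time: the scores are uncorrelated over the pooled data, not necessarily at the single timepoint being classified.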

If anyone can advise on a good approach here, given the particular structure of this data, I'd be very grateful. The key question is: does PCA followed by LDA make sense?


#### Best Answer

PCA finds the directions (eigenvectors of the data covariance matrix) that explain most of the variance in the data; in this case it would operate on the feature vectors and takes no account of class labels. LDA maximizes Fisher's discriminant ratio (equivalently, the Mahalanobis distance between the class means), i.e. it maximizes the separation between classes relative to the spread within them.
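For reference, the two-class Fisher criterion can be written as below. The symbols are introduced here for illustration: $\boldsymbol{\mu}_A$ and $\boldsymbol{\mu}_B$ are the class means, $\mathbf{S}_W$ is the pooled within-class covariance, and $\mathbf{w}$ is the projection direction.

$$
J(\mathbf{w}) = \frac{\left(\mathbf{w}^\top(\boldsymbol{\mu}_A - \boldsymbol{\mu}_B)\right)^2}{\mathbf{w}^\top \mathbf{S}_W \mathbf{w}},
\qquad
\mathbf{w}^* \propto \mathbf{S}_W^{-1}(\boldsymbol{\mu}_A - \boldsymbol{\mu}_B),
\qquad
J(\mathbf{w}^*) = (\boldsymbol{\mu}_A - \boldsymbol{\mu}_B)^\top \mathbf{S}_W^{-1} (\boldsymbol{\mu}_A - \boldsymbol{\mu}_B)
$$

The maximum is the squared Mahalanobis distance between the class means. The solution needs $\mathbf{S}_W^{-1}$, and with more features than trials the sample estimate of $\mathbf{S}_W$ is singular, which is why either regularization or a prior dimensionality reduction is required.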

If you define the feature vector for each observation (case) as the data at an instantaneous time point, then the temporal components of the data are not relevant. In this case you can apply PCA as a pre-processing stage to the feature vectors to reduce dimensionality prior to classification.
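A minimal sketch of this first approach, assuming scikit-learn and a synthetic 100-trial by 306-sensor matrix for a single timepoint (the shapes, the 20 retained components, and the injected class effect are assumptions for the example):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(3)

# Synthetic stand-in: 100 trials x 306 sensors at a single timepoint.
n_per_class, n_sensors, n_sources = 50, 306, 20
mixing = rng.standard_normal((n_sources, n_sensors))
latent = rng.standard_normal((2 * n_per_class, n_sources))
y = np.repeat([0, 1], n_per_class)
latent[y == 1, 0] += 2.0                # class effect in one latent source
X = latent @ mixing + 0.1 * rng.standard_normal((2 * n_per_class, n_sensors))

# Putting PCA inside the pipeline means it is refit on the training trials
# of each fold, so held-out trials never influence the projection.
model = make_pipeline(PCA(n_components=20), LinearDiscriminantAnalysis())
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = scores.mean()
print(f"5-fold accuracy: {mean_accuracy:.2f}")
```

The number of components is a free parameter in this sketch; it has to stay below the number of training trials for the unregularized LDA covariance estimate to be invertible.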

If, however, you define each trial as a 10s epoch or segment around the point of interest, you could then calculate a summary statistic for each sensor across all time samples in the epoch. Each feature in your feature vector would then be a summary of the behaviour of one sensor over the 10s (e.g. mean amplitude across each 10s epoch). You could then apply PCA as a pre-processing step to reduce the dimensionality of the feature vector from 306 to a more manageable number.

This second approach assumes that summary statistics calculated over each 10s epoch contain more information relevant to your problem than the instantaneous features detailed above.
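Building the summary-statistic features could be sketched as follows, again assuming an `epochs` array of shape (trials, sensors, time samples). A short window of random data is used to keep the example small; at 1000 Hz a real 10s epoch would have 10,000 samples per sensor.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical epoched data: (n_trials, n_sensors, n_times).
n_trials, n_sensors, n_times = 100, 306, 100
epochs = rng.standard_normal((n_trials, n_sensors, n_times))

# One summary value per sensor per trial: the mean amplitude over the window.
features = epochs.mean(axis=2)          # shape (n_trials, n_sensors)

# Other summaries can be computed the same way, e.g. per-sensor variance.
variance_features = epochs.var(axis=2)

print(features.shape)
```

Either feature matrix has one row per trial and 306 columns, so it can be passed to PCA and then to the classifier in the same way as the instantaneous features.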
