I recently received this email from a graduate student, and I get similar questions often enough that I thought I'd post it here:
I'm using factor analysis, multiple regression, and SEM and currently
checking statistical assumptions. I have found numerous univariate
and multivariate outliers. If I deleted them all, it would mean a
large chunk out of my sample size ($N \approx 350$). I also have
problems with non-normality, non-linearity, heteroscedasticity
(Multiple regression), and large standardised residual covariances.
I have tried reducing the influence of the outliers (allocating them a
value one unit larger/smaller than the next most extreme non-outlier
value), and transformations (mostly the variables remained skewed and
some outliers remain). When I compare original results with altered
data, there is little effect. Given this, I am wondering whether it
would be acceptable to leave the data as it is? I'm inclined to,
particularly because this data is from a non-clinical population and I
have used clinical measures.
A lot depends on where exactly the outliers occur within the model: in the indicators? In the latent variables and their measurement errors? In the exogenous variables, at the top of the causal chain? In the last case, you cannot do much, as what you really have are high-leverage influential cases rather than outliers. To control for outliers in the indicators/response variables, you need to work at the equation level, like Moustaki and Victoria-Feser (2006) did. Shooting at it with the robust covariance matrices may or may not be the right thing to do. I am referring here to the recent work by Ke-Hai Yuan and Zhiyong Zhang of Notre Dame, who tried to revive robust estimation methods as applied to structural equation modeling; see e.g. their R package rsem (which seems to rely on having EQS as the estimation engine, though, which is weird given the variety of choices within R). They've been publishing like crazy on this in the past five or so years; I've reviewed at least three papers for various journals, and frankly I am at a loss as to which one to recommend, as they all repeat each other. I have not seen this used much in applied work, although it probably should be; maybe you'd be the trendsetter!
A great diagnostic tool is the forward search method developed by Atkinson and Riani of LSE (for regression and multivariate data). This has been adapted for SEM here and here. I personally think this is really neat, but whether it could catch on in the SEM community at large, I don't know.
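To give a flavor of the idea, here is a minimal sketch of a forward search for plain linear regression in Python. It is a simplified illustration, not Atkinson and Riani's actual algorithm and not the SEM adaptations: the function name `forward_search`, the crude random-subset start, the monitored statistic (the smallest absolute residual among excluded cases) and the simulated data are all my own assumptions. The idea is to fit the model on a small, hopefully clean subset, then grow the subset one case at a time by keeping the cases with the smallest residuals; outliers tend to enter last, and the monitored statistic jumps when they are about to.

```python
import numpy as np


def forward_search(X, y, n_starts=200, seed=0):
    """Simplified forward search for linear regression (illustrative only).

    Returns (entered_at, path):
      entered_at[i] = subset size at which case i first joined the subset
      path = list of (subset size m, smallest |residual| among excluded cases)
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])  # add an intercept column
    k = Xd.shape[1]                        # number of coefficients

    # Crude robust start: among random k-point exact fits, keep the one
    # with the smallest median squared residual over all n cases.
    best_idx, best_med = None, np.inf
    for _ in range(n_starts):
        idx = rng.choice(n, size=k, replace=False)
        beta, *_ = np.linalg.lstsq(Xd[idx], y[idx], rcond=None)
        med = np.median((y - Xd @ beta) ** 2)
        if med < best_med:
            best_idx, best_med = idx, med

    subset = best_idx
    entered_at = np.full(n, -1)
    entered_at[subset] = k
    path = []
    for m in range(k, n):
        # Fit on the current subset, compute residuals for ALL cases.
        beta, *_ = np.linalg.lstsq(Xd[subset], y[subset], rcond=None)
        abs_res = np.abs(y - Xd @ beta)
        outside = np.ones(n, dtype=bool)
        outside[subset] = False
        path.append((m, abs_res[outside].min()))
        # Next subset: the m + 1 cases with the smallest residuals.
        subset = np.argsort(abs_res)[: m + 1]
        new = subset[entered_at[subset] < 0]
        entered_at[new] = m + 1
    return entered_at, path


# Simulated example (made-up data): 55 clean cases plus 5 gross outliers.
rng = np.random.default_rng(1)
n, n_out = 60, 5
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=n)
outliers = np.arange(n - n_out, n)
y[outliers] += 20.0  # shift the last five responses upward

entered_at, path = forward_search(x, y)
print("last cases to enter:", np.argsort(entered_at)[-n_out:])
```

In practice you would plot `path` (and the coefficient estimates) against the subset size and look for abrupt changes near the end of the search, rather than relying on a single cutoff.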
Frontiers in Quant Psy published a review paper on this in early 2012. Even though I am the acknowledged reviewer of this work, I am extremely reluctant to really recommend it (it barely passed my threshold of publishable work, and I simply gave up explaining the theory of robust statistics in my referee letters), but I am just not aware of anything better.