How do I detect multivariate outliers?

The Mahalanobis distance (MD) on p variables is widely used as a measure for detecting multivariate outliers. Its squared value is compared with a chi-square distribution on p degrees of freedom (Rousseeuw and Van Zomeren (1990) and here).

For a p-dimensional vector, x(i), on observation i with corresponding mean vector, mean, and a sample covariance matrix, C, we have

$$MD(i) = \sqrt{(x(i) - \text{mean})^{T} \, C^{-1} \, (x(i) - \text{mean})}$$
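
As a minimal sketch of the formula above in R, the distance can be computed directly from its definition and checked against R's built-in mahalanobis() function, which returns the squared distance. The data are simulated and the names X, ctr, C_inv and md are illustrative only; they do not come from the sources cited on this page.

```r
# Simulated two-variable data; purely illustrative
set.seed(1)
X <- cbind(x1 = rnorm(50), x2 = rnorm(50))

ctr   <- colMeans(X)   # mean vector
C     <- cov(X)        # sample covariance matrix
C_inv <- solve(C)      # inverse of C

# MD(i) = sqrt((x(i) - mean)^T C^-1 (x(i) - mean)) for each row i
md <- apply(X, 1, function(xi) {
  d <- xi - ctr
  sqrt(drop(t(d) %*% C_inv %*% d))
})

# mahalanobis() returns squared distances, so square md before comparing
all.equal(unname(md^2), unname(mahalanobis(X, ctr, C)))
```

Writing out the matrix algebra is only for illustration; in practice mahalanobis() is the simpler route.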

If the squared distance, $$MD(i)^2$$, is greater than $$\chi^2(p, 0.999)$$, this suggests that observation i is an outlier. This stringent criterion, corresponding to p < 0.001, is recommended by most authors, including the oft-quoted Tabachnick and Fidell (1996) and Hair et al. (1998). Hair et al., rather than using the chi-square distribution, divide the Mahalanobis distance by the number of variables and compare the result to a t distribution with df equal to the number of variables.

MD can also be saved for each observation using the /SAVE subcommand of the linear regression procedure in SPSS (see, for example, the syntax below, which computes MD using two variables, x1 and x2, and also the worked example here). Mahalanobis distances may also be computed using DeCarlo's (1997) SPSS macro from here.

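* Create an arbitrary dependent variable (the case number). The saved distances depend only on x1 and x2.
COMPUTE Y=$CASENUM.
EXE.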
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT y
  /METHOD=ENTER x1 x2
  /SAVE MAHAL .

There is also a function, mahalanobis(), in R which computes Mahalanobis distances; note that it returns the squared distance. Its use in multivariate outlier detection in R is illustrated here and reproduced below for a data frame called df:

# Calculate Mahalanobis Distance with height and weight distributions
m_dist <- mahalanobis(df[, 1:2], colMeans(df[, 1:2]), cov(df[, 1:2]))
df$m_dist <- round(m_dist, 2)

# Mahalanobis Outliers - Threshold set to 12
df$outlier_maha <- "No"
df$outlier_maha[df$m_dist > 12] <- "Yes"

The cut-off of 12 used in the above R example is somewhat lower (more liberal) than $$\chi^2(2, 0.999) = 13.82$$; for p = 2 it corresponds to an upper-tail probability of about 0.0025.
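
Rather than hard-coding a threshold, the cut-off can be taken from the chi-square quantile function. The following is a sketch under assumptions: the data frame dat and its height and weight columns are simulated stand-ins for the df used above, and the 0.999 level is the one recommended earlier on this page.

```r
# Simulated stand-in for the height/weight data; purely illustrative
set.seed(2)
dat <- data.frame(height = rnorm(100, 170, 10),
                  weight = rnorm(100, 70, 12))

p  <- ncol(dat)                                   # number of variables (2)
d2 <- mahalanobis(dat, colMeans(dat), cov(dat))   # squared distances

# Upper 0.1% point of the chi-square distribution on p degrees of freedom
cutoff <- qchisq(0.999, df = p)
round(cutoff, 2)                                  # 13.82

# Flag cases whose squared distance exceeds the cut-off
dat$outlier_maha <- ifelse(d2 > cutoff, "Yes", "No")

# Upper-tail probability implied by a fixed threshold of 12
tail_prob_12 <- pchisq(12, df = p, lower.tail = FALSE)
round(tail_prob_12, 4)                            # 0.0025
```

Deriving the cut-off from qchisq() keeps it consistent if the number of variables changes.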

Tabachnick and Fidell also suggest converting the distances into leverage values and applying cut-offs to those.
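
The conversion is not spelled out above. One standard relation, for a regression model that includes an intercept, is leverage h(i) = MD(i)^2 / (n - 1) + 1/n. The sketch below checks this numerically in R on simulated data. The variable names and the arbitrary response y (which mimics the case-number trick in the SPSS syntax) are assumptions made for illustration, and no particular leverage cut-off is implied.

```r
# Simulated predictors; purely illustrative
set.seed(3)
n   <- 40
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))

# Squared Mahalanobis distances on the two predictors
d2 <- mahalanobis(dat, colMeans(dat), cov(dat))

# Leverage from the squared distance (model with an intercept)
h <- d2 / (n - 1) + 1 / n

# Compare with the hat values from a regression on the same predictors;
# the response is arbitrary because leverage depends only on x1 and x2
dat$y <- seq_len(n)
fit   <- lm(y ~ x1 + x2, data = dat)
all.equal(unname(h), unname(hatvalues(fit)))
```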

Filzmoser et al. (2003), however, propose using an adjusted form of p which takes into account the total sample size, n. The critical value then becomes $$\chi^2(p_{adj}, 0.975)$$ where

$$p_{adj} = \frac{0.24 - 0.003\,p}{\sqrt{n}} \quad \text{for } p \le 10$$

or

$$p_{adj} = \frac{0.25 - 0.0018\,p}{\sqrt{n}} \quad \text{for } p > 10$$

This method tends to increase the number of outliers as sample size goes up and is less conservative than the Rousseeuw and Van Zomeren approach.
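
To show how padj varies with n and p, here is a small R sketch that only evaluates the two-branch formula as written above. The function name padj and its arguments are illustrative, and the sketch does not implement the full outlier rule of Filzmoser et al.; how padj enters the critical value should be checked against their report.

```r
# Adjusted quantity padj, transcribed from the two-branch formula above
# (function and argument names are illustrative)
padj <- function(n, p) {
  ifelse(p <= 10,
         (0.24 - 0.003 * p) / sqrt(n),
         (0.25 - 0.0018 * p) / sqrt(n))
}

padj(n = 100, p = 5)    # (0.24 - 0.015) / 10 = 0.0225
padj(n = 400, p = 5)    # halves when n is quadrupled: 0.01125
padj(n = 100, p = 20)   # (0.25 - 0.036) / 10 = 0.0214
```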

Weir and Murray (2011) also mention a simpler version of the Mahalanobis distance that can be used to detect aberrant cases when looking for fraudulent data in clinical trials. This approach computes, for each person or case, the sum of p squared standardised values (squared z-scores, where z-score = [individual score minus the variable mean] / variable standard deviation) and compares it to a chi-square distribution on p degrees of freedom. They further use this measure to detect 'inliers', that is, cases whose values lie very close to the averages, by plotting the distribution of the (natural) logs of these sums of squared standardised values and looking graphically for any pattern of suspiciously 'typical' cases.
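
The calculation described above can be sketched in R as follows. The data are simulated, the variable names are illustrative, and the 0.999 chi-square level is borrowed from earlier on this page rather than taken from Weir and Murray.

```r
# Simulated data on three variables; purely illustrative
set.seed(4)
dat <- data.frame(v1 = rnorm(200), v2 = rnorm(200), v3 = rnorm(200))
p   <- ncol(dat)

# z-scores: (score - variable mean) / variable standard deviation
z <- scale(dat)

# Sum of the p squared z-scores for each case
ss <- rowSums(z^2)

# Outliers: compare with a chi-square distribution on p degrees of freedom
# (the 0.999 level is an assumption carried over from the MD criterion)
outlier <- ss > qchisq(0.999, df = p)

# Inliers: examine the distribution of the natural logs of the sums;
# an excess of very small values would suggest suspiciously typical cases
log_ss <- log(ss)
h <- hist(log_ss, plot = FALSE)   # use plot(h) to draw the histogram
```

Unlike the Mahalanobis distance, this sum ignores correlations between the variables, which is what makes it the simpler version.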

References

Barnett V & Lewis T (1978) Outliers in statistical data. Wiley: New York. This book features a table of critical values for Mahalanobis distances that suggests thresholds for outliers. The table is indexed by number of predictors and sample size; e.g. for N=100 and fewer than 3 predictors, Mahalanobis distances greater than 15 are outlying, and for N=30 and 2 predictors, values > 11 are outlying.

Filzmoser P, Reimann C and Garrett RG (2003) Multivariate outlier detection in exploration geochemistry. Technical report TS 03-5, Department of Statistics, Vienna University of Technology, Austria.

Penny KI (1996) Appropriate critical values when testing for a single multivariate outlier by using the Mahalanobis distance. Applied Statistics 45(1) 73-81.

Rousseeuw PJ, Van Zomeren BC (1990) Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association 85(411), 633-651.

Weir C and Murray G (2011) Fraud in clinical trials. Detecting and preventing it. Significance 8(4) 164-168.