Diff for "FAQ/RegressionOutliers" - CBU statistics Wiki
location: Diff for "FAQ/RegressionOutliers"
Differences between revisions 12 and 22 (spanning 10 versions)
Revision 12 as of 2008-01-29 10:20:08
Size: 1345
Editor: PeterWatson
Comment:
Revision 22 as of 2015-05-06 16:05:24
Size: 2154
Editor: PeterWatson
Comment:
Deletions are marked like this. Additions are marked like this.
Line 4: Line 4:
Leverage is also related to the i-th observation's [:FAQ/mahal:Mahalanobis distance], $$\mbox{MD}_text{i}$$, such that for sample size, N Leverage is also related to the i-th observation's [[FAQ/mahal|Mahalanobis distance]], MD(i), such that for sample size, N
Line 6: Line 6:
Leverage for observation i = \frac{$$\mbox{MD}_text{i}}{\mbox{N-1}} + (1/N)}$$ Leverage for observation i = MD(i)/(N-1) + 1/N
Line 10: Line 10:
Critical $$\mbox{MD}_text{i} = (\frac{\mbox{2(p+1)}}{\mbox{N}} - \frac{1}{\mbox{N}})(\mbox{N-1}) $$ Critical MD(i) = (2(p+1)/N - 1/N)(N-1)
Line 14: Line 14:
Hair, Anderson, Tatham and Black (1998) suggest Cook's distances greater than 1 are influential. Other outlier detection methods using boxplots are in the Exploratory Data Analysis Graduate talk located [[StatsCourse2009|here]] or by using z-scores using tests such as Grubb's test - further details and an on-line calculator are located [[http://www.graphpad.com/quickcalcs/Grubbs1.cfm|here.]]

Hair, Anderson, Tatham and Black (1998) suggest Cook's distances greater than 1 are influential. Hair et al mention that some people also use 4/(N-k-1) for k predictors and N points as a threshold for Cook’s distance which gives a lower threshold than 1 (e.g. with 1 predictor and 27 observations this gives 4/(27-1-1) = 0.16). A third threshold of 4/N is also mentioned (Bollen and Jackman (1990)) which would give a threshold of 4/27 = 0.14 in the above example.
Line 17: Line 19:

'''Bollen, K. A. and Jackman, R. W. (1990)''' Regression diagnostics: An expository treatment of outliers and influential cases, in Fox, John; and Long, J. Scott (eds.); Modern Methods of Data Analysis (pp. 257-91). Newbury Park, CA: Sage
Line 22: Line 27:
[wiki:FAQ Return to Statistics FAQ page] [[FAQ|Return to Statistics FAQ page]]
Line 24: Line 29:
[wiki:CbuStatistics Return to Statistics main page] [[CbuStatistics|Return to Statistics main page]]
Line 26: Line 31:
[http://www.mrc-cbu.cam.ac.uk/ Return to CBU main page] [[http://www.mrc-cbu.cam.ac.uk/|Return to CBU main page]]
Line 28: Line 33:
These pages are maintained by [mailto:ian.nimmo-smith@mrc-cbu.cam.ac.uk Ian Nimmo-Smith] and [mailto:peter.watson@mrc-cbu.cam.ac.uk Peter Watson] These pages are maintained by [[mailto:ian.nimmo-smith@mrc-cbu.cam.ac.uk|Ian Nimmo-Smith]] and [[mailto:peter.watson@mrc-cbu.cam.ac.uk|Peter Watson]]

Checking for outliers in regression

According to Hoaglin and Welsch (1978), leverage values above 2(p+1)/n, where p is the number of predictors in a regression on n observations (items), indicate influential points. If the sample size is below 30, a stricter criterion such as 3(p+1)/n is suggested.
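
As an illustration (not part of the original page; the data below are made up), a short Python sketch that computes leverages from the diagonal of the hat matrix and flags points above the Hoaglin and Welsch cut-off:

{{{
# Flag high-leverage points using the diagonal of the hat matrix,
# with the Hoaglin and Welsch (1978) cut-off 2(p+1)/n.
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 1                           # 40 observations, 1 predictor
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])   # design matrix with intercept

# Hat matrix H = X (X'X)^-1 X'; the leverages are its diagonal
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

cutoff = 2 * (p + 1) / n               # use 3*(p+1)/n when n < 30
print("high-leverage points:", np.where(leverage > cutoff)[0])
}}}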

Leverage is also related to the i-th observation's Mahalanobis distance, MD(i), such that for sample size N

Leverage for observation i = MD(i)/(N-1) + 1/N

so

Critical MD(i) = (2(p+1)/N - 1/N)(N-1)

(See Tabachnick and Fidell)
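
This relation can be checked numerically. The Python sketch below (an illustration with made-up data, not from the original page; MD(i) is taken here to be the squared Mahalanobis distance from the predictor mean, as in Tabachnick and Fidell) verifies the identity and computes the critical MD(i):

{{{
# Verify leverage_i = MD(i)/(N-1) + 1/N for a one-predictor regression,
# where MD(i) is the squared Mahalanobis distance of observation i
# from the predictor mean.
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 1
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
leverage = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

md = (x - x.mean()) ** 2 / np.var(x, ddof=1)   # squared Mahalanobis distance

print(np.allclose(leverage, md / (n - 1) + 1 / n))            # True
print("critical MD(i):", (2 * (p + 1) / n - 1 / n) * (n - 1))
}}}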

Other outlier detection methods include boxplots, which are covered in the Exploratory Data Analysis graduate talk located here, and z-score based tests such as Grubbs' test; further details and an on-line calculator are located at http://www.graphpad.com/quickcalcs/Grubbs1.cfm.
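
For example, a minimal Python implementation of Grubbs' test (a sketch, not taken from the calculator linked above; the sample values are invented) computes the test statistic and its two-sided critical value from the t distribution:

{{{
# Grubbs' test for a single outlier: G = max|x - mean| / sd, compared
# with the standard t-distribution-based critical value.
import numpy as np
from scipy import stats

def grubbs_test(values, alpha=0.05):
    values = np.asarray(values, dtype=float)
    n = len(values)
    g = np.max(np.abs(values - values.mean())) / values.std(ddof=1)
    # two-sided critical value
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t**2 / (n - 2 + t**2))
    return g, g_crit

g, g_crit = grubbs_test([2.1, 2.3, 1.9, 2.2, 2.0, 5.6])
print(f"G = {g:.2f}, critical = {g_crit:.2f}, outlier: {g > g_crit}")
}}}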

Hair, Anderson, Tatham and Black (1998) suggest Cook's distances greater than 1 are influential. Hair et al. mention that some people also use 4/(N-k-1), for k predictors and N points, as a threshold for Cook's distance, which gives a lower threshold than 1 (e.g. with 1 predictor and 27 observations this gives 4/(27-1-1) = 0.16). A third threshold of 4/N is also mentioned (Bollen and Jackman, 1990), which would give a threshold of 4/27 = 0.15 in the above example.
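
The sketch below (an illustration with made-up data, not from Hair et al.) computes Cook's distances directly from the residuals and leverages and counts how many points each of the three thresholds flags:

{{{
# Cook's distance D_i = e_i^2 h_i / ((k+1) s^2 (1 - h_i)^2), compared
# against the thresholds 1, 4/(N-k-1) and 4/N discussed above.
import numpy as np

rng = np.random.default_rng(0)
n, k = 27, 1                                  # the example in the text
x = rng.normal(size=n)
y = 2 + 3 * x + rng.normal(size=n)
y[0] += 8                                     # plant an influential point
X = np.column_stack([np.ones(n), x])

beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
leverage = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
s2 = resid @ resid / (n - k - 1)              # residual variance

cooks = resid**2 * leverage / ((k + 1) * s2 * (1 - leverage)**2)

for label, thr in [("1", 1.0), ("4/(N-k-1)", 4/(n-k-1)), ("4/N", 4/n)]:
    print(f"threshold {label} = {thr:.2f}: flags {(cooks > thr).sum()} points")
}}}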

References

Bollen, K. A. and Jackman, R. W. (1990). Regression diagnostics: An expository treatment of outliers and influential cases. In Fox, J. and Long, J. S. (eds.), Modern Methods of Data Analysis (pp. 257-291). Newbury Park, CA: Sage.

Hair, J., Anderson, R., Tatham, R. and Black W. (1998). Multivariate Data Analysis (fifth edition). Englewood Cliffs, NJ: Prentice-Hall.

Hoaglin, D. C. and Welsch, R. E. (1978). The hat matrix in regression and ANOVA. The American Statistician 32, 17-22.

Return to Statistics FAQ page

Return to Statistics main page

Return to CBU main page

These pages are maintained by Ian Nimmo-Smith and Peter Watson