How do I check for outliers in a simple regression with one predictor variable?
A simple way to check for outliers is to evaluate either standardized or studentized residuals and see how many have large absolute values, e.g. greater than 2. The key reason for studentizing is that the variance of a residual depends on the predictor value, so residuals at different predictor values have different variances.
This can be done as follows:
- Standardize both the response variable and the predictor variable by subtracting their means and dividing by their standard deviations; call these y(s) and x(s).
- Evaluate a Pearson or Spearman correlation, R, between y(s) and x(s). For standardized variables the Pearson R is also the least-squares slope.
- Obtain the i-th raw residual as y(si) - R x(si), where y(si) and x(si) are the i-th standardized values.
- To obtain the standardized residual just divide by the standard deviation of the residuals. The mean raw residual should be zero.
- The studentized residual may also be used to identify potential outliers. This divides the raw residual by its standard error, SE_RES.
SE_RES equals s Sqrt[1 - h(ii)], where s equals Sqrt[Sum over i (y(si) - R x(si))^2 / (N-2)] for N observations and h(ii) equals 1/N + x(si)^2 / Sum over i x(si)^2.
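The steps above can be sketched in Python as follows. This is a minimal illustration using only the standard library, not a reference implementation. The function name `residual_diagnostics` and the example data are hypothetical. The sketch assumes the Pearson correlation, sample standard deviations (N - 1 denominator) throughout, and a residual standard error s taken as the square root of the residual sum of squares over N - 2.

```python
import math
from statistics import mean, stdev


def residual_diagnostics(x, y):
    """Return (standardized, studentized) residuals for one predictor."""
    n = len(x)

    # Standardize both variables (assumption: sample SD, N - 1 denominator).
    mx, sx = mean(x), stdev(x)
    my, sy = mean(y), stdev(y)
    xs = [(v - mx) / sx for v in x]
    ys = [(v - my) / sy for v in y]

    # Pearson correlation R of the standardized variables; on this scale it
    # is also the least-squares slope, and the intercept is zero.
    r = sum(a * b for a, b in zip(xs, ys)) / (n - 1)

    # Raw residuals y(si) - R x(si); their mean is zero.
    raw = [b - r * a for a, b in zip(xs, ys)]

    # Standardized residual: raw residual divided by the SD of the residuals.
    sd_raw = stdev(raw)
    standardized = [e / sd_raw for e in raw]

    # Studentized residual: raw residual divided by s * Sqrt[1 - h(ii)].
    s = math.sqrt(sum(e * e for e in raw) / (n - 2))
    sum_xs2 = sum(a * a for a in xs)
    studentized = [
        e / (s * math.sqrt(1 - (1 / n + a * a / sum_xs2)))
        for a, e in zip(xs, raw)
    ]
    return standardized, studentized


# Hypothetical data: points on a straight line, with one response shifted.
x = list(range(10))
y = [2 * v + 1 for v in x]
y[5] += 10

standardized, studentized = residual_diagnostics(x, y)
flagged = [i for i, t in enumerate(studentized) if abs(t) > 2]
print(flagged)
```

With these made-up data only the shifted observation exceeds the +/- 2 threshold. The threshold itself is a rule of thumb, so flagged points are candidates for inspection and not automatic exclusions.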
Studentized residuals may be evaluated using this spreadsheet.
Outliers without adjusting for other variables
In this case, where we are interested in outliers of a variable unadjusted for any others, the studentized residual is approximately equal to the standardized residual (i.e. a z-score) for large N.
In this case h(ii) equals 1/N and s is the standard deviation since the predicted value for Y is simply its mean.
So it follows that SE_RES = s Sqrt[1 - h(ii)] = SD Sqrt(1 - 1/N) = SD Sqrt[(N-1)/N].
The studentized residual is therefore equal to (Y - mean(Y))/[SD Sqrt((N-1)/N)], which approximately equals (Y - mean(Y))/SD when N is large.
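A quick numerical check of this approximation is sketched below. The function name `z_and_studentized` and the data are hypothetical, and SD is assumed to be the sample standard deviation with an N - 1 denominator. Under those assumptions each studentized value is the z-score multiplied by Sqrt[N/(N-1)], which tends to 1 as N grows.

```python
import math
from statistics import mean, stdev


def z_and_studentized(y):
    """Return (z-scores, studentized residuals) for a single variable."""
    n = len(y)
    m, sd = mean(y), stdev(y)  # assumption: sample SD (N - 1 denominator)

    # With no predictor, h(ii) = 1/N, so SE_RES = SD * Sqrt[(N - 1)/N].
    se_res = sd * math.sqrt((n - 1) / n)

    z = [(v - m) / sd for v in y]
    t = [(v - m) / se_res for v in y]
    return z, t


# Hypothetical small and large samples; compare the two scores for the
# first observation in each.
for sample in ([3, 5, 8, 13, 21], list(range(1, 1001))):
    z, t = z_and_studentized(sample)
    print(len(sample), t[0] / z[0])
```

The printed ratio is noticeably above 1 for the five-point sample and very close to 1 for the thousand-point sample, which is the sense in which the two scores agree for large N.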