FI=LAB9907

 

Lab (and homework):

Examining Residuals, Plotting Residuals, Getting Rid of Outliers

OBJECTIVES. In this lab you will
1) save residuals and various residual diagnostics from a simple regression run
2) look at the residuals
3) decide if they meet or violate the assumptions about residuals
4) consider steps to take when you have a troubling case, i.e., a case that is markedly influencing the results of a regression.
5) repeat steps 1-4 with a different problem
6) try to wean yourself from MOUSE CLICKS. At the end of this lab are the COMMANDS you want the program to generate, with a note on what each section of commands accomplishes. What you want to do is see if you can figure out how to tell the program to do what the commands ask.
You can CHECK what commands you are generating by looking at the OUTPUT file - your commands appear there.
If you are running a FULL version of SPSS you could just submit these commands in a syntax box, and they would run.

NOTE ABOUT USING PORTIONS OF THE COMMANDS AS A SYNTAX FILE: I think I have figured out why the command files would not run. The browser does not save spaces like tabs unless they are hard spaces. I have tried this time around to put in hard spaces, so the program can distinguish between commands and subcommands.


EXAMPLE 1:

Predicting 1990 State Reported Violent Crime Rate from the 1985 Rate

You want to see whether, in terms of reported violent crime rates, there are some particular states that increased a lot, compared to others. Are there some that decreased a little, compared to others? In other words, what states increased MORE than predicted? What states increased LESS than predicted?

We have established earlier, by means of a dependent t-test, that the 1990 rate (average = 534) was significantly higher than the 1985 rate (average=406). So we know that on average the reported violent crime rate went up for the states.
Since skewness of these two variables is within acceptable limits (around .5) we do not need to do any transformations.
(Of course, you would always look first at your univariate statistics to see if there are any problems.)
You examine your bivariate plot and the data suggest there are no outliers.
So now you want to run the simple regression.
Before we start we are going to sort the file by the independent variable so that particular cases will be easier to find.
Now we are going to run the regression, and SAVE several features of the residuals. We are particularly interested in:


VARIABLE NAME                        WHAT SPSS WILL CALL IT

Unstandardized predicted values      PRE_1   Predicted Value
Unstandardized residual values       RES_1   Residual
Studentized residuals                SRE_1   Studentized Residual
Cook's D (distance)                  COO_1   Cook's Distance
Leverage                             LEV_1   Leverage

NOTE. The numeric extension SPSS adds to the end of the names of these variables WILL VARY. If these are the first ones being put into the data file, their extension will be _1. But if they are the second their extension will be _2. And so on. The only way to recall what these residuals and residual diagnostics correspond to IS TO ADD VARIABLE LABELS TO THESE VARIABLES AS SOON AS THEY ARE CREATED. See your SPSS manual on how to do this. It is not hard. Ask me in lab if you are still having trouble figuring it out. But when you start running lots of regressions and saving lots of residuals and residual diagnostics, it can get ugly fast.
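
One way to add those labels is the VARIABLE LABELS command. What follows is only a sketch: it assumes the saved variables received the _1 extension, and the label wording is a suggestion, not something the program requires.

```spss
* LABEL THE SAVED VARIABLES RIGHT AFTER THE REGRESSION RUNS.
* THE _1 NAMES AND THE LABEL WORDING ARE ASSUMPTIONS - CHECK YOUR OUTPUT.
VARIABLE LABELS
     pre_1 'Predicted VIOLRA90 from VIOLRA85'
    /res_1 'Residual VIOLRA90 from VIOLRA85'
    /sre_1 'Studentized residual VIOLRA90 from VIOLRA85'
    /coo_1 'Cooks D VIOLRA90 from VIOLRA85'
    /lev_1 'Leverage VIOLRA90 from VIOLRA85'.
```

Putting the outcome and the predictor in each label makes it possible to tell one regression's residuals from another's once several sets are in the file.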


We are not going to request any plots from the regression itself at this point; we will plot the residuals afterwards.

AFTER RUNNING THE REGRESSION:
* At the end of the output it tells you the names of the new variables it is saving
* Save the file again with its new name

AN EXPLANATION OF THE NEW VARIABLES YOU HAVE ADDED TO YOUR FILE

 

Here is what they mean. (See NOT9907 for more details.)

UNSTANDARDIZED PREDICTED VALUE [PRE_1]
The Ypredicted score as defined by the equation that generated the estimate. In this regression
VIOLRA90predicted = 14.319 + (1.278*VIOLRA85)
The set of points (X, Ypredicted) are the points on the regression line Ypredicted = A + B*X, where here A = 14.319 and B = 1.278.

 
UNSTANDARDIZED RESIDUAL [RES_1]
These are the raw residual scores. Remember:
Yresidual = Yactual - Ypredicted
On a regression plot, for each case, it is the vertical distance between an (Xactual, Yactual) data point and the regression line (Xactual, Ypredicted). Since it is a vertical distance, X stays the same, and we can drop it out. So it is the distance between Yactual and Ypredicted.
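
To convince yourself of where PRE_1 and RES_1 come from, you could recompute them by hand from the equation above and list them side by side. This is a sketch only: CHKPRED and CHKRES are made-up names for the check, and it assumes the saved variables are called PRE_1 and RES_1 in your file.

```spss
* RECOMPUTE PREDICTED AND RESIDUAL SCORES BY HAND FROM THE EQUATION.
* CHKPRED AND CHKRES ARE HYPOTHETICAL NAMES USED ONLY FOR THIS CHECK.
COMPUTE chkpred = 14.319 + (1.278 * violra85).
COMPUTE chkres = violra90 - chkpred.
EXECUTE.
LIST VARIABLES=statnam pre_1 chkpred res_1 chkres.
```

The hand-computed columns should agree with the saved ones except for small differences, because the coefficients in the equation are rounded to three decimals.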

LEVERAGE [LEV_1]
"Cases with high leverage pull the regression surface [line] toward them more than other cases do" (Darlington p. 356; see also Hamilton pp. 130-131). It is most relevant when you have two or more predictors. (We are not there yet. When you have more than one predictor it tells you about the extent to which a case has an atypical pattern of scores on the predictor variables.) When you have one predictor, leverage can be high if you have a case scoring much higher or much lower on X than the other cases.
Leverage can range from 1/n to 1. As the number gets higher, the case is pulling the regression line or surface toward it more than other cases are.
People disagree on what cutoff should be used for leverage being too high. Hamilton suggests that if the highest leverage in the sample is between .2 and .5, the sample is "risky" in that one case is heavily pulling the regression line [surface] toward it.
In this example, the case flagged in your output as having large leverage (.14) was New York, which among all the states had the highest 1985 reported violent crime rate. Note that the regression line goes exactly through New York. So although NY has a small residual it has large leverage.

COOK'S D [COO_1]
COOK'S D provides a measure of influence: how much that data point actually affects the position of the regression line (or plane). It tells you how much the regression line (plane) moves if that case is deleted. (See Hamilton, p. 132; Darlington p. 345 on.)
It measures influence on the model as a whole; so when we get to multiple predictors if you eliminate a case with a high D it can change the coefficients of not just one predictor, but the coefficients of several.
The higher the D, the more influential the case.
See Hamilton's figure 4.14, p. 131.
Unusually influential cases will have D > 4/n (Hamilton p. 132). So, with our 50 cases here, the cutoff is D > 4/50, i.e., D > .08.
In other words you might want to be suspicious if you have a case with Cook's D larger than this. You should at least look at the case if it does have D greater than the cutoff value, and see where it is in the plot.
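
If you want the program to pick out those cases for you, something like the following would do it. It is a sketch that assumes 50 cases and that Cook's D was saved as COO_1.

```spss
* LIST ONLY THE CASES WITH COOKS D ABOVE 4/N, WHICH HERE IS 4/50.
* TEMPORARY MAKES THE SELECTION APPLY ONLY TO THE NEXT PROCEDURE.
TEMPORARY.
SELECT IF (coo_1 > 4/50).
LIST VARIABLES=statnam violra85 violra90 res_1 coo_1.
```

Because of TEMPORARY, all the cases are back in the file after the LIST runs; nothing is deleted.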

STUDENT [SRE_1]
These are studentized residuals. (See Hamilton p. 132.) For each case the studentized residual provides a scale-free measure of distance from the regression line (or plane when we get to multiple regression). To see whether a studentized residual is significant you do a single sample t-test with N-p-1 degrees of freedom. (See Darlington p. 357; for more on the t-test see Hale.) BUT YOU NEED TO MAKE THE BONFERRONI CORRECTION - we will talk about it; see below.
The studentized t takes into account not only the size of the residual, but also its leverage. STUDENT increases as:
-- the residual increases
-- the leverage of that case increases
With STUDENT you can test the hypothesis: (see Hamilton p. 132):
if case i were eliminated, would there be a significant shift in the intercept of the regression line?
BUT if you want to do a significance test, i.e., really test this hypothesis, you must adjust your alpha level to control for the number of t-tests you are doing. This is the BONFERRONI CORRECTION. See Darlington p. 358.
You do this by dividing your alpha level by the number of tests you are making. After all, if you had 50 t-tests, simply by chance, what would be the number of significant t-values you would get at p < .05 (two tailed)?
If you run a frequency distribution on STUDENT you will see that you have 3 t-values significant at p < .05.
Your adjusted alpha level should be (alpha/n of t-tests), so if you are working with 50 cases:

adjusted alpha = .05/50 = .001.

At this alpha level how many significant t-values do we have?
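
One way to answer that question is to have the program compute a two-tailed p value for each studentized residual and flag the ones below the adjusted alpha. Treat this as a sketch: P_SRE and SIG_BON are made-up names, and the 48 degrees of freedom come from applying the N-p-1 rule above with N = 50 and p = 1, taking p to mean the number of predictors, which is my assumption here.

```spss
* TWO-TAILED P VALUE FOR EACH STUDENTIZED RESIDUAL.
* DF OF 48 ASSUMES N = 50 AND ONE PREDICTOR, FOLLOWING N-P-1.
COMPUTE p_sre = 2 * (1 - CDF.T(ABS(sre_1), 48)).
* FLAG CASES SIGNIFICANT AT THE ADJUSTED ALPHA OF .05/50.
COMPUTE sig_bon = (p_sre < .05/50).
EXECUTE.
FREQUENCIES VARIABLES=sig_bon.
```

In the frequency table, a value of 1 on SIG_BON marks a case that is still significant after the correction, and 0 marks one that is not.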
WHICH OF THESE DO I NEED TO KNOW?
For now, you need to understand PREDICTED, RESIDUAL, STUDENT, AND LEVERAGE. I will not hold you responsible for COOK.

LET'S FIGURE OUT WHICH CASE IS WHICH.

We are going to list some scores for some variables using the LIST command. Let's do:
An Identifier
Our Predictor
Our Outcome
Our Predicted Y
Our Residual
Studentized Residuals
i.e., STATNAM VIOLRA85 VIOLRA90 PRE_1 RES_1 SRE_1

HOW TO EXPLORE RESIDUALS
The first thing to do is to plot them. Do a histogram, getting the descriptive statistics as well. You also could do stem and leaf plots, if you wished, or box and whisker plots.
But even more important is to look at the relationship between the predictor and the residuals. Run this scattergram. When you look at this plot you are directly examining two assumptions in regression. These are important assumptions that you want to verify:
- equal variance of residuals at different levels of X values
- X values independent of residual values, across the cases
We can talk later in class about what plots showing violations of these assumptions would look like. The important point is that if these assumptions are violated, statistical tests are inappropriate.


 



EXAMPLE 2

DATA FILE: Individual level file
PURPOSE:
* To create principal components scores. These are index scores. Look in the main menu under data reduction. DEPENDING ON THE VERSION OF SOFTWARE YOU HAVE, YOU MIGHT NOT HAVE THIS OPTION, SO YOU WANT TO BE SURE TO RUN THIS IN THE LAB. We are telling it to focus on a subset of confidence variables (Q20 through Q30), and then to come up with 2 common factors, and save the principal component scores. Don't worry about not having the background to do this now. We will get it later. But just be sure you run it, get two factors, and save the scores. It gives us a nice outcome with some interesting predictors. If you have problems getting these created there is an alternate procedure noted.
* To predict CONFI1 - confidence in public institutions - using age.
* Look at the results and at the residuals.
* Use the residuals plotted, and residual diagnostics, to decide if we are violating some important assumptions behind regression.
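
Before the regression can refer to CONFI1, the saved principal component scores have to carry that name. One way to do it with commands rather than by hand is sketched below. It assumes the program called the two saved scores FAC1_1 and FAC2_1; check the end of your FACTOR output for the names it actually used. The label wording is only a suggestion.

```spss
* RENAME THE SAVED COMPONENT SCORES SO THE REGRESSION CAN FIND CONFI1.
* FAC1_1 AND FAC2_1 ARE ASSUMED NAMES - CHECK WHAT YOUR OUTPUT SAYS.
RENAME VARIABLES (fac1_1 fac2_1 = confi1 confi2).
VARIABLE LABELS
     confi1 'Confidence: big institutions'
    /confi2 'Confidence: courts, police, cjs, president'.
```

Run this after the FACTOR command and before the REGRESSION command, then save the data file again.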


HOMEWORK FOR NEXT TIME


Focusing on the descriptive statistics for the residuals, the histogram of the residuals, and the plot of X * residuals, describe whether or not the assumptions about error terms that are part of regression appear to be met. When you are describing the scatterplot, be specific, naming particular data points that seem particularly important to you. When describing the histogram of residuals, likewise, be specific.
Also, be specific about which assumption you are referring to.
Attach all charts.
No more than 4 pages typed, double spaced
Do this for BOTH regressions.

In addition you should run the following multiple regressions, and bring these printouts to class with you. You do not need to write anything about them. Just run them, and, after having read NOT9908, have a shot at interpreting them.


COMMANDS FOR ECOLOGICAL DATA FILE


COMMANDS TO BE GENERATED

* COMMANDS FOR LAB9907

* COMMENT LINES BEGIN WITH *

* AFTER LAST COMMENT LINE, BEFORE A COMMAND, YOU MUST HAVE: .

* NOTE - CHANGE THE DIRECTORY DEPENDING ON WHERE YOU HAVE YOUR FILES

* FOR EXAMPLE, I HAVE:

* D:\PCW\GRADSTAT

* AS THE LOCATION FOR MY FILES

* YOU MIGHT WANT TO JUST HAVE YOURS ON A

* OR WHATEVER THE NAME IS OF YOUR SUBDIRECTORY

* .

GET
     FILE='A:\USDANU11.SAV'.
EXECUTE .

* THIS NEXT COMMAND WILL RUN THE BIVARIATE PLOT

* .

GRAPH
     /SCATTERPLOT(BIVAR)=violra85 WITH violra90
     /MISSING=LISTWISE
     /TITLE= '1990 Violent Crime Rate Predicted from 1985 Rate' 'State Level'+
     ' Data File'.

*

* NOW WE ARE GOING TO SORT THE DATA FILE BY OUR PREDICTOR VARIABLE

* .

SORT CASES BY
     violra85 (A) .

* SAVING THE FILE WITH A NEW NAME AFTER HAVING DELETED EXTRA VARIABLES

* .

SAVE OUTFILE='A:\LAB9907.SAV'
     /COMPRESSED.

********************************

*

* NOW RUNNING THE REGRESSION

* NOTICE I AM TELLING IT TO SAVE THE RESIDUAL STUFF WE WANT

* I AM GOING WITH DEFAULT STATISTICS AND OPTIONS OTHERWISE-

* .

REGRESSION
     /MISSING LISTWISE
     /STATISTICS COEFF OUTS R ANOVA
     /CRITERIA=PIN(.05) POUT(.10)
     /NOORIGIN
     /DEPENDENT violra90
     /METHOD=ENTER violra85
     /SAVE PRED COOK LEVER RESID SRESID .

* BE SURE TO SAVE YOUR DATA FILE AGAIN AFTER THE REGRESSION HAS RUN

*

* EXTREMELY IMPORTANT

**

* NOW I AM GOING TO LIST OUT THE X VARIABLE, THE Y VARIABLE, AND SOME OF THE

* RESIDUAL INFORMATION

* USE THE LIST COMMAND UNDER SUMMARIZE

* .

LIST
     VARIABLES=statnam violra85 violra90 pre_1 res_1 sre_1
     /CASES= BY 1
     /FORMAT= WRAP UNNUMBERED .

******

* NOW WILL GET HISTOGRAM AND STATISTICS FOR RESIDUALS

* I HAVE TURNED OFF THE TABLE

* BUT HAVE ASKED FOR A HISTOGRAM

* .

FREQUENCIES
     VARIABLES=res_1 /FORMAT=NOTABLE
     /STATISTICS=STDDEV MEAN SKEWNESS SESKW KURTOSIS SEKURT
     /HISTOGRAM .

*

*

* NOW WE WILL LOOK AT A SCATTERGRAM OF OUR PREDICTOR SCORES AND RAW RESIDUAL

* SCORES

* .

GRAPH
     /SCATTERPLOT(BIVAR)=violra85 WITH res_1
     /MISSING=LISTWISE .


COMMANDS FOR INDIVIDUAL LEVEL DATA FILE


 

******************************

*

* This first set of commands gets the individual data file

* and saves principal component scores

* You should go in to the file afterwards and change the two

* new variables to CONFI1 and CONFI2.

* BE SURE to change the names of the files it retrieves and saves

* so it knows what your files are

* We will be talking more about principal components analysis

* and factor analysis later in the semester

*

******************************

* .

GET
     FILE='A:\umet99ee.sav'.
EXECUTE .

FACTOR
     /VARIABLES q20 q21 q22 q23 q24 q25 q26 q27 q28 q29 q30
     /MISSING MEANSUB
     /ANALYSIS q20 q21 q22 q23 q24 q25 q26 q27 q28 q29 q30
     /PRINT INITIAL EXTRACTION ROTATION
     /FORMAT SORT
     /PLOT EIGEN
     /CRITERIA FACTORS(2) ITERATE(25)
     /EXTRACTION PC
     /CRITERIA ITERATE(25)
     /ROTATION VARIMAX
     /SAVE REG(ALL)
     /METHOD=CORRELATION .

***************************************************

* GO IN HERE AND MANUALLY LABEL THE PRINCIPAL COMPONENT SCORES

* I call them CONFI1 and CONFI2

* CONFI1 captures big institutions

* CONFI2 captures cj incl courts, police, cjs, and president (hmm)

***************************************************

***************************************************

* Here is an alternate procedure you can try if the principal components analysis does not

* appear to work.

* Here are the steps:

* 1. Save z scored versions of variables Q21 through Q27 using the DESCRIPTIVES procedure

* 2. Save your data file again

* 3. compute a new index based on the average of those z scores

* Your compute statement will say:

*             COMPUTE CONFI1=MEAN.3(ZQ21,ZQ22,ZQ23,ZQ24,ZQ25,ZQ26,ZQ27) .

*             EXECUTE .

***********************************************

* END OF ALTERNATE PROCEDURE

***********************************************

* .

SAVE OUTFILE='A:\UMET99FF.sav'
     /COMPRESSED.

*****************************

*

* Run a regression where CONFI1 is the outcome and AGE (q2_3) is

* the predictor.

* We will save studentized, predicted score, and residual score

*

*****************************

* .

REGRESSION
     /DESCRIPTIVES MEAN STDDEV CORR SIG N
     /MISSING LISTWISE
     /STATISTICS COEFF OUTS R ANOVA
     /CRITERIA=PIN(.05) POUT(.10)
     /NOORIGIN
     /DEPENDENT confi1
     /METHOD=ENTER q2_3
     /SAVE PRED RESID SRESID .

*******************

*

* Here is what spss is going to call the three new variables

* it saves:

* PRE_2 Predicted Value

* RES_2 Residual

* SRE_1 Studentized Residual

*

*******************

* .

 

* NOW WILL GET HISTOGRAM AND STATISTICS FOR RESIDUALS

* I HAVE TURNED OFF THE TABLE

* BUT HAVE ASKED FOR A HISTOGRAM
* NOTE: The variable name res_2 (below) may be different depending on how many residuals
* you already have in your file.

* .

FREQUENCIES
     VARIABLES=res_2 /FORMAT=NOTABLE
     /STATISTICS=STDDEV MEAN SKEWNESS SESKW KURTOSIS SEKURT
     /HISTOGRAM .

*

* NOW WE WILL LOOK AT A SCATTERGRAM OF OUR PREDICTOR SCORES AND RAW RESIDUAL

* SCORES

* .

GRAPH
     /SCATTERPLOT(BIVAR)=q2_3 WITH res_2
     /MISSING=LISTWISE .

SAVE
     OUTFILE='A:\UMET99GG.SAV' /MAP .
EXECUTE .


SOME MULTIPLE REGRESSIONS TO RUN


* From the individual level data file

* .

REGRESSION
     /DESCRIPTIVES MEAN STDDEV CORR SIG N
     /MISSING MEANSUB
     /STATISTICS COEFF OUTS R ANOVA
     /CRITERIA=PIN(.05) POUT(.10)
     /NOORIGIN
     /DEPENDENT control2
     /METHOD=ENTER q1 white q78 q73 q2_3
     /CASEWISE PLOT(ZRESID) OUTLIERS(3) .