FI=LAB9907
Lab (and homework):
Examining Residuals, Plotting Residuals, Getting Rid of Outliers
OBJECTIVES. In this lab you will
1) save residuals and various residual diagnostics
from a simple regression run
2) look at the residuals
3) decide if they meet or violate the assumptions
about residuals
4) consider steps to take when you have a troubling
case, i.e., a case that is markedly influencing the results of a regression.
5) Repeat steps 1-4 with a different problem
6) In this lab we are going to try and wean ourselves
from MOUSE CLICKS. What follows at the end of this lab is the
COMMANDS you want the program to generate. I clearly indicate what
each section of the different commands accomplishes. What you want to
do is to see if you can figure out how to tell it to do what you are asking.
You can CHECK and see what commands you are generating
by looking at the OUTPUT file - your commands appear there.
If you are running a FULL version of SPSS you could
just submit these commands in a syntax box, and they would run.
NOTE ABOUT USING PORTIONS OF THE COMMANDS AS A SYNTAX FILE: I think I have figured out why the command files would not run. The browser does not save spaces like tabs unless they are hard spaces. I have tried this time around to put in hard spaces, so the program can distinguish between commands and subcommands.
EXAMPLE 1:
Predicting 1990 State Reported Violent Crime Rate from the 1985 Rate
You want to see whether, in terms of reported violent crime rates, there are some particular states that increased a lot, compared to others. Are there some that decreased a little, compared to others? In other words, what states increased MORE than predicted? What states increased LESS than predicted?
We have established earlier, by means of a dependent
t-test, that the 1990 rate (average = 534) was significantly higher
than the 1985 rate (average=406). So we know that on average the
reported violent crime rate went up for the states.
Since skewness of these two variables is within
acceptable limits (around .5) we do not need to do any transformations.
(Of course, you would always look first at your
univariate statistics to see if there are any problems.)
You examine your bivariate plot and the data suggest
there are no outliers.
So now you want to run the simple regression.
Before we start we are going to sort the file by the
independent variable so that particular cases will be easier to find.
Now we are going to run the regression, and SAVE
several features of the residuals. We are particularly interested in:
VARIABLE NAME                      | WHAT SPSS WILL CALL IT
Unstandardized predicted values    | PRE_1 Predicted Value
Unstandardized residual values     | RES_1 Residual
Studentized residuals              | SRE_1 Studentized Residual
Cook's D (distance)                | COO_1 Cook's Distance
Leverage                           | LEV_1 Leverage
NOTE. The numeric extension SPSS adds to the end of the names of these variables WILL VARY. If these are the first ones being put into the data file, their extension will be _1. But if they are the second their extension will be _2. And so on. The only way to recall what these residuals and residual diagnostics correspond to IS TO ADD VARIABLE LABELS TO THESE VARIABLES AS SOON AS THEY ARE CREATED. See your SPSS manual on how to do this. It is not hard. Ask me in lab if you are still having trouble figuring it out. But when you start running lots of regressions and saving lots of residuals and residual diagnostics, it can get ugly fast.
We are not going to get any plots at this moment, but will get some
plots with the residuals later.
AFTER RUNNING THE REGRESSION:
* At the end of the output it tells you the names of
the new variables it is saving
* Save the file again with its new name
AN EXPLANATION OF THE NEW VARIABLES YOU HAVE ADDED TO YOUR FILE
Here is what they mean. (See NOT9907 for more details.)
UNSTANDARDIZED PREDICTED VALUE [PRE_1]
The Ypredicted score as defined by the
equation that generated the estimate. In this regression
VIOLRA90predicted = 14.319+(1.278*VIOLRA85)
The set of points (X,Ypredicted) are the
points on the regression
line Ypredicted = A + B*X.
UNSTANDARDIZED RESIDUAL [RES_1]
These are the raw residual scores. Remember:
Yresidual = Yactual - Ypredicted
On a regression plot, for each case, it is the vertical
distance between an (Xactual, Yactual)
data point and the regression line (Xactual, Ypredicted).
Since it is a vertical distance, X stays the same, and we can drop
it out. So it is the distance between Yactual and Ypredicted.
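The two definitions above can be checked with a few lines of arithmetic. This is only a sketch in Python (not SPSS syntax): the intercept and slope are the ones reported in this lab, while the "hypothetical state" and the function names are made up for illustration.

```python
# Coefficients reported in this lab for the state-level regression.
INTERCEPT = 14.319
SLOPE = 1.278


def predicted_violra90(violra85):
    """Ypredicted = A + B*X, using the lab's fitted equation."""
    return INTERCEPT + SLOPE * violra85


def residual(actual, predicted):
    """Yresidual = Yactual - Ypredicted."""
    return actual - predicted


# The 1985 average from the lab (406) should predict a value
# close to the 1990 average (534).
pred_at_mean = predicted_violra90(406)
print(round(pred_at_mean, 3))

# Hypothetical state (made-up numbers): 1985 rate 500, 1990 rate 700.
hypo_pred = predicted_violra90(500)
hypo_res = residual(700, hypo_pred)
print(round(hypo_pred, 3), round(hypo_res, 3))
```

A positive residual, as for the made-up state here, means the state's 1990 rate was higher than its 1985 rate predicted.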
LEVERAGE [LEV_1]
"Cases with high leverage pull the regression
surface [line] toward them more than other cases do" (Darlington
p. 356; see also Hamilton pp. 130-131). It is most relevant when you
have two or more predictors. (We are not there yet. When you have
more than one predictor it tells you about the extent to which a case
has an atypical pattern of scores on the predictor variables.)
When you have one predictor, leverage can be high if you have a case
scoring much higher or much lower on X than the other cases.
Leverage can range from 1/n to 1. As the number gets
higher, the case is pulling the regression line or surface toward it
more than other cases are.
People disagree on what cutoff should be used for
leverage being too high. Hamilton suggests that if the highest
leverage in the sample is between .2 and .5, the sample is
"risky" in that one case is heavily pulling the regression
line [surface] toward it.
In this example, the case flagged in your output as
having large leverage (.14) was New York, which among all the states
had the highest 1985 reported violent crime rate. Note that the
regression line goes exactly through New York. So although NY has a small
residual it has large leverage.
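To see why a case far out on X gets high leverage, here is a small Python sketch. It assumes the standard textbook formula for leverage with one predictor, h = 1/n + (X - mean of X)^2 / sum of squared deviations of X, which is not spelled out in this lab. The rates are made up. It also assumes that the value SPSS saves is the centered leverage (h - 1/n), so check your SPSS documentation before comparing saved values with the 1/n-to-1 range.

```python
def leverages(xs):
    """Hat values for one predictor: 1/n + (x - mean)^2 / sum((x - mean)^2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return [1 / n + (x - mean_x) ** 2 / sxx for x in xs]


# Made-up 1985 rates; the last case is far above the others.
rates = [200, 250, 300, 350, 400, 1000]
h = leverages(rates)

# Assumption: SPSS's saved leverage is the centered version, h - 1/n.
centered = [hi - 1 / len(rates) for hi in h]

for x, hi, ci in zip(rates, h, centered):
    print(x, round(hi, 3), round(ci, 3))
```

Notice that leverage depends only on the X scores. A case can sit right on the regression line, with a small residual, and still have the largest leverage in the file.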
COOK'S D [COO_1]
Cook's D provides a measure of influence - how
much that data point actually affects the position of the regression
line (or plane). It tells you how much the regression line (plane)
moves if that case is deleted. (See Hamilton, p. 132; Darlington p.
345 on.)
It measures influence on the model as a whole; so when
we get to multiple predictors if you eliminate a case with a high D
it can change the coefficients of not just one predictor, but the
coefficients of several.
The higher the D, the more influential the case.
See Hamilton's figure 4.14, p. 131.
Unusually influential cases will have D > 4/n
(Hamilton p. 132). So, with our 50 cases here, the cutoff is D > 4/50, i.e., D > .08.
In other words you might want to be suspicious if you
have a case with Cook's D larger than this. You should at least look
at the case if it does have D greater than the cutoff value, and see
where it is in the plot.
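The 4/n rule of thumb is easy to apply once Cook's D has been saved. The Python sketch below computes the cutoff for 50 cases and flags any case above it. The state names and D values are made up for illustration and are not from the lab's data file.

```python
def cooks_d_cutoff(n):
    """Rule-of-thumb cutoff for Cook's D: 4 divided by the number of cases."""
    return 4 / n


def flag_influential(d_by_case, n):
    """Return the names of cases whose Cook's D exceeds 4/n."""
    cutoff = cooks_d_cutoff(n)
    return [name for name, d in d_by_case.items() if d > cutoff]


# Made-up Cook's D values for three hypothetical cases.
made_up_d = {"State A": 0.01, "State B": 0.09, "State C": 0.30}

cutoff_50 = cooks_d_cutoff(50)
flagged = flag_influential(made_up_d, 50)
print(cutoff_50)
print(flagged)
```

A flagged case is not automatically a problem. It is a case to go find in the plot.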
STUDENTIZED RESIDUAL (STUDENT) [SRE_1]
These are studentized residuals. (See Hamilton p.
132.) For each case the studentized residual provides a scale-free
measure of distance from the regression line (or plane when we get to
multiple regression). To see whether one is significant, you do a single sample
t-test with N-p-1 degrees of freedom. (See Darlington p. 357; for
more on the t-test see Hale.) BUT YOU NEED TO MAKE THE BONFERRONI
CORRECTION - we will talk about it; see below.
The studentized t takes into account not only the size
of the residual, but also its leverage. STUDENT increases as:
-- the residual increases
-- the leverage of that case increases
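Both effects can be seen in a short Python sketch. It assumes one common form of the studentized residual, residual / (s * sqrt(1 - leverage)), where s is the standard deviation of the residuals and leverage is the uncentered hat value. Your textbook or your SPSS version may use a slightly different variant (for example, one that re-estimates s with the case deleted), and all numbers below are made up.

```python
import math


def studentized_residual(residual, leverage, s):
    """One common form: e / (s * sqrt(1 - h)); an assumption, see the text above."""
    return residual / (s * math.sqrt(1 - leverage))


s = 100.0  # made-up residual standard deviation

base = studentized_residual(150.0, 0.05, s)
bigger_residual = studentized_residual(250.0, 0.05, s)   # same leverage, larger residual
higher_leverage = studentized_residual(150.0, 0.50, s)   # same residual, larger leverage

print(round(base, 3), round(bigger_residual, 3), round(higher_leverage, 3))
```

Either change makes the studentized value larger than the baseline.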
With STUDENT you can test the hypothesis: (see
Hamilton p. 132):
if case i were eliminated, would there be a
significant shift in the intercept of the regression line?
BUT if you want to do a significance test, i.e.,
really test this hypothesis, you must adjust your alpha level to
control for the number of t-tests you are doing. This is the
BONFERRONI CORRECTION. See Darlington p. 358.
You do this by dividing your alpha level by the number
of tests you are making. After all, if you had 50 t-tests, simply by
chance, what would be the number of significant t-values you would
get at p < .05 (two tailed)?
If you run a frequency distribution on STUDENT you
will see that you have 3 t-values significant at p < .05.
Your adjusted alpha level should be (alpha/n of
t-tests), so if you are working with 50 cases:
adjusted alpha = .05/50 = .001.
At this alpha level how many significant t-values do
we have?
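The arithmetic behind the correction is short enough to check directly. This Python sketch uses only the numbers given above (50 tests, alpha = .05); the function names are made up for illustration.

```python
def expected_false_positives(n_tests, alpha):
    """Number of 'significant' results expected by chance alone."""
    return n_tests * alpha


def bonferroni_alpha(alpha, n_tests):
    """Bonferroni-adjusted alpha: the overall alpha divided by the number of tests."""
    return alpha / n_tests


by_chance = expected_false_positives(50, 0.05)
adjusted = bonferroni_alpha(0.05, 50)
print(round(by_chance, 3))  # about 2.5 significant t-values by chance alone
print(round(adjusted, 4))   # adjusted alpha of .001
```

So finding 3 t-values significant at p < .05 among 50 cases is about what chance alone would produce, which is why the adjusted alpha is the one to use.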
WHICH OF THESE DO I NEED TO KNOW?
For now, you need to understand PREDICTED, RESIDUAL,
STUDENT, AND LEVERAGE. I will not hold you responsible for COOK.
LET'S FIGURE OUT WHICH CASE IS WHICH.
We are going to list some scores for some variables
using the LIST command. Let's do:
An Identifier
Our Predictor
Our Outcome
Our Predicted Y
Our Residual
Studentized Residuals
i.e., STATNAM VIOLRA85 VIOLRA90 PRE_1 RES_1 SRE_1
HOW TO EXPLORE RESIDUALS
The first thing to do is to plot them. Do a histogram,
getting the descriptive statistics as well. You also could do stem
and leaf plots, if you wished, or box and whisker plots.
But even more important is to look at the
relationship between the predictor and the residuals. Run this
scattergram. When you look at this plot you are directly examining
two assumptions in regression. These are important assumptions that
you want to verify:
- equal variance of residuals at different levels of X values
- X values independent of residual values, across the cases
We can talk later in class about what plots of
violations of these assumptions would look like. The important point is
that if these assumptions are violated, statistical tests are inappropriate.
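As a rough numeric companion to the plot, the Python sketch below fits a line to made-up data whose scatter widens as X grows, then compares the variance of the residuals for the lower and upper halves of X. The data, the split into halves, and the function names are all assumptions for illustration. This is not a formal test, and it does not replace looking at the scattergram.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def variance(values):
    """Sample variance (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


# Made-up data, already sorted by X; the scatter widens as X grows.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 9.0, 13.5, 11.0, 18.5]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

half = len(xs) // 2
var_low = variance(residuals[:half])    # residual spread at low X
var_high = variance(residuals[half:])   # residual spread at high X
print(round(var_low, 3), round(var_high, 3))

# Least squares forces these two sums to zero, so the straight-line
# correlation of X with the residuals is always zero.
print(round(sum(residuals), 6), round(sum(x * r for x, r in zip(xs, residuals)), 6))
```

Because the straight-line correlation between X and the residuals is zero by construction, the thing to look for in the plot is a pattern such as a fan shape or a curve, not a simple slope.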
EXAMPLE 2
DATA FILE: Individual level file
PURPOSE:
* To create principal components scores. These are
index scores. Look in the main menu under data reduction. DEPENDING
ON THE VERSION OF SOFTWARE YOU HAVE, YOU MIGHT NOT HAVE THIS OPTION,
SO YOU WANT TO BE SURE TO RUN THIS IN THE LAB. We are telling it to
focus on a subset of confidence variables (Q20 through Q30), and then
to come up with 2 common factors, and save the principal component
scores. Don't worry about not having the background to do this
now. We will get it later. But just be sure you run it, get two
factors, and save the scores. It gives
us a nice outcome with some interesting predictors. If you have
problems getting these created there is an alternate procedure noted.
* To predict CONFI1 -
confidence in public institutions - using age.
* Look at the results and at the residuals.
* Use the residuals plotted, and residual diagnostics,
to decide if we are violating some important assumptions behind regression.
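The alternate procedure listed with the commands builds CONFI1 as the average of z scores, using MEAN.3 so that a respondent needs at least 3 valid items to get a score. Here is a Python sketch of that logic (not SPSS syntax). It assumes z scores are computed with the sample standard deviation (n - 1), and the respondent values are made up.

```python
def z_scores(values):
    """z-score the non-missing values using the sample SD; None stays None."""
    valid = [v for v in values if v is not None]
    n = len(valid)
    mean = sum(valid) / n
    sd = (sum((v - mean) ** 2 for v in valid) / (n - 1)) ** 0.5
    return [None if v is None else (v - mean) / sd for v in values]


def mean_min_valid(values, min_valid=3):
    """Mean of non-missing values, or None when fewer than min_valid are present."""
    valid = [v for v in values if v is not None]
    if len(valid) < min_valid:
        return None
    return sum(valid) / len(valid)


# z-scoring one made-up item across four respondents (one missing).
item_z = z_scores([1, 2, 3, None])
print(item_z)

# Made-up z scores on the seven items for two hypothetical respondents.
respondent_a = [0.5, -0.2, None, 1.1, 0.0, None, 0.6]
respondent_b = [None, None, None, None, 0.3, None, -0.4]
index_a = mean_min_valid(respondent_a)
index_b = mean_min_valid(respondent_b)
print(index_a, index_b)
```

The second respondent has only two valid items, so no index score is computed for that case.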
HOMEWORK FOR NEXT TIME
Focusing on the descriptive statistics for the
residuals, the histogram of the residuals, and the plot of X *
residuals, describe whether or not the assumptions about error terms
that are part of regression appear to be met. When you are describing
the scatterplot, be specific, naming particular data points that seem
particularly important to you. When describing the histogram of
residuals, likewise, be specific.
Also, be specific about which assumption you are
referring to.
Attach all charts.
No more than 4 pages typed, double spaced.
Do this for BOTH regressions.
In addition you should run the following multiple regressions, and bring these printouts to class with you. You do not need to write anything about them. Just run them, and, after having read NOT9908, have a shot at interpreting them.
COMMANDS FOR ECOLOGICAL DATA FILE
COMMANDS TO BE GENERATED
* COMMANDS FOR LAB9907
* COMMENT LINES BEGIN WITH *
* AFTER LAST COMMENT LINE, BEFORE A COMMAND, YOU MUST HAVE: .
* NOTE - CHANGE THE DIRECTORY DEPENDING ON WHERE YOU HAVE YOUR FILES
* FOR EXAMPLE, I HAVE:
* D:\PCW\GRADSTAT
* AS THE LOCATION FOR MY FILES
* YOU MIGHT WANT TO JUST HAVE YOURS ON A
* OR WHATEVER THE NAME IS OF YOUR SUBDIRECTORY
* .
GET
FILE='A:\USDANU11.SAV'.
EXECUTE .
* THIS NEXT COMMAND WILL RUN THE BIVARIATE PLOT
* .
GRAPH
/SCATTERPLOT(BIVAR)=violra85 WITH violra90
/MISSING=LISTWISE
/TITLE= '1990 Violent Crime Rate Predicted from 1985 Rate' 'State Level'+
' Data File'.
*
* NOW WE ARE GOING TO SORT THE DATA FILE BY OUR PREDICTOR VARIABLE
* .
SORT CASES BY
violra85 (A) .
* SAVING THE FILE WITH A NEW NAME AFTER HAVING DELETED EXTRA VARIABLES
* .
SAVE OUTFILE='A:\LAB9907.SAV'
/COMPRESSED.
********************************
*
* NOW RUNNING THE REGRESSION
* NOTICE I AM TELLING IT TO SAVE THE RESIDUAL STUFF WE WANT
* I AM GOING WITH DEFAULT STATISTICS AND OPTIONS OTHERWISE-
* .
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT violra90
/METHOD=ENTER violra85
/SAVE PRED COOK LEVER RESID SRESID .
* BE SURE TO SAVE YOUR DATA FILE AGAIN AFTER THE REGRESSION HAS RUN
*
* EXTREMELY IMPORTANT
**
* NOW I AM GOING TO LIST OUT THE X VARIABLE, THE Y VARIABLE, AND SOME OF THE
* RESIDUAL INFORMATION
* USE THE LIST COMMAND UNDER SUMMARIZE
* .
LIST
VARIABLES=statnam violra85 violra90 pre_1 res_1 sre_1
/CASES= BY 1
/FORMAT= WRAP UNNUMBERED .
******
* NOW WILL GET HISTOGRAM AND STATISTICS FOR RESIDUALS
* I HAVE TURNED OFF THE TABLE
* BUT HAVE ASKED FOR A HISTOGRAM
* .
FREQUENCIES
VARIABLES=res_1 /FORMAT=NOTABLE
/STATISTICS=STDDEV MEAN SKEWNESS SESKW KURTOSIS SEKURT
/HISTOGRAM .
*
*
* NOW WE WILL LOOK AT A SCATTERGRAM OF OUR PREDICTOR SCORES AND RAW RESIDUAL
* SCORES
* .
GRAPH
/SCATTERPLOT(BIVAR)=violra85 WITH res_1
/MISSING=LISTWISE .
COMMANDS FOR INDIVIDUAL LEVEL DATA FILE
******************************
*
* This first set of commands gets the individual data file
* and saves principal component scores
* You should go in to the file afterwards and change the two
* new variables to CONFI1 and CONFI2.
* BE SURE to change the names of the files it retrieves and saves
* so it knows what your files are
* We will be talking more about principal components analysis
* and factor analysis later in the semester
*
******************************
* .
GET
FILE='A:\umet99ee.sav'.
EXECUTE .
FACTOR
/VARIABLES q20 q21 q22 q23 q24 q25 q26 q27 q28 q29 q30
/MISSING MEANSUB
/ANALYSIS q20 q21 q22 q23 q24 q25 q26 q27 q28 q29 q30
/PRINT INITIAL EXTRACTION ROTATION
/FORMAT SORT
/PLOT EIGEN
/CRITERIA FACTORS(2) ITERATE(25)
/EXTRACTION PC
/CRITERIA ITERATE(25)
/ROTATION VARIMAX
/SAVE REG(ALL)
/METHOD=CORRELATION .
***************************************************
* GO IN HERE AND MANUALLY LABEL THE PRINCIPAL COMPONENT SCORES
* I call them CONFI1 and CONFI2
* CONFI1 captures big institutions
* CONFI2 captures cj incl courts, police, cjs, and president (hmm)
***************************************************
***************************************************
* Here is an alternate procedure you can try if the principal components analysis does not
* appear to work.
* Here are the steps:
* 1. Save z scored versions of variables Q21 through Q27 using the DESCRIPTIVES procedure
* 2. Save your data file again
* 3. compute a new index based on the average of those z scores
* Your compute statement will say:
* COMPUTE CONFI1=MEAN.3(ZQ21,ZQ22,ZQ23,ZQ24,ZQ25,ZQ26,ZQ27) .
* EXECUTE .
***********************************************
* END OF ALTERNATE PROCEDURE
***********************************************
* .
SAVE OUTFILE='A:\UMET99FF.sav'
/COMPRESSED.
*****************************
*
* Run a regression where CONFI1 is the outcome and AGE (q2_3) is
* the predictor.
* We will save studentized, predicted score, and residual score
*
*****************************
* .
REGRESSION
/DESCRIPTIVES MEAN STDDEV CORR SIG N
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT confi1
/METHOD=ENTER q2_3
/SAVE PRED RESID SRESID .
*******************
*
* Here is what spss is going to call the three new variables
* it saves:
* PRE_2 Predicted Value
* RES_2 Residual
* SRE_1 Studentized Residual
*
*******************
* .
* NOW WILL GET HISTOGRAM AND STATISTICS FOR RESIDUALS
* I HAVE TURNED OFF THE TABLE
* BUT HAVE ASKED FOR A HISTOGRAM
* NOTE: The variable name res_2 (below) may be different depending on
* how many residuals you already have in your file.
* .
FREQUENCIES
VARIABLES=res_2 /FORMAT=NOTABLE
/STATISTICS=STDDEV MEAN SKEWNESS SESKW KURTOSIS SEKURT
/HISTOGRAM .
*
* NOW WE WILL LOOK AT A SCATTERGRAM OF OUR PREDICTOR SCORES AND RAW RESIDUAL
* SCORES
* .
GRAPH
/SCATTERPLOT(BIVAR)=q2_3 WITH res_2
/MISSING=LISTWISE .
SAVE
OUTFILE='A:\UMET99GG.SAV' /MAP .
EXECUTE .
SOME MULTIPLE REGRESSIONS TO RUN
* From the individual level data file
* .
REGRESSION
/DESCRIPTIVES MEAN STDDEV CORR SIG N
/MISSING MEANSUB
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT control2
/METHOD=ENTER q1 white q78 q73 q2_3
/CASEWISE PLOT(ZRESID) OUTLIERS(3) .