
 

NOTICE. Unless otherwise indicated, all materials on this page and linked pages at the blue.temple.edu address and at the astro.temple.edu address are the sole property of Ralph B. Taylor and © 1999 by Ralph B. Taylor. All users have the right to freely access and copy these pages provided that they: acknowledge the source, do not make changes on any pages, and do not charge more than copying costs for distribution.

 

Notes to Accompany Basic Formulas in Simple Regression

Ralph B. Taylor

These notes introduce you to the fundamental concepts we use when we examine how scores on one variable relate to scores on a different variable. We start off considering two kinds of theoretical relations between variables. We then consider relations where the scores on the first variable, and transformation values, determine scores on a second variable, and contrast this situation with instances where error also determines scores on the second variable. We then turn to covariation and covariance, statistical concepts allowing us to learn the strength of association between two variables. Finally, we review the basics of the equation for simple regression.

Following these notes you will find a few pages of definitions. There are also links to a worksheet showing the relevant calculations.

 

Theoretical status

We now have started looking at pairs of variables in scatterplots. There are two possible CONCEPTUAL relationships between the two variables in a pair.

Correlation. First, both may be of equal status. One is not assumed to be theoretically prior to the other. They are both just two variables that happen to be associated with one another, and you are interested in how strongly they are associated. For example, you want to know what the correlation is between the 1985 property crime rate and the 1985 violent crime rate at the state level. You do not necessarily think that one variable causes the other.

In these situations you are interested in the correlation or covariation between the two variables. Correlation and covariation are measures of association, and you can see their definition in the list at the end. You may want to examine the scatterplots, and see how the variables correlate.

Regression. Alternately, you may be interested in how well one variable predicts the other. One variable is the predictor, another is the outcome. It is only when you have two variables, one that is, for theoretical reasons, a predictor, and one that is, for theoretical reasons, an outcome, that regression is appropriate. The example given in the other notes is a case in point: we assume that the economic level in the state (AVGPAY) "drives" the property crime rate. Theory says one is the cause (X) variable and the other is the outcome (Y) variable.

The foundational logic of regression is one or more predictors, and one outcome. In simple regression we concentrate on just one predictor. When we move on to multiple regression we have more than one predictor.

 

Functional vs. Statistical Relationships

The mathematics of the relationship between the two variables can be one of two varieties: functional or statistical.

Functional. In the case of a functional relationship, the two variables, X and Y, are related by a specific mathematical function or transformation. The Y variable is defined solely in terms of corresponding X values and the transformations made to those values.

We have already looked at a large number of pairs of variables where the relationship between the members of the pair is a functional relationship. For example:

LVINUM85=log(1+VIOL85)

LVIRA85=log(1+VIOLRA85)

QPRRA85=(PROPRA85)**.5

QPRNUM85=(PROPER85)**.5

When two variables have a functional relationship, several things follow.

1. When you look at a scatterplot of the two variables, the X and Y values will always form a perfect line or a perfect curve. All the points will fall on that line or that curve.

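2. When the function is a straight line, the correlation between the two variables will always be +1.0 or -1.0. As an example of the latter, consider the relationship Y=-1*X. A situation of linear dependence exists between the two variables when the correlation between them is perfect (r=-1.0 or r=1.0). When the function is a curve, such as Y=-1*[LOG(1+X)], Y is still completely determined by X, but the ordinary (Pearson) correlation between X and Y falls short of -1.0 because that correlation measures only linear association; the correlation between Y and LOG(1+X) is exactly -1.0.

3. Because Y is completely determined by X, the variance of Y is completely predictable from X, and, when the function can be reversed, vice versa. When the correlation is perfect (-1.00 or +1.00), if you imagined the variance of Y as a circle, and the variance of X as a separate circle, the two would be perfectly set one on top of another; there is 100% overlap.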

4. The relationship between X and Y can be described by the following equation:

Y = A + B*X**C

where A, B, and C are different constants that define different relationships. When the relationship is functional you the researcher are intentionally setting each of the constants.

Here are some examples phrased in terms of SPSS COMPUTE statements, and the associated values for the constants. You would substitute in the names of specific variables for X and Y.

TRANSFORMATION                A=       B=       C=
compute Y=X+3                 3        1        1
compute Y=5+X                 5        1        1
compute Y=(.120*X)            0        .12      1
compute Y=(-.15*X)            0        -.15     1
compute Y=(-.001*X)-32        -32      -.001    1
compute Y=sqrt(X)             0        1        .5
compute Y=.375+X**.333        .375     1        .333
compute Y=4.12+1.02*X**2      4.12     1.02     2

NOTICE: The bigger B gets in absolute value, the steeper the line gets; the closer B gets to 0, the flatter the line gets.

NOTICE FURTHER: As C gets away from 1, the line starts to curve; with a positive B and positive X values, you have a line that curves upward (C > 1) or a line that bends downward and flattens out (0 < C < 1).
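The table above can also be sketched outside SPSS. The following is a minimal Python illustration, not part of the original course materials; the function name `functional_y` and the X values are assumptions chosen only to show how A, B, and C define a functional relationship.

```python
def functional_y(x, a, b, c):
    """Return Y = A + B * X**C for one X value (a purely functional relationship)."""
    return a + b * x ** c

# Hypothetical X values, chosen only for illustration.
xs = [0, 1, 4, 9, 16]

# compute Y=X+3      -> A=3, B=1, C=1
line = [functional_y(x, 3, 1, 1) for x in xs]

# compute Y=sqrt(X)  -> A=0, B=1, C=.5
curve = [functional_y(x, 0, 1, 0.5) for x in xs]

print(line)   # every point falls exactly on a straight line
print(curve)  # every point falls exactly on a curve
```

Because no error term is involved, rerunning the code always gives the same Y for the same X.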

Statistical. Alternately, two variables may have a statistical relationship, where the values of the outcome variable on a case are partly determined by the values of the predictor variable on the same case, and partly determined by other factors that we have not measured, or have not included in our equation. We lump these other factors under error and represent it with e or E.

Error may subsume many different things, such as other causes and measurement error. We will be talking a lot later about error in regression, and about error terms.

The relationship between X and Y is described by equations like this:

Y = A + B*X**C + E

where again, A, B, and C represent constants that are derived. E simply represents an error term. E is not a constant, but has a specific value for each case. We will talk later about how you find out what that value is.

Let's talk a bit more about each of these constants. As with a functional relationship:

A = the value of Y at X=0. This data point (0,A) is referred to as the Y INTERCEPT.

B = the slope of Y on X, expressed in the "raw" units of each variable. B is an extremely critical constant, and we will spend a lot of time deriving and interpreting B weights, doing statistical tests of them, and so forth. There are several different ways that you can think of the B constant.

* RISE/RUN. As you go out 1 unit on the X axis, how many units do scores increase on the Y axis?

* SLOPE. B reflects the slope of the regression line shown in a scatterplot. (We will get to the regression line shortly. It is a line that you use to describe the relationship between X and Y. It has a number of special properties.) If B=0, then the line is perfectly flat. We would say that it does not matter what the X value is, Y values neither increase nor decrease as X values increase or decrease.

 

Note that throughout the slope is discussed in terms of the units in which the variables are originally expressed. That is why another word for the slope is the UNstandardized regression coefficient. We will later talk about this in standardized terms, but not yet.

Here are some examples of intercepts and slopes from actual SPSS runs using the US DATA ecological file.

X            Y            A            B
POP85        PRISNR85     81.346       1.728
POP85        PROPER85     -32.02       0.052

How would you interpret each?

Turning functional into statistical relationships. You can turn a functional relationship into a statistical one by introducing error. We can introduce error by using random functions in SPSS such as RV.NORMAL (random variable, normally distributed) and RV.UNIFORM (random variable, uniformly distributed).

Here is a functional relationship:

 

COMPUTE QVINUM85=SQRT(VIOL85) .

 

Here is a statistical relationship:

 

COMPUTE SR1=SQRT(VIOL85)+(10*RV.UNIFORM(0,1)) .

 

We have now introduced error through RV.UNIFORM. But how much have we introduced? What is the relationship between the portion of Y due to the uniform random number (URN) and the portion of Y due to VIOL85? Will the XY plot of SR1*VIOL85 still have data points close to the curve? Does it still look like a functional relationship?

Now suppose that you computed variables like these:

 

COMPUTE SR2=SQRT(VIOL85)+(100*RV.UNIFORM(0,1)) .

COMPUTE SR3=SQRT(VIOL85)+(500*RV.UNIFORM(0,1)) .

 

Now what happens when you plot the original variable (VIOL85) against the new transformed+random variable?

We are trying to make two points here. First, you can purposely introduce error into a functional relationship, and so transform it into a statistical relationship. Once you have done this the scatterplot starts to look different, but noticeably so only if you introduce a substantial amount of error relative to the rest of the Y variable. Second, the amount of error in a Y*X relationship can vary. It can be a small amount. Or it can be a large amount.
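The effect of adding more and more error can be sketched with a short simulation. This is an illustrative Python translation of the SR1, SR2, and SR3 idea, not the SPSS run itself; the `viol85` values are made-up stand-ins for the real variable, and `pearson_r` and `with_error` are hypothetical helper names.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation computed from deviation scores."""
    n = len(xs)
    x_avg = sum(xs) / n
    y_avg = sum(ys) / n
    sum_xy = sum((x - x_avg) * (y - y_avg) for x, y in zip(xs, ys))
    sum_xx = sum((x - x_avg) ** 2 for x in xs)
    sum_yy = sum((y - y_avg) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

random.seed(1985)  # fixed seed so the run is repeatable

# Hypothetical stand-in for VIOL85: 50 made-up counts, not the course data.
viol85 = [100 + 800 * i for i in range(50)]
functional = [math.sqrt(v) for v in viol85]  # like QVINUM85

def with_error(scale):
    """Like COMPUTE SR = SQRT(VIOL85) + (scale * RV.UNIFORM(0,1))."""
    return [math.sqrt(v) + scale * random.uniform(0, 1) for v in viol85]

r_by_scale = {scale: pearson_r(functional, with_error(scale))
              for scale in (10, 100, 500)}

for scale, r in r_by_scale.items():
    print(f"error scale {scale:>3}: r with SQRT(VIOL85) = {r:.3f}")
```

With a small multiplier the new variable still tracks SQRT(VIOL85) closely; with a large multiplier the random portion swamps the functional portion and the correlation drops.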

Usually in an X-Y relationship we do not introduce error, but we DO try to find out exactly how much error there is in the relationship. We will be seeing a little later how this is done.

 

More on Covariation

Covariation is a measure of how much x deviation scores and y deviation scores vary together:

Sum(x*y)

Covariance is average covariation:

Sum(x*y) / N

See linked worksheet for a calculation example.

Remember here that we are working in deviation scores (x=X-Xavg.)

To help visualize what covariation is about, you can take a scatterplot and divide it into different sections. Divide the X or horizontal axis into an area below Xavg. and an area above Xavg. You can do likewise for Y, dividing it into an area below the Y mean, and an area above the Y mean. We now have four sections of a scatterplot:

 

 







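Y AXIS
  |
  |  REGION D                  |  REGION A
  |  above Y mean              |  above Y mean
  |  & below X mean            |  & above X mean
  |  --------------------------+--------------------------
  |  REGION C                  |  REGION B
  |  below Y mean              |  below Y mean
  |  & below X mean            |  & above X mean
  |
  +-------------------------------------------------------- X AXIS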

OBSERVE:

regions A, C:(x*y) results in positive values

regions B, D:(x*y) results in negative values

If we have a positive relationship between X and Y, then most of the data points in the scatterplots will fall into the A and C regions of the graphs, resulting in a total covariance value that is positive. When you have an X value below its mean you are likely to have a Y value below its mean, and vice versa. If Y is above its mean, X probably will be above its mean too.

Conversely, if we have a negative relationship, most of the data points will fall into the B and D sections of the graph resulting in a total covariance that is negative. Here, if you have an X value above its mean you will have a Y value BELOW its mean, and vice versa.

So a key factor that determines covariance is the relative number of data points in each of the four sections of the scatterplot described above.

Take a look at the scatterplot of PROPRA85*AVGPAY from the ecological data file. Find the mean on X and the mean on Y, and divide the plot up into quadrants. Yes, just draw two lines on the plot.

You will see that the majority of the data points fall into the top right (A) and bottom left (C) sections of the plot. Each (x*y) value in these sections of the plot will result in a positive contribution to total covariation.

Another key factor influencing the total covariation value is how far away specific points are from their respective mean. If you have an (x,y) data point that is far away from Xavg. and from Yavg. it may multiply out to an extremely large value. If it is in regions A or C it will be a positive value, if in regions B or D a negative value. So you have to think not only of the region, but the actual values. Look at the table from the spreadsheet. Look at Alaska for example and its associated deviation scores, compared to the average deviation scores.
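Both factors, the region a case falls in and how far it sits from the two means, can be seen in a small calculation. This is a sketch in Python with five made-up cases (not the ecological data file); the `region` helper is a hypothetical name that simply applies the A, B, C, D labels used above.

```python
# Hypothetical raw scores for five cases (not from the ecological data file).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

n = len(X)
x_avg = sum(X) / n
y_avg = sum(Y) / n

def region(x_dev, y_dev):
    """Label the quadrant of a case from its deviation scores."""
    if x_dev == 0 or y_dev == 0:
        return "on a mean line"  # contributes 0 to covariation
    if x_dev > 0 and y_dev > 0:
        return "A"
    if x_dev > 0 and y_dev < 0:
        return "B"
    if x_dev < 0 and y_dev < 0:
        return "C"
    return "D"

covariation = 0.0
for big_x, big_y in zip(X, Y):
    x = big_x - x_avg  # deviation score on X
    y = big_y - y_avg  # deviation score on Y
    covariation += x * y
    print(f"X={big_x} Y={big_y} x={x:+.1f} y={y:+.1f} "
          f"x*y={x * y:+.1f} region={region(x, y)}")

covariance = covariation / n  # average covariation
print("covariation:", covariation)
print("covariance:", covariance)
```

In this made-up example the first case is far below both means, so it alone contributes most of the total covariation.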

 

Why are we interested in the degree of covariation?

The reason we calculate the covariation, covariance, and slope, is that we are interested in the degree of association between two variables. We want to know, for descriptive purposes or for theoretical purposes, exactly what that slope is. We want to be able to characterize the slope as either strongly negative, weakly negative, weakly positive, or strongly positive.

In addition, we also may have theoretical propositions that we want to test. For example, we may have a null hypothesis that says that the slope of PROPRA85 on AVGPAY is 0, and we may want to attempt to reject that null hypothesis.

A NOTE ON LANGUAGE. When we discuss slope we say "the slope of Y on X" even though we know that Y is the dependent variable.

Putting both variables on the same metric

AVGPAY is measured in dollars, and PROPRA85 is measured in reported property crimes per 100,000 people in 1983 at the state level. Our interpretation of the b weight is in terms of these units. For every one dollar increase in average pay we see a corresponding increase in the reported property crime rate of .206 crimes per 100,000 persons. Or stated differently, for every $5 increase in average annual pay, there is about one additional reported property crime per 100,000 residents (5 * .206 = 1.03).
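The raw-unit reading of the slope is just multiplication, which the following minimal Python sketch spells out. The slope .206 comes from the notes; the function name `expected_change_in_y` and the chosen dollar amounts are illustrative assumptions.

```python
b = 0.206  # slope of PROPRA85 on AVGPAY reported in these notes

def expected_change_in_y(slope, change_in_x):
    """Expected change in Y (raw units) for a given change in X (raw units)."""
    return slope * change_in_x

# Crimes per 100,000 persons associated with increases in average annual pay.
print(expected_change_in_y(b, 1))     # one more dollar
print(expected_change_in_y(b, 5))     # five more dollars
print(expected_change_in_y(b, 1000))  # a thousand more dollars
```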

But suppose we want to standardize things. We can do this by transforming the X and Y scores into z-scored variables. Remember the formula for a z score is

ZX = (X - Xavg.) / sX

If scores on a variable are transformed into z scores, the mean of the transformed scores is 0, and the standard deviation is 1.0. So one way we can get to a "standardized" slope is by transforming our variables into z scored variables, and re-calculating what the unstandardized regression coefficient (the slope) would be. This would now be called a standardized regression coefficient.

We can also use a shortcut to get there. We "equate" the X and Y variables by multiplying the slope by the ratio sX / sY. What we are doing here is standardizing the different standard deviations relative to each other.

In regression the standardized regression coefficient is referred to as "beta" (β).

When interpreting the beta weight you want to remember that the units have been standardized, and this changes your interpretation. The relationship is no longer expressed in raw units, but rather in standard deviations.

So if we have a β = +.466 when we regress PROPRA85 on AVGPAY we interpret it as follows:

For every standard deviation unit increase in AVGPAY [X VARIABLE], we see a corresponding increase of .466 standard deviations in PROPRA85 [Y VARIABLE].

Alternatively you could say:

For every standard deviation increase in AVGPAY, scores on PROPRA85 increase 47% of a standard deviation.
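The two routes to the standardized slope described above (z-scoring both variables, or multiplying the raw slope by sX / sY) should give the same number. Here is a minimal Python sketch that checks this on five made-up cases; the data and the helper names `mean`, `sd`, `slope`, and `z_scores` are assumptions for illustration, not the course data or SPSS output.

```python
import math

# Hypothetical raw scores (not the AVGPAY / PROPRA85 data).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

def mean(values):
    return sum(values) / len(values)

def sd(values):
    """Standard deviation with N in the denominator, as in these notes."""
    avg = mean(values)
    return math.sqrt(sum((v - avg) ** 2 for v in values) / len(values))

def slope(xs, ys):
    """Unstandardized slope: covariance of X and Y divided by variance of X."""
    x_avg, y_avg = mean(xs), mean(ys)
    cov = sum((x - x_avg) * (y - y_avg) for x, y in zip(xs, ys)) / len(xs)
    return cov / sd(xs) ** 2

def z_scores(values):
    """Transform raw scores to z scores: (X - Xavg.) / sX."""
    avg, s = mean(values), sd(values)
    return [(v - avg) / s for v in values]

b = slope(X, Y)

# Route 1: z-score both variables, then recompute the slope.
beta_from_z = slope(z_scores(X), z_scores(Y))

# Route 2: the shortcut, b * (sX / sY).
beta_shortcut = b * sd(X) / sd(Y)

print(round(b, 3), round(beta_from_z, 3), round(beta_shortcut, 3))
```

The raw slope and the standardized slope differ here because X and Y have different standard deviations; the two standardized values agree.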

Here are some other results using ecological data file for you to interpret.

X            Y            β
POP85        PRISNR85     .945
POP85        PROPER85     .979

In simple regression, but only in simple regression, the β weight is also equivalent to the correlation between X and Y (rXY). The "straight" correlation is also sometimes called the "zero-order" correlation, because the influence of additional variables has not been removed from the X-Y relationship. We will talk about how to remove that influence a bit later.

 

The simple correlation has an important property. When you square this value it represents:

 

- the extent to which the covariance of X and Y overlaps with the variance of Y

- the percentage overlap between the variance of X and the variance of Y;

- the proportion of the variance of the Y variable that is explained by corresponding scores on the X variable, i.e., how much of Y is explained by X.

All of these are acceptable interpretations of r² when we are looking at a simple regression. The interpretation, however, gets slightly different when we get to multiple regression.

 

The Regression Line

Once we have calculated the unstandardized regression coefficient and the constant, we can use this information to draw a line. We talked earlier about transformations where

Y = A + B*X**C

We will leave out the C constant, representing power functions. We are just going to set it to 1 and leave it for a bit. When A= the constant, i.e., the Y value when X = 0, and when B = the unstandardized regression coefficient or the slope, the Y values that are defined by the equation

Y=A+B*X

are now a very special set of Y values. They are the Y values that you would obtain if, given the covariation you have observed between X and Y, Y values were perfectly predictable from X values. If this were the case the values would fall exactly on the line. They are Ypredicted values, so for each case:

Ypredicted_i = A + B*X_i

Note: we have reduced a statistical relationship to a functional relationship. The variables X and Y describe a statistical relationship. The variables X and Ypredicted describe a functional relationship. The difference between the two is the error. For each ith case

Yactual_i = A + B*X_i + e_i

Each case has a certain amount of error associated with it, as represented by this term e_i. It is also termed a residual and it represents prediction error (Hamilton p. 33), our inability to correctly predict Y scores given the corresponding X scores.

 

Each residual for each ith case is defined as

 

e_i = Yactual_i - Ypredicted_i

 

See Figure 2.5, p. 38 in Hamilton for a graphical portrayal. If a case has 0 error, it falls exactly on the line. The farther a case is from the regression line, the larger its residual.
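The split of each actual Y score into a predicted part and a residual can be traced case by case. The sketch below uses Python and five made-up cases (not the ecological data file); the variable names are illustrative assumptions. It computes the slope as covariation divided by the sum of squares of x, which equals covariance divided by variance because N cancels.

```python
# Hypothetical raw scores (not from the ecological data file).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

n = len(X)
x_avg = sum(X) / n
y_avg = sum(Y) / n

# Slope and constant from deviation scores.
sum_xy = sum((x - x_avg) * (y - y_avg) for x, y in zip(X, Y))
sum_xx = sum((x - x_avg) ** 2 for x in X)
b = sum_xy / sum_xx    # covariation / sum of squares of x
a = y_avg - b * x_avg  # Ypredicted when X = 0

# X and Ypredicted have a functional relationship: all points on the line.
y_predicted = [a + b * x for x in X]

# Each residual is the actual score minus the predicted score.
residuals = [y - yp for y, yp in zip(Y, y_predicted)]

for x, y, yp, e in zip(X, Y, y_predicted, residuals):
    print(f"X={x} Yactual={y} Ypredicted={yp:.2f} e={e:+.2f}")
```

A case with a residual of 0 would sit exactly on the line; in this made-up example none does, and the positive and negative residuals cancel out when summed.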

Glossary

beta

constant

correlation

covariance

covariation

deviation scores

error

functional relationship

intercept

residual

rise/run

slope

standardized regression coefficient

statistical relationship

unstandardized regression coefficient

 

REVIEW OF BASIC FORMULAS

We are going to review some of the basic formulas used in bivariate regression. We will do this by means of a spreadsheet, where the columns are variables, and the rows are cases. (Note: you could do exactly this in SPSS using COMPUTE to define new variables.) The spreadsheet on which this is based appears in a separate linked file. Column numbers here refer to column numbers in the spreadsheet itself (see handout).

Two variables of interest. We focus on AVGPAY, the average annual pay per worker in 1983, and PROPRA85, the rate of property crimes reported per 100,000 persons in 1983. All measures are US state level measures. AVGPAY is the X or predictor variable and PROPRA85 is the Y or dependent variable.

Formulas to derive

DEVIATION SCORES

(x, y) NOTE: raw scores are capitalized (X, Y); deviation scores are lower case (x, y)

For each case: x = X - Xavg. y = Y - Yavg. [Cols. 3 & 6]

SUMS OF SQUARES

You square each deviation score, and add them up. [Col. 4 & 7]

What you are getting for each variable is the sum of squared differences around the respective variable's mean.

VARIANCE

VARIANCE (X) = ( Sum (x²) / N ) [Col. 4]

VARIANCE (Y) = ( Sum (y²) / N ) [Col. 7]

STANDARD DEVIATION

STANDARD DEVIATION (Sx): Since VAR (X) = Sx², then Sx = Sq. Rt. (VAR (X))

Alternatively, Sx = Sq. Rt. (Sum (x²)) / Sq. Rt. (N)
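The deviation score, sum of squares, variance, and standard deviation steps can be followed in a few lines. This is a minimal Python sketch on made-up scores, not the worksheet itself; it also compares the results with the standard library's `statistics.pvariance` and `statistics.pstdev`, which likewise divide by N rather than N - 1.

```python
import math
import statistics

X = [1, 2, 3, 4, 5]  # hypothetical raw scores

n = len(X)
x_avg = sum(X) / n
deviations = [value - x_avg for value in X]       # x = X - Xavg.
sum_of_squares = sum(d ** 2 for d in deviations)  # Sum of squared deviations

variance = sum_of_squares / n                     # VARIANCE (X)
sd_from_variance = math.sqrt(variance)            # Sq. Rt. (VAR (X))
sd_alternative = math.sqrt(sum_of_squares) / math.sqrt(n)

# The "population" functions in the statistics module also divide by N.
print(variance, statistics.pvariance(X))
print(sd_from_variance, sd_alternative, statistics.pstdev(X))
```

Note that many packages report the N - 1 version by default, so results can differ slightly from these formulas when N is small.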

COVARIATION

COVARIATION -- a measure of how X and Y vary together - (Sum (xy)) [Col. 8]

COVARIANCE

COVARIANCE (COVxy) -- a measure of the average covariation: ((Sum (xy)) / N )

SLOPE

SLOPE: The number of Y units by which the Y variable changes as the X variable increases by one unit of X.

b = (COVxy) / VARIANCE (X) [Col. 9]

The formula examines how much X and Y vary together and controls for the variance of X; it is covariation per unit variance of X.

CONSTANT

CONSTANT: a = Yavg. - ( b * Xavg. ). It is the point on the regression line where X = 0. It gives you the Ypred at X = 0.

REGRESSION LINE

REGRESSION LINE (simple): The points on the line Ypred = a + b*X. The points on the regression line generate the smallest unexplained SUMS OF SQUARES (SSres), and therefore the largest SSpred under the assumption of linearity.

CORRELATION

CORRELATION: it takes the slope and standardizes it; it tells you what the slope is if each variable has a mean of 0 and a standard deviation of 1

b = rxy * (Sy/Sx) ==> b * Sx = rxy * Sy ==> b * (Sx/Sy) = rxy

For AVGPAY and PROPRA85: 0.206 * (2283.67 / 1010) = 0.466
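The worked line above can be checked in both directions. The Python sketch below uses only the three numbers shown in the notes; it assumes, from their position in the formula, that 2283.67 is Sx (AVGPAY) and 1010 is Sy (PROPRA85), and the variable names are illustrative.

```python
b = 0.206      # unstandardized slope reported in the notes
s_x = 2283.67  # assumed to be the standard deviation of AVGPAY (X)
s_y = 1010     # assumed to be the standard deviation of PROPRA85 (Y)

# From slope to correlation: b * (Sx / Sy) = rxy
r_xy = b * (s_x / s_y)

# And back again: b = rxy * (Sy / Sx), starting from the rounded r of 0.466
b_back = 0.466 * (s_y / s_x)

print(round(r_xy, 3))
print(round(b_back, 3))
```

Small differences in later decimal places come from rounding in the reported values.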

R, R², AND EXPLAINED VARIANCE.

R, R², AND EXPLAINED VARIANCE. As a result of your simple regression you have decomposed each Y score into two portions: an explained or predicted portion, and a residual or unexplained portion. Across all cases you can add up your Y predicted scores (Yp) and get their variance. Remember each Yp_i = a + b*X_i. You can also do likewise for each residual. Remember, each residual is defined as Yres._i = Y_i - Yp_i. The ratio (VAR Yp / VAR Y) is the ratio of your explained Y variance to your total Y variance; this equals R². The ratio (VAR Yres. / VAR Y) is the ratio of your unexplained Y variance to your total Y variance. This = ( 1 - R² ).

TOTAL VARIANCE Y = VARIANCE UNEXPLAINED Y + VARIANCE PREDICTED Y. You can also state this same relationship in terms of sums of squared differences about the mean: TOTAL SS = UNEXPLAINED SS + PREDICTED SS
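The decomposition can be verified numerically. The following Python sketch uses five made-up cases rather than the state data, and the variable names (`ss_total`, `ss_predicted`, `ss_unexplained`) are illustrative assumptions. It fits the simple regression, then checks that the predicted and unexplained sums of squares add up to the total.

```python
# Hypothetical raw scores (not from the ecological data file).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

n = len(X)
x_avg = sum(X) / n
y_avg = sum(Y) / n

# Slope and constant from deviation scores.
sum_xy = sum((x - x_avg) * (y - y_avg) for x, y in zip(X, Y))
sum_xx = sum((x - x_avg) ** 2 for x in X)
b = sum_xy / sum_xx
a = y_avg - b * x_avg

y_predicted = [a + b * x for x in X]

# Sums of squared differences about the mean of Y.
ss_total = sum((y - y_avg) ** 2 for y in Y)
ss_predicted = sum((yp - y_avg) ** 2 for yp in y_predicted)
ss_unexplained = sum((y - yp) ** 2 for y, yp in zip(Y, y_predicted))

r_squared = ss_predicted / ss_total

print("TOTAL SS      :", round(ss_total, 3))
print("PREDICTED SS  :", round(ss_predicted, 3))
print("UNEXPLAINED SS:", round(ss_unexplained, 3))
print("R squared     :", round(r_squared, 3))
print("1 - R squared :", round(ss_unexplained / ss_total, 3))
```

Dividing each sum of squares by N turns the same identity into the variance version stated above.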