Multiple regression gives us
the capability to include more than just numerical (also called quantitative)
independent variables. In these notes, we will examine dummy variables and
interaction. To illustrate these concepts, I want to introduce a new example (I
think I just heard some applause).
This example is near and dear to me: it involves a study of faculty pay. Here
is the data. Years of experience (Years) is the independent variable
hypothesized to predict Salary, the dependent variable.
Worksheet 3.2.1
Years | Salary
13 | 72000
13 | 68000
10 | 66000
10 | 64000
14 | 64000
8 | 62000
15 | 61000
11 | 60000
9 | 60000
15 | 59000
5 | 59000
12 | 59000
11 | 58000
6 | 57000
7 | 56000
12 | 55000
6 | 55000
9 | 52000
14 | 51000
7 | 50000
3 | 45000
3 | 44000
4 | 44000
4 | 42000
8 | 41000
5 | 34000
2 | 34000
1 | 30000
2 | 25000
1 | 22000
Review of Linear Relationships
Suppose I wish to test a
simple linear relationship between Salary and Years. The scatter diagram with
the predicted linear regression equation is shown in the Worksheet 3.2.2 Line
Fit Plot.
Worksheet 3.2.2
To determine if the linear regression has utility, I created the regression
summary shown in Worksheet 3.2.3 by using the Regression Add-In Data Analysis
Tool.
Worksheet 3.2.3
SUMMARY OUTPUT

Regression Statistics
Multiple R | 0.78631551
R Square | 0.61829208
Adjusted R Square | 0.60465965
Standard Error | 8132.06260
Observations | 30

ANOVA
 | df | SS | MS | F | Significance F
Regression | 1 | 2999314286 | 2999314286 | 45.354517 | 2.59764E-07
Residual | 28 | 1851652381 | 66130442.18 | | |
Total | 29 | 4850966667 | | | |

 | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | 33119.04762 | 3124.438012 | 10.6000015 | 2.61857E-11 | 26718.9192 | 39519.175
Years | 2314.28571 | 343.6423654 | 6.73457625 | 2.59764E-07 | 1610.36544 | 3018.2059
The Summary Output provides the linear regression equation intercept and slope:
Eq. 3.2.1: Salary = 33119 + 2314 (Years)
The intercept of $33,119 is
the salary a faculty member would make with no experience (a new hire).
However, I can't be sure of this since we did not have any years of experience
equal to zero in the sample. The slope of $2,314 indicates that salary
increases $2,314 for every year of experience.
Recall that our test for practical utility looks at R Square (or R2)
and the Standard Error of the Model. The R Square of 0.618 indicates that Years
explains about 62% of the variation in Salary, and the Standard Error of the
Estimate is $8,132.
To test for statistical utility, we set up the null and alternative hypotheses:
H0: B1 = 0 (the linear regression model is not statistically useful)
Ha: B1 ≠ 0 (the linear regression model is statistically useful)
Since the p-value (2.598E-07)
for the t-Stat is less than alpha of 0.05, we reject the null hypothesis and
conclude that the model has statistical utility.
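These notes use Excel's Regression tool, but if you want to cross-check the output outside Excel, here is a minimal sketch, assuming the numpy and statsmodels libraries are available (this is not part of the course workflow):

```python
# Sketch: reproduce the Worksheet 3.2.3 regression outside Excel.
# Data are the 30 (Years, Salary) pairs from Worksheet 3.2.1.
import numpy as np
import statsmodels.api as sm

years = np.array([13, 13, 10, 10, 14, 8, 15, 11, 9, 15, 5, 12, 11, 6, 7,
                  12, 6, 9, 14, 7, 3, 3, 4, 4, 8, 5, 2, 1, 2, 1])
salary = np.array([72000, 68000, 66000, 64000, 64000, 62000, 61000, 60000,
                   59000, 59000, 59000, 59000, 58000, 57000, 56000, 55000,
                   55000, 52000, 51000, 50000, 45000, 44000, 44000, 42000,
                   41000, 34000, 34000, 30000, 25000, 22000])

model = sm.OLS(salary, sm.add_constant(years)).fit()  # add_constant adds the intercept term
print(model.params)      # ~ [33119.05, 2314.29] -- matches Eq. 3.2.1
print(model.pvalues[1])  # ~ 2.6e-07 -- reject H0: B1 = 0
```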
To check the normality assumption for the errors (residuals), I looked for
outliers in the standardized residual printout and found none (that printout is
not shown in the worksheets above). Next, I produced the residual plot to check
the assumptions of constant error variance and independent errors. It is
reproduced in Worksheet 3.2.4.
Worksheet 3.2.4
Note that for low values of
Years the observations fall below the zero-error line, then sit mostly above
the line (up to about 10 years), and then drop back below the line of zero
error. This pattern indicates that the last two assumptions are not satisfied,
and we might get a better prediction model if we added curvature. We'll skip
curvature in this module and instead evaluate the effect of adding another
variable to improve this model's precision, as we did in Module 3.1; however,
we will now consider a different kind of variable.
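For completeness, a residual plot like Worksheet 3.2.4 can also be drawn in code; this sketch assumes matplotlib and continues from the fitted model in the earlier sketch:

```python
# Sketch: plot residuals against Years to eyeball constant variance and
# independence ("model" and "years" come from the previous sketch).
import matplotlib.pyplot as plt

plt.scatter(years, model.resid)
plt.axhline(0, color="gray", linewidth=1)  # the zero-error line
plt.xlabel("Years")
plt.ylabel("Residual")
plt.show()
```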
Dummy Variables
Multiple regression
gives us the capability to incorporate a very useful technique to stratify the
data in our attempt to build more reliable and accurate prediction models. We
stratify data by using dummy variables (also called categorical or qualitative
variables).
Recall that the Standard Error of the Estimate is $8,132 for the linear model
above. One reason the error may be so high is that I am including both
male and female faculty members in the database. If there is salary
discrimination, combining males and females would lead to high error. One
solution is to run two separate models - one for females and one for males.
Another option is to stratify or categorize males and females in one regression
model. The benefits of stratification in one regression model are that we can
picture any differences in the line fit plots, and we can test for interaction,
which I cover last in this note set.
The way we stratify data is to categorize it by using a dummy variable. In this
case, let's let X represent a variable called Gender. Its values are:
X = 1 if the faculty member is male
X = 0 if the faculty member is female
Which gender gets the 1 and which
gets the 0 is arbitrary, but only use 1 and 0 for the two categories (for
example, do not use 2 and 1). Note that by using 0, the intercept will now have
an interpretation. Worksheet 3.2.9 illustrates the data for a regression model
with one dependent variable, Salary, and one independent variable, gender.
Worksheet 3.2.9
Gender | Salary
1 | 72000
0 | 68000
1 | 66000
0 | 64000
1 | 64000
1 | 62000
1 | 61000
0 | 60000
1 | 60000
0 | 59000
1 | 59000
1 | 59000
1 | 58000
0 | 57000
1 | 56000
0 | 55000
1 | 55000
0 | 52000
0 | 51000
0 | 50000
1 | 45000
0 | 44000
1 | 44000
0 | 42000
0 | 41000
0 | 34000
1 | 34000
1 | 30000
0 | 25000
0 | 22000
The first faculty member is a male making $72,000.
Worksheet 3.2.10 provides the regression Summary Output.
Worksheet 3.2.10
SUMMARY OUTPUT

Regression Statistics
Multiple R | 0.26475648
R Square | 0.07009599
Adjusted R Square | 0.03688513
Standard Error | 12692.7050
Observations | 30

ANOVA
 | df | SS | MS | F | Significance F
Regression | 1 | 340033333.3 | 340033333.3 | 2.1106349 | 0.15739443
Residual | 28 | 4510933333 | 161104761.9 | | |
Total | 29 | 4850966667 | | | |

 | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | 48266.6666 | 3277.242356 | 14.7278295 | 1.03073E-14 | 41553.53248 | 54979.80085
Gender | 6733.33333 | 4634.720587 | 1.45280243 | 0.157394431 | -2760.47207 | 16227.13875
The regression Summary Output provides the following regression equation:
Eq. 3.2.6: E(y) = b0 + b1x; for this sample:
Salary = $48,267 + $6,733 (Gender)
Recall that males were given the category 1 and females 0, so we can now make our point estimates of average male and female salaries:
Eq. 3.2.7: Salary = $48,267 + ($6,733 * 1) = $55,000 for males;
Salary = $48,267 + ($6,733 * 0) = $48,267 for females
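A useful way to see what the dummy regression is doing: with a single 0/1 predictor, the fitted values are just the two group means. Here is a sketch, reusing the arrays from the earlier sketch; the Gender column is from Worksheet 3.2.9:

```python
# Sketch: Salary regressed on the Gender dummy alone (Worksheet 3.2.10).
gender = np.array([1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1,
                   0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0])

fit = sm.OLS(salary, sm.add_constant(gender)).fit()
print(fit.params)                  # ~ [48266.67, 6733.33]
print(salary[gender == 0].mean())  # 48266.67 -- the "intercept" (female mean)
print(salary[gender == 1].mean())  # 55000.00 -- intercept + "slope" (male mean)
```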
The regression parameters
take on a different interpretation for dummy variables. The
"intercept" is the average salary made by females, and the
"slope" is the difference between the average salaries made by males
and females. Another difference with dummy variables is the line fit plot,
since X only takes on values of 0 and 1. The line fit plot will show the
cluster of female and male faculty actual observations at X = 0 and X = 1, and
the prediction (average value for Y when X = 1, and average value for Y when X
= 0) as points. Worksheet 3.2.11 illustrates the line fit plot.
Worksheet 3.2.11
Generally, we use a dummy variable as a stratification variable along with one
or more other independent variables. Indeed, you can see from the regression
summary output that the model with gender alone does not have statistical
utility (the p-value of 0.157 exceeds an alpha of 0.05) or practical utility
(R Square is only 0.07). We will combine the dummy variable with the experience
independent variable in the next section to see if our results improve; I used
the dummy variable by itself simply to introduce the concept.
Dummy variables are used to categorize data in models where there are
attributes such as in season/out of season, large/small, and defective/not
defective. You will be asked to incorporate a dummy variable in Assignment 3.
If the characteristic being modeled has more than two levels, we need to use
more than one dummy variable. For example, what if you wanted to model Fall,
Winter, Spring, and Summer? Then X1 could represent Fall: if the
observation is a Fall observation, it gets the value 1 in data entry, 0 otherwise;
X2 is used for Winter, and X3 for Spring. There
would be no dummy variable for Summer. Do you know why? Did I hear, "the
intercept becomes Summer"? That's right!
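Here is a sketch of that encoding, assuming the pandas library (the season labels below are just illustrative data, not from our faculty example):

```python
# Sketch: a four-level factor needs only three 0/1 dummies; the omitted
# level (Summer) is represented by all three dummies being 0, so its mean
# is absorbed by the intercept.
import pandas as pd

seasons = pd.Series(["Fall", "Winter", "Spring", "Summer", "Fall", "Summer"])
dummies = pd.get_dummies(seasons, dtype=int)[["Fall", "Winter", "Spring"]]  # drop Summer
print(dummies)  # Summer rows show 0, 0, 0
```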
For Assignment 3, we will be adding just one dummy variable, so keep to two levels
(male/female, in season/out of season, large/small, etc.). The next modeling concept is interaction.
Interaction
When there are two
independent variables, the relationship between Y and X1 may depend
on X2: that dependency is called interaction. When the relationship
between Y and X1 does not depend on X2 we say there is no
interaction. I am going to illustrate the "no interaction" case
first, since I can use the data in the faculty salary example. I will introduce
a new data set to demonstrate "interaction" after that.
The hypothesized population equation to model interaction is:
Eq. 3.2.8: E(Y) = B0 + B1X1 + B2X2 + B3X1X2
The cross-product term, X1X2, is the interaction term, so B3 in Equation 3.2.8
is the slope of interest for testing interaction.
To model interaction with sample data, we multiply the two independent
variables to make a new variable. Worksheet 3.2.12 illustrates multiplying the
contents of the cells in Column A (Years) by the cells in Column B (Gender)
to make a new variable, Yrs*Gndr (see the code sketch after the worksheet).
Worksheet 3.2.12
Years | Gender | Yrs*Gndr | Salary
13 | 1 | 13 | 72000
13 | 0 | 0 | 68000
10 | 1 | 10 | 66000
10 | 0 | 0 | 64000
14 | 1 | 14 | 64000
8 | 1 | 8 | 62000
15 | 1 | 15 | 61000
11 | 0 | 0 | 60000
9 | 1 | 9 | 60000
15 | 0 | 0 | 59000
5 | 1 | 5 | 59000
12 | 1 | 12 | 59000
11 | 1 | 11 | 58000
6 | 0 | 0 | 57000
7 | 1 | 7 | 56000
12 | 0 | 0 | 55000
6 | 1 | 6 | 55000
9 | 0 | 0 | 52000
14 | 0 | 0 | 51000
7 | 0 | 0 | 50000
3 | 1 | 3 | 45000
3 | 0 | 0 | 44000
4 | 1 | 4 | 44000
4 | 0 | 0 | 42000
8 | 0 | 0 | 41000
5 | 0 | 0 | 34000
2 | 1 | 2 | 34000
1 | 1 | 1 | 30000
2 | 0 | 0 | 25000
1 | 0 | 0 | 22000
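Here is the sketch referred to above, continuing the earlier arrays: in code, the interaction column is one elementwise multiplication, and the fitted coefficients match Worksheet 3.2.14.

```python
# Sketch: build Yrs*Gndr and fit the interaction model of Eq. 3.2.8
# ("years", "gender", "salary", np, and sm come from the earlier sketches).
X = np.column_stack([years, gender, years * gender])
inter = sm.OLS(salary, sm.add_constant(X)).fit()
print(inter.params)      # ~ [28809.5, 2432.1, 8619.0, -235.7]
print(inter.pvalues[3])  # ~ 0.717 -- the interaction slope is not significant
```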
Worksheet 3.2.14 provides the regression Summary Output.
Worksheet 3.2.14
SUMMARY OUTPUT

Regression Statistics
Multiple R | 0.83065732
R Square | 0.68999158
Adjusted R Square | 0.65422138
Standard Error | 7605.26254
Observations | 30

ANOVA
 | df | SS | MS | F | Significance F
Regression | 3 | 3347126190 | 1115708730 | 19.289563 | 8.62451E-07
Residual | 26 | 1503840476 | 57840018.32 | | |
Total | 29 | 4850966667 | | | |

 | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | 28809.5238 | 4132.381497 | 6.97165153 | 2.10919E-07 | 20315.28642 | 37303.76119
Years | 2432.14285 | 454.5013685 | 5.35123329 | 1.33311E-05 | 1497.901302 | 3366.384412
Gender | 8619.04761 | 5844.069958 | 1.47483648 | 0.15226382 | -3393.618092 | 20631.71333
Yrs*Gndr | -235.71428 | 642.7619995 | -0.3667209 | 0.71679485 | -1556.931363 | 1085.502792
The coefficients for the
multiple regression model with interaction are provided in the rows labeled
Intercept, Years, Gender, and Yrs*Gndr. The equation based on this sample is:
Eq. 3.2.9: E(y) = b0 + b1x1 + b2x2 + b3x1x2; here
Salary = 28809.5 + 2432.1 (Years) + 8619 (Gender) - 235.7 (Yrs*Gndr)
To determine if interaction is important, we can proceed directly to the hypothesis test for interaction:
H0: B3 = 0 (interaction is not important)
Ha: B3 ≠ 0 (interaction is important)
Since the p-value (0.717) for
the t-Stat on the Yrs*Gndr term is greater than any reasonable alpha (such as
the 0.05 we used earlier), we do not reject the null hypothesis and conclude
that interaction is not important. The analyst should then remove the
interaction column of data and rerun the regression model without interaction
to see if years, gender, or both are important in predicting salary.
Worksheet 3.2.15 illustrates what "no interaction" looks like:
Worksheet 3.2.15
This line fit plot illustrates the relationship between Salary and Years. But
note that there are two regression lines. The two regression lines represent
the stratification of the data - one line for males and one for females. Can
you guess which is the top line? Right - the male regression line is the top
line. We learned that when we examined the gender variable in the previous
section: males make more than females. There was, in fact, salary
discrimination at the university where this data sample came from.
But because the two regression lines have approximately the same slope, we can
predict the increase in salary for males or females based on time alone.
Males make more than females - but it is a fixed amount, because the lines are
parallel. Thus, the formal definition of "no interaction" for this problem is:
the relationship between Salary and Years does not depend on gender. Recall
that statistical relationships between variables in regression are measured
through the slopes. The slope on Years is $2,432 for both males and females,
since the Yrs*Gndr term goes away (the interaction slope is not significantly
different from zero).
If there were interaction, the slopes would be different for males
and females. As a hypothetical example, what if Equation 3.2.9
came out as follows:
Eq. 3.2.10: Salary = 28809.5 + 2432.1 (Years) + 8619 (Gender) + 1000 (Yrs*Gndr)
Now note what happens when we substitute 1 (for males) in the gender term:
Eq. 3.2.11: Male Salary = 28809.5 + 2432.1 (Years) + 8619 (1) + 1000 (1 * Years) = 37428.5 + 3432.1 (Years)
...and then 0 (for females) in the gender term:
Eq. 3.2.12: Female Salary = 28809.5 + 2432.1 (Years) + 8619 (0) + 1000 (0 * Years) = 28809.5 + 2432.1 (Years)
In my hypothetical example,
the slope of the male regression line is 3432.1, meaning that male salaries
increase $3,432 per year. The slope of the female regression line is 2432.1,
meaning that female salaries increase only $2,432 per year. Since the slopes
are different, we cannot predict the change in salary without knowing the
gender. Thus, we would say there is interaction: the relationship between
salary and years depends on whether the faculty member is male or
female.
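You can verify this arithmetic with a few lines of code (a sketch of the hypothetical Eq. 3.2.10 only; these are the made-up coefficients above, not the fitted ones):

```python
# Sketch: under the hypothetical Eq. 3.2.10, one extra year of experience is
# worth a different amount for each gender -- that slope difference IS the
# interaction.
def hypothetical_salary(years, gender):
    return 28809.5 + 2432.1 * years + 8619 * gender + 1000 * years * gender

print(hypothetical_salary(11, 1) - hypothetical_salary(10, 1))  # 3432.1 (male slope)
print(hypothetical_salary(11, 0) - hypothetical_salary(10, 0))  # 2432.1 (female slope)
```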
The above illustration of interaction was hypothetical. There really was no
interaction, but there really was discrimination - males made more on average
than females, but that amount was constant for all levels of experience (no
interaction). If the amount had not been constant, there would have been
interaction.
Let's examine an actual model of interaction so you can see an actual picture
and the equations for interaction. This example is an experiment done by an
over-the-counter drug manufacturer interested in testing a new drug designed to
be effective in curing an illness most commonly found in older patients. The
dependent variable is Effect, which stands for the effectiveness of recovery
from a certain illness; it is measured on a scale of 0 to 100 (100 is more
effective). The quantitative independent variable is Age, and the qualitative
independent variable indicates whether the subject took the new drug
(Drug = 0) or the old drug (Drug = 1). Note that half of the subjects were
given the new drug and half the old drug (they were not told which, to avoid
bias in the experiment). This is called a blind experiment, since the
subjects did not know which drug they were taking. An experiment is called a
double blind experiment if the drug administrator does not know which
drug is being administered, as well.
Worksheet 3.2.16 shows the data entry.
Worksheet 3.2.16
Age | Drug | Age*Drug | Effect
21 | 1 | 21 | 56
19 | 0 | 0 | 28
28 | 1 | 28 | 55
23 | 0 | 0 | 25
67 | 0 | 0 | 71
33 | 1 | 33 | 63
33 | 1 | 33 | 52
56 | 0 | 0 | 62
45 | 0 | 0 | 50
38 | 1 | 38 | 58
37 | 0 | 0 | 46
27 | 0 | 0 | 34
43 | 1 | 43 | 65
47 | 0 | 0 | 59
48 | 1 | 48 | 64
53 | 1 | 53 | 61
29 | 0 | 0 | 36
53 | 1 | 53 | 69
58 | 1 | 58 | 73
63 | 1 | 63 | 62
59 | 0 | 0 | 71
51 | 0 | 0 | 62
67 | 1 | 67 | 70
63 | 0 | 0 | 71
Worksheet 3.2.17 provides the regression summary.
Worksheet 3.2.17
SUMMARY OUTPUT

Regression Statistics
Multiple R | 0.9655163
R Square | 0.9322218
Adjusted R Square | 0.9220551
Standard Error | 3.8772477
Observations | 24

ANOVA
 | df | SS | MS | F | Significance F
Regression | 3 | 4135.297333 | 1378.43244 | 91.6934647 | 7.33335E-12
Residual | 20 | 300.6610008 | 15.0330500 | | |
Total | 23 | 4435.958333 | | | |

 | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | 6.2113811 | 3.308965648 | 1.8771368 | 0.07516468 | -0.69099698 | 13.1137593
Age | 1.0333908 | 0.071447502 | 14.463639 | 4.70078E-12 | 0.884354064 | 1.18242767
Drug | 41.304210 | 5.022786373 | 8.2233658 | 7.5975E-08 | 30.82686622 | 51.7815540
Age*Drug | -0.702883 | 0.10763568 | -6.530210 | 2.30245E-06 | -0.92740760 | -0.47835962
The Intercept, Age, Drug,
and Age*Drug regression coefficients provide the equation from this sample:
Eq. 3.2.13: Effect = 6.2 + 1.03 (Age) + 41.3 (Drug) - 0.7 (Age*Drug)
Let's look at the equation for subjects who got the old drug (Drug = 1):
Eq. 3.2.14: Effect = 6.2 + 1.03 (Age) + 41.3 (1) - 0.7 (Age*1)
Effect = 47.5 + 0.33 (Age)
Now, look at the equation for subjects who got the new drug (drug = 0):
Eq. 3.2.15: Effect = 6.2 + 1.03 (Age) + 41.3 (0) - 0.7 (Age*0)
Effect = 6.2 + 1.03 (Age)
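Since the interaction term is significant here, the two group equations really do have different slopes. Here is a sketch, again assuming numpy and statsmodels, that recovers Eqs. 3.2.14 and 3.2.15 from the fitted coefficients using the Worksheet 3.2.16 data:

```python
# Sketch: fit Eq. 3.2.13 and collapse it into the two group-specific lines.
import numpy as np
import statsmodels.api as sm

age = np.array([21, 19, 28, 23, 67, 33, 33, 56, 45, 38, 37, 27,
                43, 47, 48, 53, 29, 53, 58, 63, 59, 51, 67, 63])
drug = np.array([1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
                 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0])
effect = np.array([56, 28, 55, 25, 71, 63, 52, 62, 50, 58, 46, 34,
                   65, 59, 64, 61, 36, 69, 73, 62, 71, 62, 70, 71])

X = sm.add_constant(np.column_stack([age, drug, age * drug]))
b0, b1, b2, b3 = sm.OLS(effect, X).fit().params
print(f"Old drug (Drug=1): Effect = {b0 + b2:.1f} + {b1 + b3:.2f} (Age)")  # 47.5 + 0.33 (Age)
print(f"New drug (Drug=0): Effect = {b0:.1f} + {b1:.2f} (Age)")            # 6.2 + 1.03 (Age)
```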
Subjects who got the new drug
would enjoy nearly a 1-unit increase in drug healing effectiveness for each
year of age. On the other hand, subjects who got the old drug enjoy only a
0.33-unit increase in effectiveness for each year of age. The drug manufacturer
would be excited about this finding - the new drug is more effective as one
ages. Statistically, we say there is interaction: the relationship between
effectiveness and age depends on whether or not the subject is taking the new
drug. The slopes of the regression equation change when there is interaction.
Worksheet 3.2.18
In the Worksheet 3.2.18 Line Fit Plot, note the steeper sloping line that
begins with the observation of a subject aged about 20, with an effectiveness
score just below 30. This is the line of Equation 3.2.15 - the one with the
steeper slope, showing more effectiveness with age than the line of Equation
3.2.14. The drug manufacturer can then tell that for subjects over about 60,
the new drug should be more effective in healing as patients age. I
will use this example again in Module Notes 3.2.3 when we discuss model
building; I will go over statistical and practical utility, testing assumptions,
and prediction with this model at that time. This illustration was simply used
to show a picture of interaction.
That's it! We have now talked about the separate components of multiple
regression: adding a quantitative independent variable, adding a qualitative
(dummy) independent variable, and adding an interaction component. There can be
more complications - adding curvature, adding several more quantitative
variables, and studying three-way interactions - but you have the basic building
blocks of multiple regression involving a quantitative dependent variable. Note
that because of the principle of parsimony, I would suggest caution before
building a model that has so many terms it is impossible to interpret.
There are model extensions for qualitative dependent variables but they go
beyond the scope of this course. All
that remains for our study of multiple regression is a system for knowing how
to put the separate components together in a single model. That is the subject
of Module Notes 3.3.