"Introduction to Multiple Regression" |
Multiple regression and correlation analysis is
similar to simple linear regression and correlation in that it
involves the study of the form, direction and strength of
relationships. But with multiple regression, the relationship is
between the dependent variable and multiple independent variables
rather than just one independent variable.
The seven steps we used to study simple linear regression in Module 2
will be used for multiple regression. They are repeated here for
convenience.
1. Hypothesize regression model: E(Y) = B0 + B1X1 + B2X1² + B3X2 + B4X1X2
2. Gather data and examine relationships with graphs (scatter diagrams)
3. Estimate regression model parameters
4. Test practical utility of regression model
5. Test statistical utility of regression model and multiple independent terms
6. Evaluate assumptions of regression model
7. Use model for prediction
Note that the first step shows the addition of another independent
variable, X2. That variable could be another numerical
variable or a categorical variable used to stratify the data.
Examples of categorical or qualitative variables include male/female,
in-season/out of season, and defective/not defective. The first step
also shows that X1 has been raised to the 2nd
power. This lets us examine upward or downward curvature in
the data. The final term that has been added is a mathematical way of
capturing a concept called interaction. Interaction allows us to
model situations where the relationship between Y and X1
is said to depend on X2.
The objective of Module 3 is to introduce model building skills that
will enable you to model relationships involving multiple numerical
and categorical variables, curvature and interaction. Module 3.2
Notes will cover incorporating categorical variables, curvature and
interaction separately. Module 3.3 Notes will cover model-building
skills that put everything together.
The purpose of Module 3.1 Notes is to introduce multiple regression
by the addition of just one more numerical independent variable.
Module Notes 3.1, 3.2 and 3.3 are designed to prepare you for
Assignment 3. These note sets relate to optional readings in
Chapter 14 of the Levine reference and Chapter 13 in the Mason
reference given at the end of the note sets.
Step 1: Hypothesize the
regression model
Equation 3.1.1 illustrates a model
similar to that used in Module 2: a model to predict the external
hours needed to audit a client based on client assets and the
years of experience of the audit leader (abbreviated Exp in this
illustration). My idea, or hypothesis, is that as client assets
increase, external hours increase at a constant rate (positive linear
relationship); and as the years of experience of the audit leader
increase, external hours decrease at a constant rate (negative linear
relationship).
Eq. 3.1.1: E(Y) = B0 + B1X1 - B2X2; or
E(ExtHours) = B0 + B1(Assets) - B2(Exp)
Caution
When we add a second (or many) independent numerical variable(s), we want to add independent variables that correlate strongly with the dependent variable, but weakly with the other independent variables already in the model. As a rule of thumb, I like to make sure the correlation coefficient between any of the pairs of independent variables is less than +0.7 and greater than -0.7. If two or more independent variables are strongly related to each other, we get what's called multicollinearity. This condition makes it difficult to determine the contribution of the individual independent variables, and also adversely affects model interpretation.
Step 2: Gather Data
When you are exploring whether or not
independent variables will contribute to effective predictive models,
you may wish to run separate scatter diagrams between Y and each of
the independent variables, and determine the correlation between Y
and each of the independent variables. Here is the data I have
selected for this example. Assets are in $000s, as in Module 2; Exp
(Experience of the audit team leader) is in years; and ExtHours
(external hours to perform the audit) are in hours. Assets and
Experience are the independent variables being used to predict
External Hours, the dependent variable.
Worksheet 3.1.1
Assets   Exp   ExtHours
3200     8     700
3000     7     900
3500     9     800
4000     6     900
4700     8     880
5000     6     850
6000     5     1000
5500     6     1000
6500     7     950
6500     4     1100
7000     5     1200
4500     9     830
7500     3     1100
5750     6     1100
7250     5     1160
8000     8     1120
8000     4     1300
8500     4     1400
9000     7     1500
8500     2     1200
Next, I ran the scatter diagrams relating
External Hours to Assets (Worksheet 3.1.2), and External Hours to
Experience (Worksheet 3.1.3).
Worksheet 3.1.2 (scatter diagram: External Hours vs. Assets)
Worksheet 3.1.3 (scatter diagram: External Hours vs. Experience)
Worksheet 3.1.2 shows the moderately strong positive linear
relationship between External Hours and Assets. Worksheet 3.1.3 does
show a negative linear relationship, but not as strong as we would
like. Remember that we use the correlation coefficient, r, to measure
strength of the linear relationship between two variables. While we
can take the square root of the coefficients of determination, there
is an Excel Data Analysis Add-In that lets us create all of the
pair-wise correlation coefficients at once. Worksheet 3.1.4 shows the
result. To get the correlation matrix, select Tools from the
Standard Toolbar, Data Analysis from the pulldown menu,
Correlation, then respond to the dialog box entry questions.
If the data in Worksheet 3.1.1 above started in Column A, Row 1, and
extended through Column C, Row 21, the input range for the first
dialog box is A1:C21. I remember to check the Labels box, then
place the output near my data, such as D1. Click OK and
Worksheet 3.1.4 shows the results:
Worksheet 3.1.4
           Assets      Exp         ExtHours
Assets     1
Exp        -0.605951   1
ExtHours   0.90407     -0.565325   1
Note that the correlation between the two independent variables,
Assets and Experience (-0.606), is comfortably within the +/- 0.7 rule
of thumb: that's good. But also note the fairly weak relationship
between the dependent variable, External Hours, and Experience
(-0.565): that's not so good if we want to improve our prediction of
External Hours.
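If you would rather check the pairwise correlations outside Excel, here is a minimal Python sketch. It assumes pandas is installed and that the Worksheet 3.1.1 data has been saved to a file named audit.csv (a hypothetical file name):

import pandas as pd

# Load the Worksheet 3.1.1 data (hypothetical file; columns Assets, Exp, ExtHours).
df = pd.read_csv("audit.csv")

# Pairwise correlation matrix, matching Worksheet 3.1.4.
print(df.corr())

# Rule-of-thumb check from the Caution above: flag independent-variable
# pairs with |r| of 0.7 or more (possible multicollinearity).
r = df["Assets"].corr(df["Exp"])
print("Assets-Exp correlation:", round(r, 3), "| multicollinearity flag:", abs(r) >= 0.7)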
Model builders like to use the principle of parsimony, or
keeping models as simple as possible (I've heard this called
KIS, for keep it simple, but prefer the more academic term!).
When we build multiple regression models, we do run the risk of
making something too complicated to interpret and understand,
especially when the addition of a variable does not contribute to the
prediction of the dependent variable. I want to continue with the
addition of a fairly weak variable to illustrate this point.
Pause and Reflect
We are introducing a subject that involves building bigger models than the simple linear regression model. However, we need to be careful that we use good science in building these bigger models. Just because a model is bigger doesn't make it better - sometimes it's just bigger.
Step 3: Estimate the
Regression Parameters
To estimate the regression parameters
and generate all of the regression statistics and graphs, use the
Regression Data Analysis Add-In as we did in Module 2. Select
Tools from the Standard Toolbar, Data Analysis from the
pulldown menu, Regression, and then respond to the dialog
questions as you did in building a simple linear regression model.
The difference is that the X range will include highlighting all of
the columns containing the X variables rather than just one column.
Caution: when you load data, be sure to put all of the X
variables in contiguous columns to allow their selection at one time
(Note Worksheet 3.1.1 shows the two independent variables in adjacent
columns). Worksheet 3.1.5 illustrates the Regression Summary.
Worksheet 3.1.5
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.904337925
R Square            0.817827082
Adjusted R Square   0.796394974
Standard Error      93.69957022
Observations        20

ANOVA
             df    SS            MS            F          Significance F
Regression   2     670041.6392   335020.8196   38.15897   5.18E-07
Residual     17    149253.3608   8779.609459
Total        19    819295

            Coefficients    Standard Error   t Stat         P-value    Lower 95%   Upper 95%
Intercept   468.7842299     153.5922957      3.052133753    0.007206   144.7324    792.8361
Assets      0.098138845     0.014392674      6.818666744    2.99E-06   0.067773    0.128505
Exp         -2.931174966    13.79061295      -0.212548563   0.834207   -32.02687   26.16452

RESIDUAL OUTPUT
Observation   Predicted ExtHours   Residuals      Standard Residuals
1             759.3791346          -59.37913462   -0.669959585
2             742.6825406          157.3174594    1.774972649
3             785.8896132          14.1103868     0.159203884
4             843.7525607          56.24743933    0.634625468
5             906.5874023          -26.58740234   -0.299978859
6             941.8914058          -91.89140582   -1.036787223
7             1042.961426          -42.96142593   -0.484722778
8             990.9608284          9.039171611    0.101986661
9             1086.168499          -136.1684986   -1.536354334
10            1094.962023          5.037976531    0.0568422
11            1141.100271          58.89972892    0.664550573
12            884.0284583          -54.02845834   -0.609589273
13            1196.032044          -96.03204358   -1.08350498
14            1015.49554           84.50446032    0.953442207
15            1165.634982          -5.634982363   -0.063578064
16            1230.445591          -110.4455913   -1.246129352
17            1242.170291          57.82970881    0.65247781
18            1291.239714          108.7602862    1.227114486
19            1331.515611          168.4843886    1.900966253
20            1297.102064          -97.1020637    -1.095577743

PROBABILITY OUTPUT
Percentile   ExtHours
2.5          700
7.5          800
12.5         830
17.5         850
22.5         880
27.5         900
32.5         900
37.5         950
42.5         1000
47.5         1000
52.5         1100
57.5         1100
62.5         1100
67.5         1120
72.5         1160
77.5         1200
82.5         1200
87.5         1300
92.5         1400
97.5         1500
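If you want to reproduce these estimates outside Excel, the sketch below uses Python with pandas and statsmodels (an alternative to the Regression add-in, not part of the original notes), with the Worksheet 3.1.1 data entered inline:

import pandas as pd
import statsmodels.api as sm

# Worksheet 3.1.1 data: Assets ($000s), Exp (years), ExtHours (hours).
df = pd.DataFrame({
    "Assets": [3200, 3000, 3500, 4000, 4700, 5000, 6000, 5500, 6500, 6500,
               7000, 4500, 7500, 5750, 7250, 8000, 8000, 8500, 9000, 8500],
    "Exp": [8, 7, 9, 6, 8, 6, 5, 6, 7, 4, 5, 9, 3, 6, 5, 8, 4, 4, 7, 2],
    "ExtHours": [700, 900, 800, 900, 880, 850, 1000, 1000, 950, 1100, 1200,
                 830, 1100, 1100, 1160, 1120, 1300, 1400, 1500, 1200],
})

X = sm.add_constant(df[["Assets", "Exp"]])   # adds the intercept term B0
fit = sm.OLS(df["ExtHours"], X).fit()        # ordinary least squares
print(fit.summary())   # coefficients, adjusted R Square, F, t stats, p-values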
Look in the intercept, Assets and
Experience Rows, in the Coefficients Column, to obtain the regression
coefficients. These are repeated in Equation 3.1.2.
Eq. 3.1.2: ExtHours = 468.8 + 0.10 (Assets) - 2.9 (Experience)
The intercept, 468.8, is interpreted as in simple linear regression:
external hours equal 468.8 when assets and experience are both zero.
Since we did not have any observations where assets and experience
equal zero, the intercept has no practical interpretation.
The slope of 0.10 for assets means that external hours increase by
0.10 when assets increase by one ($1,000 since assets are in $000s),
holding experience constant (that is, for interpretations involving
team leaders with similar experience). The slope of -2.9 means that
external hours decrease by 2.9 when experience increases by 1 year,
holding assets constant (at similar levels).
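For example, if a client's assets grow by 500 (that is, $500,000, since assets are in $000s), the model predicts an increase of 0.10 * 500 = 50 external hours for team leaders with similar experience.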
Step 4: Test Practical Utility
of Regression Model
As with simple linear regression, we
use R2 and Standard Error of the Model to test practical
utility of the model.
Coefficient of Multiple Determination, Adjusted R2
The R2 in multiple regression is called the
coefficient of multiple determination. It is safest to use the
Adjusted R2 (called Adjusted R Square in Excel), as this
statistic adjusts for the number of independent variables in relation
to sample size to prevent overestimating the amount of variability
explained by the independent variables. A rule of thumb in multiple
regression is that we should have at least 10 observations for each
regression parameter. This model has three regression parameters (the
intercept B0 and the two slopes, B1 and
B2) so we should have 30 observations. We only have 20 so
the adjusted R2 is a bit lower than the R2 -
sort of like keeping us honest. The formula used by Excel to compute
the adjusted R2 is:
Eq. 3.1.3: R2adjusted = 1 - { (1 - R2) * [ (n - 1) / (n - k - 1) ] }, where n is the sample size and k is the number of independent variables
From this formula, you can see that if the sample size were 1,000,
the difference between the adjusted R Square and R Square would be
negligible.
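As a quick check with this model's numbers (n = 20, k = 2): R2adjusted = 1 - (1 - 0.8178) * (19 / 17) ≈ 0.796, matching the Excel output.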
The adjusted R Square is found in the Regression Statistics near the
top of Worksheet 3.1.5. It is 0.796. The interpretation: assets and
experience explain 79.6% of the sample variation in external hours.
As in simple linear regression, we would like this to be as close to
100% as possible. A rule-of-thumb benchmark is an adjusted R2 of at
least 50%. If you are comparing two models, the model that is most
practically useful, if everything else is equal, is the model with
the highest adjusted R2.
Standard Error of the Model
The Standard Error of the Model or Estimate is found directly
below the adjusted R2. It is 93.7. The interpretation: we
would expect 95% (or most) of the actual values of external hours to
be within +/- 2 * 93.7 of the predicted external hours. As with
simple linear regression, we want the standard error to be as low as
possible. A benchmark: the standard error (93.7) divided by actual
values of Y (external hours) should fall between 0.05 and 0.15, or
5% to 15%. If you are given two models, the
model with the lowest standard error, if everything else is equal, is
the model with the most practical utility.
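As a quick check against this data: 93.7 divided by the largest and smallest actual external hours gives 93.7 / 1500 ≈ 0.06 and 93.7 / 700 ≈ 0.13, both within the 5% to 15% range.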
Pause and Reflect
Although testing practical utility involves benchmarks and judgment, it is very important. Models that are not judged practically useful should not be used, even if they are proven to be statistically significant or useful. A model could have a very steep slope and thus be statistically significant, but it might have too much error around the regression line to be of practical use.
Step 5: Test Statistical Utility
There are multiple hypothesis tests in multiple regression: not only
can we test the statistical utility of the model, we can also test
the statistical utility of each individual independent variable.
Is the Model Statistically Significant or
Useful?
The null and alternative hypotheses to test for model statistical
utility are:
H0: B1 = B2 = 0 (the model is not statistically useful)
Ha: At least one B ≠ 0 (the model is statistically useful)
The test statistic for this test of hypothesis is the F, which has a value of 38.16 (look in the Regression row of the ANOVA table in Worksheet 3.1.5, under the F column). Recall, in concept, the F measures the ratio of variation explained by the regression model to the variation unexplained. The farther this number is above 1, the more variation is explained relative to the unexplained. The p-value for the F is 5.18E-07, or 0.000000518 (look in the Regression row of the ANOVA table in Worksheet 3.1.5, under the Significance F column). Since the p-value of 5.18E-07 is less than an alpha of 0.01, we reject the null hypothesis and conclude the model is statistically useful.
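As a cross-check, the F statistic and its p-value can be recomputed from the ANOVA mean squares; here is a minimal Python sketch (assuming scipy is installed):

from scipy import stats

# Mean squares from the ANOVA table in Worksheet 3.1.5.
ms_regression = 335020.8196   # explained variation per regression df
ms_residual = 8779.609459     # unexplained variation per residual df

f_stat = ms_regression / ms_residual     # about 38.16
p_value = stats.f.sf(f_stat, 2, 17)      # upper-tail area with df = (2, 17)
print(f_stat, p_value)                   # p-value is about 5.2E-07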
Attention to Detail
Did you notice that I used an alpha of 0.01, whereas I had been using the convention of an alpha of 0.05 before now? We are entering a new hypothesis test procedure in multiple regression where we do multiple hypothesis tests on the same data (with this example, we could actually do three tests: a model test, a test on X1, and a test on X2). Each test with the same data theoretically increases the chance of rejecting a true null hypothesis (a Type I error). So, research convention suggests that we lower the alpha to 0.01 when we do multiple tests with the same data.
Are Assets Statistically Significant or
Useful?
We can also test for statistical significance of the individual
independent variables. The null and alternative hypotheses to test
for the importance of assets are:
H0: B1 = 0 (assets are not important or not statistically useful)
Ha: B1 ≠ 0 (assets are important or statistically useful)
The test statistic is the t, which has a value
of 6.82. This is found in the Assets row of the regression summary
in Worksheet 3.1.5. Its value implies that the slope of 0.098 or
0.10 is 6.82 standard errors of the slope from zero. The associated
p-value is 2.99E-06. Since the p-value is less than an alpha value of
0.01, reject the null hypothesis and conclude that assets are
important in predicting the value of external hours.
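As a quick check, t = b1 / SE(b1) = 0.098139 / 0.014393 ≈ 6.82, matching the t Stat in Worksheet 3.1.5.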
Is Experience Statistically Significant or Useful?
We can similarly test for the statistical significance of the
experience variable.
H0: B2 = 0 (experience is not important or not statistically useful)
Ha: B2 ≠ 0 (experience is important or statistically useful)
The t statistic for this test is found in the
Experience row of the regression summary. Its value is -0.21, with an
associated p-value of 0.834. Since the p-value is greater than an
alpha of 0.01, we fail to reject the null hypothesis and conclude
that experience is not useful in predicting external hours. We added
a variable that has no statistical utility and it should be removed
from the model, at least in its current form. Another variable might
be considered, or the analyst may just decide to go with assets for
predicting external hours.
Caution
When I say that Experience is not of statistical utility or important in helping to predict External Hours, we must realize that we are only testing experience in its linear form at this point. There may be a slight chance that experience is important in a curvilinear relationship with external hours. We will talk more about this in Module 3.2 Notes.
A last point before we go on to Step 6. Note in
the Experience row of Worksheet 3.1.5 that the Lower 95% and Upper
95% Confidence Levels are -32.03 and 26.16 for the slope coefficient
of -2.93. The interpretation: we are 95% confident that when
experience increases by 1 year, the change in predicted external
hours is between a decrease of 32 hours and an increase of 26 hours.
Since sometimes external hours decrease, and sometimes external hours
increase when experience increases by one, knowledge of experience
does not help in predicting external hours. The 95% confidence
interval on the slope confirms why we did not reject the null
hypothesis of no statistical relationship between experience and
external hours.
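As a quick check, the interval is the slope plus or minus the critical t for 17 degrees of freedom (about 2.11) times the slope's standard error: -2.93 +/- 2.11 * 13.79, or about -32.0 to 26.2, matching Worksheet 3.1.5.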
Step 6: Evaluate Assumptions
of Regression Model
The assumptions and their evaluation are the same as in simple linear
regression. The error or residual term should be normally distributed
with mean of zero; the error should be constant for all values of the
independent variables; and the error should be independent.
We evaluate the first assumption by examining the standardized
residuals to see if any error terms are more than three standard
errors from the regression line. The normal probability plot can also
be used. The idea is that this assumption is generally met unless
there are extreme values that would cause a skew in the data. In the
Residual Output section of Worksheet 3.1.5, in the column "Standard
Residuals," we find no value greater than +3 or less than -3. Thus,
we would declare that this assumption is met.
The second and third assumptions are evaluated by looking at the
residual plots. Worksheet 3.1.6 illustrates one of the two residual
plots (one residual plot for each independent variable will be
automatically produced by the regression add-in to Excel). This plot
shows that the error terms do not exhibit a distinct pattern and are
fairly constant in their departure from zero for the various values
of the independent variable. Both residual plots would have to be so
evaluated. As long as there is fairly constant departure from zero
for all values of the independent variable, and as long as there is
no distinct pattern such as a cone or football or curve, the second
and third assumptions are met.
Worksheet 3.1.6 (residual plot: residuals vs. an independent variable)
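Outside Excel, residual plots like Worksheet 3.1.6 take only a few lines; here is a minimal Python sketch (assuming pandas, statsmodels and matplotlib are installed, and reusing the hypothetical audit.csv file from the earlier sketch):

import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Refit the model, then plot residuals against each independent variable.
df = pd.read_csv("audit.csv")   # hypothetical file with the Worksheet 3.1.1 data
X = sm.add_constant(df[["Assets", "Exp"]])
fit = sm.OLS(df["ExtHours"], X).fit()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, var in zip(axes, ["Assets", "Exp"]):
    ax.scatter(df[var], fit.resid)   # look for constant spread and no pattern
    ax.axhline(0, color="gray", linewidth=1)
    ax.set_xlabel(var)
    ax.set_ylabel("Residual")
plt.tight_layout()
plt.show()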
Step 7: Use Model for
Prediction
If the multiple regression model proves to
be statistically and practically useful, and if the assumptions are
met, it can then be used for prediction with confidence. I recognize
that this model had a problem in that the second independent
variable, experience, was determined to be of no statistical utility.
The analyst would be better off building a new model without the
experience variable, perhaps using some other variable or simply
using assets as the single independent variable.
However, I would like to go ahead and make a prediction just to
demonstrate the step. As with simple linear regression, we begin by
making a point estimate prediction. Let's make our prediction for the
first observation in the data set. This prediction involves values of
the independent variables as follows. The value for assets is
$3,200,000 and the value for audit team leader experience is 8 years.
The point estimate is then:
Eq. 3.1.4: ExtHours = 468.8 + 0.10 (3200) - 2.9 (8) = 765.6
The second step is to make the prediction interval for this observation. We are 95% confident that for a client with $3,200,000 in assets, and with an audit team leader having 8 years of experience, the audit will take:
Eq. 3.1.5: ExtHours = 765.6 +/- 2 * (standard error of model)
ExtHours = 765.6 +/- 2 * (93.7)
ExtHours = 765.6 +/- 187.4, or 578.2 to 953.0
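The point estimate and the rough two-standard-error interval are easy to script as well; here is a minimal Python sketch using the rounded coefficients from Equation 3.1.2 (not Excel's exact prediction machinery):

# Rounded coefficients from Eq. 3.1.2 and the model standard error.
b0, b1, b2 = 468.8, 0.10, -2.9
s = 93.7

# First observation in Worksheet 3.1.1: $3,200,000 in assets, 8 years experience.
assets, exp_years = 3200, 8
point = b0 + b1 * assets + b2 * exp_years    # 765.6 external hours
low, high = point - 2 * s, point + 2 * s     # roughly 578.2 to 953.0
print(point, low, high)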
As a closing note, the actual audit took 700 External Hours for the
first observation, which gives an error of -65.6. That error is 9.4%
of the actual value.
Excel will produce line fit plots showing the predicted values of Y
for all the values of X1 and X2, but produces
these as separate two-dimensional plots. It does not have the
capability to illustrate the multidimensional surface represented by
the multiple regression prediction model.
That ends the introduction. Next, we examine how to incorporate
curvature, categorical variables and interaction into the multiple
regression model.
Reference:
Anderson, D., Sweeney, D., & Williams, T. (2001). Contemporary Business Statistics with Microsoft Excel. Cincinnati, OH: South-Western, Chapter 12 (Section 12.9).