Module 3.1: Introduction to Multiple Regression

Module 3.1 Notes
"Introduction to Multiple Regression"

Index to Module Three Notes

3.1: Introduction to Multiple Regression

Multiple regression and correlation analysis is similar to simple linear regression and correlation in that it involves the study of the form, direction and strength of relationships. But with multiple regression, the relationship is between the dependent variable (Y) and multiple independent variables X₁, X₂, X₃,… X_n, that help us to better explain or predict the dependent variable (Y). The analysis is similar to simple linear regression but adding new variables lead to additional considerations as well. The general multiple regression equation with n independent variables is given by:

E(Y) = B₀ + B₁ X₁ + B₂X₂ + B₃X₃+ …+ B_n X_n where:

B₀ is the intercept, the value of Y when all X's are zero

B_i is the amount by which Y changes when a particular X_i increases by one unit with all other values held constant. The subscript i can assume values between 1 and n, which is the number of independent variables

For example, we could hypothesize that the fuel efficiency of a vehicle (variable Y) as measured by miles per gallon (MPG) is related to the octane rating of the gasoline being used (X₁) and to its horsepower (X₂). Therefore we can write a general multiple regression equation relating the independent variables with the dependent variable as:

E(Y) = B₀ + B₁ X₁ + B₂ X₂ ….or:

E(MPG) = B₀+ B₁ (Octane) + B₂ (HorsePower)

We could also hypothesize that curb weight is related to the vehicle's performance and we may want to include this additional independent variable (X₃) in the multiple regression equation:

E(Y) = B₀ + B₁ X₁ + B₂ X₂ + B₃ X₃….or:

E(MPG) = B₀+ B₁ (Octane) + B₂ (HorsePower) + B₃ (CurbWeight)

As in Module 2, we will use the seven steps to estimate regression parameters, test the practical and statistical utility of the estimated model, evaluate assumptions and use the model for prediction. The values for the coefficients in the multiple regression equation are found by using the least squares method. Recall that this method makes the sum of the squared differences between the predicted and the actual values of Y as small as possible. Because the computations are intensive, we'll rely on Excel to generate the b₀, b₁, b₂, and b₃ - which will be used as the estimates of regression parameters B₀, B₁, B₂, and B₃; statistical tests, and related measures of model's adequacy.

The seven steps we used to study simple linear regression in Module 2 will be applied for multiple regression as well. They are repeated here for convenience with a particular form of a model which will be reviewed in this module 3:

1. Hypothesize regression model

E(Y) = B₀ + B₁ X₁ + B₂X₂ + B₃X₁X₂

2. Gather data and examine relationships with graphs (scatter diagrams)
3. Estimate regression model parameters
4. Test practical utility of regression model
5. Test statistical utility of regression model and multiple independent terms
6. Evaluation assumptions of regression model
7. Use model for prediction

Note that the first step shows the addition of another independent variable, X₂. That variable could be another numerical variable or a categorical variable used to stratify the data. Examples of categorical or qualitative variables include male/female, in-season/out of season, and defective/not defective. The final term that has been added is a mathematical way of capturing a concept called interaction. Interaction allows us to model situations where the relationship between Y and X₁ is said to depend on X₂.

The objective of Module 3 is to introduce model building skills that will enable you to model relationships involving multiple numerical and categorical variables. Module 3.2 Notes will cover incorporating categorical variables and interaction separately. Model 3.3 Notes will cover model-building skills that will hopefully put everything together.

The purpose of Module 3.1 Notes is to introduce multiple regression by the addition of just one more numerical independent variable. Module Notes 3.1, 3.2 and 3.3 are designed to prepare you for Assignment 3. These note sets relate to optional readings in Chapter 13 of the Anderson, Sweeney & Williams reference given at the end of the note sets. Let's go through the 7 steps applied to multiple regression with the following example where we added a second independent variable representing the experience of the auditor as measured by his/her years of experience. In general, we add variable in a model to increase its explanatory power as measure by R². If you recall, R²in Module 2 is 0.82, meaning that approximately 82% of the sample variation in variable ExtHours is explained by variable Assets.

Step 1: Hypothesize the regression model

Equation 3.1.1 illustrates a model similar to that used in Module 2; a model to predict the external hours of time needed to audit a client based on client assets and the years of experience of the audit leader (abbreviated Exp in this illustration). My idea, or hypothesis, is that as client assets increase, external hours increase at a constant rate (positive linear relationship); and as the years of experience of the audit leader increase, external hours decrease at a constant rate (negative linear relationship).

Eq. 3.1.1: E(Y) = B₀ + B₁X₁ - B₂X₂; or

E(ExtHours) = B₀ + B₁(Assets) - B₂(Exp)

Caution

When we add a second (or many) independent numerical variable(s), we want to add independent variables that correlate strongly with the dependent variable, but weakly with the other independent variables already in the model. As a rule of thumb, I like to make sure the correlation coefficient between any of the pairs of independent variables is less than +0.7 and greater than -0.7. If two or more independent variables are strongly related to each other, we get what's called multicollinearity. This condition makes it difficult to determine the contribution of the individual independent variables, and also adversely affects model interpretation.

Step 2: Gather Data

When you are exploring whether or not independent variables will contribute to effective predictive models, you may wish to run separate scatter diagrams between Y and each of the independent variables, and determine the correlation between Y and each of the independent variables. Here is the data I have selected for this example. Assets are in $000s, as in Module 2; Exp (Experience of the audit team leader) is in years; and ExtHours (external hours to perform the audit) are in hours. Assets and Experience are the independent variables being used to predict External Hours, the dependent variable.

Worksheet 3.1.1

Assets	Exp	ExtHours
3200	8	700
3000	7	900
3500	9	800
4000	6	900
4700	8	880
5000	6	850
6000	5	1000
5500	6	1000
6500	7	950
6500	4	1100
7000	5	1200
4500	9	830
7500	3	1100
5750	6	1100
7250	5	1160
8000	8	1120
8000	4	1300
8500	4	1400
9000	7	1500
8500	2	1200

Next, I ran the scatter diagrams relating External Hours to Assets (Worksheet 3.1.2), and External Hours to Experience (Worksheet 3.1.3).

Worksheet 3.1.2

Worksheet 3.1.3

Worksheet 3.1.2 shows the moderately strong positive linear relationship between External Hours and Assets. Worksheet 3.1.3 does show a negative linear relationship, but not as strong as we would like. Remember that we use the correlation coefficient, r, to measure strength of the linear relationship between two variables. While we can take the square root of the coefficients of determination, there is an Excel Data Analysis Add-In that lets us create all of the pair-wise correlation coefficients at once. Worksheet 3.3.4 shows the result. To get the correlation matrix, select Tools from the Standard Toolbar, Data Analysis from the pulldown menu, Correlation, (in Excel 2007, select Data tab, then Data Analysis option, then Correlation from the dialogue box), then respond to the dialog box entry questions. If the data in Worksheet 3.1.1 above started in Column A, Row 1, and extended through Column C, Row 21, the input range for the first dialog box is A1:C21. I remember to check the Labels circle, then place the output near my data, such as D1. Click OK and Worksheet 3.1.4 shows the results:

Worksheet 3.1.4


	Assets	Exp	ExtHours
Assets	1
Exp	-0.605951	1
ExtHours	0.90407	-0.565325	1

Note the less than moderate relationship between the two independent variables, Assets and Experience (-0.606): that's good. But also note the fairly weak relationship between the dependent variable, External Hours, and Experience (-0.565): that's not so good if we want to improve our prediction of External Hours.

Model builders like to use the principle of parsimony or keeping models as simple as possible (I've heard this called kis - keep it simple, but prefer the more academic term!). When we build multiple regression models, we do run the risk of making something too complicated to interpret and understand, especially when the addition of a variable does not contribute to the prediction of the dependent variable. I want to continue with the addition of a fairly weak variable to illustrate this point.

Pause and Reflect

We are introducing a subject that involves building bigger models than the simple linear regression model. However, we need to be careful that we use good science in building these bigger models. Just because a model is bigger doesn't make it better - sometimes it's just bigger.

Step 3: Estimate the Regression Parameters

To estimate the regression parameters and generate all of the regression statistics and graphs, use the Regression Data Analysis Add In as we did in Module 2. Select Tools from the Standard Toolbar, Data Analysis from the pulldown menu, Regression, and then respond to the dialog questions as you did in building a simple linear regression model. The difference is that the X range will include highlighting all of the columns containing the X variables rather than just one column. Caution: when you load data, be sure to put all of the X variables in contiguous columns to allow their selection at one time (Note Worksheet 3.1.1 shows the two independent variables in adjacent columns). Worksheet 3.1.5 illustrates the Regression Summary.

Worksheet 3.1.5

SUMMARY OUTPUT

Regression Statistics
Multiple R	0.9043379
R Square	0.8178270
Adjusted R Square	0.7963949
Standard Error	93.699570
Observations	20

ANOVA
	Df	SS	MS	F	Significance F
Regression	2	670041.6392	335020.8196	38.15897	5.18E-07
Residual	17	149253.3608	8779.609459
Total	19	819295

	Coefficients	Standard Error	t Stat	P-value	Lower 95%	Upper 95%
Intercept	468.78422	153.59229	3.0521337	0.007206	144.7324	792.8361
Assets	0.0981388	0.0143926	6.8186667	2.99E-06	0.067773	0.128505
Exp	-2.931174	13.790612	-0.2125485	0.834207	-32.02687	26.16452



RESIDUAL OUTPUT					PROBABILITY OUTPUT

Observation	Predicted ExtHours	Residuals	Standard Residuals		Percentile	ExtHours
1	759.3791346	-59.379134	-0.6699595		2.5	700
2	742.6825406	157.31745	1.7749726		7.5	800
3	785.8896132	14.1103868	0.1592038		12.5	830
4	843.7525607	56.247439	0.6346254		17.5	850
5	906.5874023	-26.587402	-0.2999788		22.5	880
6	941.8914058	-91.891405	-1.0367872		27.5	900
7	1042.961426	-42.961425	-0.4847227		32.5	900
8	990.9608284	9.0391716	0.1019866		37.5	950
9	1086.168499	-136.16849	-1.5363543		42.5	1000
10	1094.962023	5.0379765	0.0568422		47.5	1000
11	1141.100271	58.899728	0.6645505		52.5	1100
12	884.0284583	-54.028454	-0.6095892		57.5	1100
13	1196.032044	-96.032043	-1.0835048		62.5	1100
14	1015.49554	84.504462	0.9534422		67.5	1120
15	1165.634982	-5.6349823	-0.0635780		72.5	1160
16	1230.445591	-110.44559	-1.2461293		77.5	1200
17	1242.170291	57.829708	0.6524778		82.5	1200
18	1291.239714	108.76028	1.2271144		87.5	1300
19	1331.515611	168.48438	1.9009662		92.5	1400
20	1297.102064	-97.10206	-1.0955777		97.5	1500

Look in the intercept, Assets and Experience Rows, in the Coefficients Column, to obtain the regression coefficients. These are repeated in Equation 3.1.2.

Eq. 3.1.2: ExtHours = 468.8 + 0.10 (Assets) - 2.9 (Experience)

The intercept, 468.8, is interpreted as in simple linear regression. External hours equal 468.8 when assets and experience equal to zero. Since we did not have any observations where assets and experience equal zero, there would be no practical interpretation.

The slope of 0.10 for assets means that external hours increase by 0.10 when assets increase by one ($1,000 since assets are in $000s), holding experience constant (that is, for interpretations involving team leaders with similar experience). The slope of -2.9 means that external hours decrease by 2.9 when experience increases by 1 year, holding assets constant (at similar levels).

Step 4: Test Practical Utility of Regression Model

As with simple linear regression, we use R² and Standard Error of the Model to test practical utility of the model.

Coefficient of Multiple Determination, Adjusted R²The R² in multiple regression is called the coefficient of multiple determination. It is safest to use the Adjusted R²(called Adjusted R Square in Excel), as this statistic adjusts for the number of independent variables in relation to sample size to prevent overestimating the amount of variability explained by the independent variables. A rule of thumb in multiple regression is that we should have at least 10 observations for each regression parameter. This model has three regression parameters (the intercept B₀ and the two slopes, B₁ and B₂) so we should have 30 observations. We only have 20 so the adjusted R² is a bit lower than the R² - sort of like keeping us honest. The formula used by Excel to compute the adjusted R² is:

Eq. 3.1.3: R²_adjusted = 1 - { ( 1 - R²) * [ (n - 1) / (n - 2) ] }

From this formula, you can see that if the sample size was 1,000, the difference between adjusted R Square and R Square is negligible.

The adjusted R Square is found in the Regression Statistics near the top of Worksheet 3.1.5. It is 0.796. The interpretation: assets and experience explain 79.6% of the sample variation in external hours. As in simple linear regression, we would like this to be as close to 100% as possible. A rule of thumb benchmark might be at least above 50%. If you are comparing two models, the model that is most practically useful, if everything else is equal, is the model with the highest adjusted R².

Standard Error of the Model
The Standard Error of the Model or Estimate is found directly below the adjusted R². It is 93.7. The interpretation: we would expect 95% (or most) of the actual values of external hours to be within +/- 2 * 93.7 of the predicted external hours. As with simple linear regression, we want the standard error to be as low as possible. A benchmark might be the standard error (93.7) divided by an actual values of Y (external hours) should be lower than ranges between .05 and 0.15, or 5% to 15%. If you are given two models, the model with the lowest standard error, if everything else is equal, is the model with the most practical utility.

Pause and Reflect

Although testing practical utility involves benchmarks and judgment, it is very important. Models that are not judged practically useful should not be used, even if they are proven to be statistically significant or useful. A model could have a very steep slope and thus be statistically significant, but it might have too much error around the regression line to be of practical use.

Step 5: Test Statistical Utility

There are multiple hypothesis tests in multiple regression because not only can we test for statistical utility of the model, but we can also test for statistical utility of the multiple independent variables.

Is the Model Statistically Significant or Useful?
The null and alternative hypotheses to test for model statistical utility are:

H₀: B₁ = B₂ = 0 (the model is not statistically useful)
H_a: At least 1 B is =/= 0 (the model is statistically useful)

(remember, =/= is the closest I can come to the "not equal" symbol)

The test statistic for this test of hypothesis is the F, which has a value of 38.16 (look in the Regression row of the ANOVA table in Worksheet 3.1.5, under the F Column). Recall, in concept, the F measures the ratio of variation explained by the regression model to the variation unexplained. The farther this number is from 1, the more variation is explained related to unexplained. The p-value for the F is 5.28E-07, or 0.000000528 (look in the Regression row of the ANOVA table in Worksheet 3.1.5,under the Significance F column). Since the p-value of 5.28E-07 is less than an alpha of 0.01, we reject the null hypothesis and conclude the model is statistically useful.

Attention to Detail

Did you notice that I used an alpha of 0.01, whereas I have been using convention of an alpha of 0.05 before now? We are entering into a new hypothesis test procedure in multiple regression where we do multiple hypothesis tests on the same data (with this example, we could actually do three tests: a model test, a test on X₁, and a test on X₂). Each test with the same data theoretically increases the error of rejecting a true null hypothesis. So, research convention suggests that we lower the alpha to 0.01 when we do multiple tests with the same data.

Are Assets Statistically Significant or Useful?
We can also test for statistical significance of the individual independent variables. The null and alternative hypotheses to test for the importance of assets are:

H₀: B₁ = 0 (assets are not important or not statistically useful)
H_a: B₁ =/= 0 (assets are important or statistically useful)

The test statistic is the t, which has a value of 6.82. This is found in the Asset row of the regression statistics in Worksheet 3.1.5. It's value implies that the slope of 0.098 or 0.10 is 6.82 standard errors of the slope from zero. The associated p-value is 2.99E-06. Since the p-value is less than an alpha value of 0.01, reject the null hypothesis and conclude that assets are important in predicting the value of external hours.

Is Experience Statistically Significant or Useful?
We can similarly test for the statistical significance of the experience variable.

H₀: B₂ = 0 (experience is not important or not statistically useful)
H_a: B₂ =/= 0 (experience is important or statistically useful)

The t statistic for this test is found in the Experience row of the regression summary. Its value is -0.21, with an associated p-value of 0.834. Since the p-value is greater than an alpha of 0.01, we fail to reject the null hypothesis and conclude that experience is not useful in predicting external hours. We added a variable that has no statistical utility and it should be removed from the model, at least in its current form. Another variable might be considered, or the analyst may just decide to go with assets for predicting external hours.

Caution

When I say that Experience is not of statistical utility or important in helping to predict External Hours, we must realize that we are only testing experience in it's linear form at this point. There may be a slight chance that experience is important in a curvilinear relationship with external hours. We will talk more on this in Module 3.2 Notes.

A last point before we go on to Step 6. Note in the Experience row of Worksheet 3.1.5 that the Lower 95% and Upper 95% Confidence Levels are -32.03 and 26.16 for the slope coefficient of -2.93. The interpretation: we are 95% confident that when experience increases by 1 year, the change in predicted external hours is between a decrease of 32 hours and an increase of 26 hours. Since sometimes external hours decrease, and sometimes external hours increase when experience increases by one, knowledge of experience does not help in predicting external hours. The 95% confidence interval on the slope confirms why we did not reject the null hypothesis of no statistical relationship between experience and external hours.

Step 6: Evaluate Assumptions of Regression Model

The assumptions and their evaluation are the same as in simple linear regression. The error or residual term should be normally distributed with mean of zero; the error should be constant for all values of the independent variables; and the error should be independent.

We evaluate the first assumption by examining the standardized residuals to see if any error terms are more than three standard errors from the regression line. The normal probability plot can also be used. The idea is that this assumption is generally met unless there are extreme values that would cause a skew in the data. In the Residual Output section of Worksheet 3.1.5, in the column "Standard Residuals," we find no value greater than +3 or less than -3. Thus, we would declare that this assumption is met.

The second and third assumptions are evaluated by looking at the residual plots. Worksheet 3.1.6 illustrates one of the two residual plots (one residual plot for each independent variable will be automatically produced by the regression add-in to Excel). This plot shows that the error terms do not exhibit a distinct pattern and are fairly constant in their departure from zero for the various values of the independent variable. Both residual plots would have to be so evaluated. As long as there is fairly constant departure from zero for all values of the independent variable, and as long as there is no distinct pattern such as a cone or football or curve, the second and third assumptions are met.

Worksheet 3.1.6

Step 7: Use Model for Prediction

If the multiple regression model proves to be statistically and practically useful, and if the assumptions are met, it can then be used for prediction with confidence. I recognize that this model had a problem in that the second independent variable, experience, was determined to be of no statistical utility. The analyst would be better off building a new model without the experience variable, perhaps using some other variable or simply use assets as the single independent variable.

However, I would like to go ahead and make a prediction just to demonstrate the step. As with simple linear regression, we begin by making a point estimate prediction. Let's make our prediction for the first observation in the data set. This prediction involves values of the independent variables as follows. The value for assets is $3,200,000 and the value for audit team leader experience is 8 years. The point estimate is then:

Eq. 3.1.4: ExtHours = 468.8 + 0.10 (3200) - 2.9 (8) = 765.6

The second step is to make the prediction interval for this observation. We are 95% confident that for a client with $3,200,000 in assets, and with an audit team leader having 8 years of experience, the audit will take:

Eq. 3.1.5: ExtHours = 765.6 + 2 * standard error of model

ExtHours = 765.6 + 2 * (93.7)

ExtHours = 765.6 + 187.4 or 578.2 to 953.

As a closing note, the actual audit took 700 External Hours for the first observation, which gives an error of -65.6. That error is 9.4% of the actual value.

Excel will produce line fit plots showing the predicted values of Y for all the values of X₁ and X₂, but produces these as separate two-dimensional plots. It does not have the capability to illustrate the multidimensional surface represented by the multiple regression prediction model.

That ends the introduction. Next, we examine how to incorporate categorical variables and interaction into the multiple regression model.

Reference:

Anderson, D., Sweeney, D., & Williams, T. (2007). Essentials of Modern Business Statistics with Microsoft Excel. Cincinnati, OH: South-Western, Chapter 13.

D. Groebner, P. Shannon, P. Fry & K. Smith. Business Statistics: A Decision Making Approach, Seventh Edition, Prentice Hall, Chapter 15.

Ken Black. Business Statistics for Contemporary Decision Making. Fourth Edition, Wiley. Chapter 13 & 14

| Return to Module Overview | Return to top of page |