Module 1.5 Notes "Testing Hypothesis"


Module 1.4 covered one of the two classic methods of inferential statistics - confidence interval estimation. That method is most frequently used to answer exploratory research questions, such as "what is the average cycle time," or "what is the average profit contribution?" Explanatory research questions are frequently answered through the other method of inferential statistics - the testing of hypotheses, or hypothesis-testing. We start with a belief, claim, prediction, or assertion (hypothesis) about the parameter of interest (in this Module we have been studying the population mean). Then we gather a sample, compute sample statistics, and draw a conclusion about that hypothesis. Since we are working with a sample, we have to add an appropriate measure of reliability. The following notes cover a five-step methodology for hypothesis-testing.

Step One: State Null and Alternative Hypotheses

To illustrate statistical statements of hypotheses, I'll present three hypothesis test scenarios. In Scenario One, suppose someone believes the true average cycle time is less than 24 days. Recall that cycle time in this illustration is the time between when a company makes an order for material, and the time the material is received. But cycle time is a variable closely monitored in many activities - the time to pay an account, the time to process a customer service request, the time to hang a bottle of blood for a blood transfusion, and so forth.

Back to this scenario: the belief or prediction that true average cycle time is less than 24 days is generally based upon someone's knowledge of the underlying process. Perhaps they were involved in making an improvement to the process. Last year, the average cycle time may have been 24 days, improvements were made, and this year they expect cycle time to improve (< 24 days, on average). This belief is generally what the analyst tries to prove or support by gathering evidence. In statistics, it is called the alternative hypothesis, also known as the research hypothesis (symbol Ha; you will also see H1 in some texts and journals). The hypothesis that complements the alternative is called the null hypothesis (symbol Ho), or hypothesis of equality. The statistical hypothesis statements are written as follows:

Scenario One
Ho: Population Mean = 24 (this is the null hypothesis)
Ha: Population Mean < 24 (this is the alternative hypothesis)

For Scenario Two, suppose someone believes the true average cycle time to be greater than 20.9 (the alternative hypothesis). Perhaps the scenario is that last year the company was using vendors who shipped by less than truckload and the average cycle time experience was 20.9 days (the null hypothesis). This year they switched to vendors who use truckload (cheaper but takes longer), thus they predict the cycle time will go up compared to last year. The null and alternative hypothesis statements are written:

Scenario Two
Ho: Population Mean = 20.9 (this is the null hypothesis)
Ha: Population Mean > 20.9 (this is the alternative hypothesis)

For Scenario Three, now suppose last year the average cycle time was 20 and vendors were replaced so that changes in cycle time are expected but no one knows if the changes lead to increased cycle time or decreased cycle time. The alternative hypothesis would be that the cycle time is not equal to 20. The null hypothesis is that the cycle time is equal to 20. These statements for this test would be written:

Scenario Three
Ho: Population Mean = 20 (this is the null hypothesis)
Ha: Population Mean =/= 20 (=/= is the symbol for not equal)

Note carefully that each scenario involved two statistical hypothesis statements. The first two scenarios involved directional hypothesis tests or one-tailed tests. Specifically, Scenario One is a lower-tail test: to support the alternative hypothesis, we would have to find sample means much lower than the hypothesized mean. Scenario Two is specifically called an upper-tail directional test: to support the alternative hypothesis, we would have to find sample means much higher than the hypothesized mean. The third scenario involved a non directional or two-tailed test. In all tests, the null hypothesis:

1. Contains the equal sign, thus it is sometimes referred to as the hypothesis of no difference or no effect.
Some texts write the null hypothesis as >= for an alternative written as <, and as <= for an alternative written as >, so that the null hypothesis is always the exact complement of the alternative. Classical hypothesis testing simply puts the = sign in the null, which makes it easier to understand that the null is always the hypothesis of equality.

2. Is stated in specific terms regarding what the true value of the population parameter (in this case, the population mean) is predicted to be (24, 20.9 and 20 were the values for these three separate scenarios).

3. Is the hypothesis to be tested. We either reject or fail to reject the null.

If we reject the null hypothesis, we do so in favor of the alternative because the evidence we have gathered supports the alternative. If we fail to reject the null hypothesis, we have insufficient evidence to support the alternative. Thus the null hypothesis "presumes innocence until proven guilty."

In all statements of hypothesis tests, the alternative hypothesis:

1. Does not contain the equal sign.

2. Is the conclusion supported by the sample evidence when the null hypothesis is rejected.

Please note that researchers and business data analysts would only test one set of statistical hypothesis statements to answer a specific research question with a sample of data. A typical scenario might be scenario one. Last year, the mean cycle time was 24 days before a continuous improvement program was initiated and they want to see if cycle time decreases because of the continuous improvement. I presented two other scenarios for illustration purposes.

How do we know whether to fail to reject the null hypothesis, or to reject the null in favor of the alternative? We gather a sample set of data from the population of interest, find the sample statistic that best estimates the population parameter under investigation, find the probability of getting the sample statistic if the null hypothesis is true, and make a conclusion based on the probability. The concept is simple: in scenario one, if our sample mean comes out to be 7 we would say, "there is no way we could get a sample mean equal to 7 if the true population mean was equal to 24, so reject the null in favor of the alternative." But what if the sample mean came out to be 23.999? There is a fairly high probability of getting a sample mean of 23.999, if the true population mean was in fact 24, just by chance alone. In this case, we would fail to reject the null hypothesis. In other words, we haven't gathered enough evidence to reject the null hypothesis, so we cannot show that the continuous improvement program worked - the sample mean may differ from the hypothesized population mean only because of sampling error.

While many hypothesis tests are supported by observation such as above, we obviously need more precision in making the decision to reject or fail to reject the null hypothesis. That precision comes in steps 2 through 5.

Step 2: Determine and Compute the Test Statistic

The general form of the hypothesis test statistics is shown in Equation 1.5.1:

Eq. 1.5.1: Test Statistic =
(Estimator - Hypothesized Value of Estimator) / Standard Error of the Estimator

There are two test statistics for testing a population mean; the Z and the t. The Z test of hypothesis for a population mean is used when the population standard deviation is known:

Eq. 1.5.2: Z =(Sample Mean - Hypothesized Mean) divided by
[Population Standard Deviation / Sq. Rt. (n)]

The assumptions for the Z test are:

1. The Population Standard Deviation is known.

2. Numerical data is independently and randomly drawn from a population known to be normally distributed.

3. If the population is not normally distributed, the sampling distribution of the sample mean can still be approximated by the normal distribution as long as the sample size is large (> 30).

Suppose we want to test the following hypotheses and know that the population standard deviation is 3:

Scenario One
Ho: Population Mean = 24 (this is the null hypothesis)
Ha: Population Mean < 24 (this is the alternative hypothesis)

We would gather a sample, compute the sample mean and then solve for Z using Equation 1.5.2. Let's say the sample mean is 21, and the sample size is 30. Then:

Eq. 1.5.3: Z = (21 - 24) / [ 3 / Sq. Rt. (30) ]
= - 3 / ( 3 / 5.5 )
= - 3 / 0.55
= - 5.5

The interpretation is: the sample mean of 21 is 5.5 standard errors less than the hypothesized mean of 24 (21 is quite far from 24 in terms of standard errors and in the direction of the alternative hypothesis, casting doubt on the truth of the null hypothesis, as we will see in Steps 3 - 5).

This formula could be easily constructed in an active cell on an Excel worksheet. The cell formula would be:

=(21-24)/(3/SQRT(30))

The other test statistic is the t. The t test is used when the population standard deviation is unknown and must be estimated by the sample standard deviation. Because the population standard deviation is generally unknown, this is the more common test statistic. The formula for the t statistic is:

Eq. 1.5.4: t = (Sample Mean - Hypothesized Mean) divided by
[Sample Standard Deviation / Sq. Rt. ( n )]

The assumptions for the t test:

1. The population standard deviation is unknown and is estimated by the sample standard deviation.

2. Numerical data is independently and randomly drawn from a normal distribution.

3. If the population is not normal, but not very skewed and the sample size is large (> 30), the t distribution provides a good approximation to the sampling distribution of the sample mean.

Note the only difference between this formula and Eq. 1.5.2 is that we use the sample standard deviation, s, rather than the population standard deviation. If the sample mean is 21, the sample standard deviation is 3, the sample size is 30, and the hypothesized value of the population mean is 24, the t statistic has a value of - 5.5, the same as the result for Z in Eq. 1.5.3. Any difference between the Z and the t will appear when we compute probabilities in Step 3, although with large sample sizes the Z and the t probabilities are nearly identical, as was noted in Module 1.4 Notes.
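In practice the sample mean and the sample standard deviation come from the raw data. Here is a minimal sketch of Eq. 1.5.4 in Python, assuming a small made-up sample of ten cycle times (the data values and the variable names are hypothetical and are not part of these notes). With a sample this small, the t test also relies on the assumption that the population is normal.

```python
import math
import statistics

# Hypothetical cycle times in days, invented purely for illustration.
cycle_times = [22, 19, 25, 21, 18, 23, 20, 24, 17, 21]
hypothesized_mean = 24  # Scenario One null hypothesis

n = len(cycle_times)
sample_mean = statistics.mean(cycle_times)
sample_sd = statistics.stdev(cycle_times)  # sample standard deviation (divides by n - 1)

# Eq. 1.5.4: t = (sample mean - hypothesized mean) / (s / sqrt(n))
standard_error = sample_sd / math.sqrt(n)
t_value = (sample_mean - hypothesized_mean) / standard_error

print(f"mean = {sample_mean}, s = {sample_sd:.3f}, t = {t_value:.2f}")
```

For this made-up sample the mean is 21 and t comes out near -3.67, with 10 - 1 = 9 degrees of freedom.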

Before we compute the probabilities, let's compute the values of the test statistics for Scenarios Two and Three. For each scenario, we will assume that the population standard deviation and the sample standard deviation are the same (3), the sample size is 30, and the sample mean is 21. Since the population and sample standard deviations are assumed equal, the Z and the t values will be equal.

Scenario Two
Eq. 1.5.5: Z = t = (21 - 20.9) / [ 3 / Sq. Rt. (30) ]
= 0.1 / ( 3/ 5.5)
= 0.1 / 0.55
= 0.18

The interpretation: the sample mean of 21 is just 0.18 standard errors from the hypothesized mean of 20.9 (a sample mean of 21 is a reasonable expectation if the null hypothesis is indeed true, as we will see in Steps 3 - 5).

Scenario Three

Eq. 1.5.6: Z = t = (21 - 20) / [ 3 / Sq. Rt. (30) ]
= 1.0 / ( 3/ 5.5)
= 1.0 / 0.55
= 1.82

The interpretation: the sample mean of 21 is 1.82 standard errors from the hypothesized mean of 20 (since it isn't a clear case of rejecting the null hypothesis as in Scenario One, or failing to reject the null hypothesis as in Scenario Two, we need the precision of Step 3 to make the decision). Please note that since Scenario Three is a two-tailed test, we have to consider both the possibility of getting a Z or a t equal to 1.82 or -1.82.
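The three test statistics above can also be checked outside of Excel. The following is a small Python sketch of Eq. 1.5.1 applied to each scenario; the function name standardized_statistic is my own choice for illustration.

```python
import math

def standardized_statistic(sample_mean, hypothesized_mean, sd, n):
    """Eq. 1.5.1: (estimator - hypothesized value) / standard error."""
    standard_error = sd / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error

# Hypothesized means from the three scenarios; sample mean 21, sd 3, n 30 throughout.
hypothesized_means = {"Scenario One": 24, "Scenario Two": 20.9, "Scenario Three": 20}

statistics_by_scenario = {
    name: standardized_statistic(21, mu0, 3, 30)
    for name, mu0 in hypothesized_means.items()
}

for name, value in statistics_by_scenario.items():
    print(f"{name}: {value:.2f}")
```

The unrounded results are about -5.48, 0.18 and 1.83. The hand calculations above round the standard error to 0.55 first, which is why they show -5.5 and 1.82.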

Step 3: Find Probability of Test Statistics (p-Value)

At this point, we want to know the probability of obtaining a test statistic at least as small as the calculated statistic (for < directional alternative hypothesis tests such as the Scenario One example); the probability of obtaining a test statistic at least as large as the calculated statistic (for > directional alternative hypothesis tests such as the Scenario Two example); or the probability of obtaining a test statistic at least as far from zero, in either direction, as the calculated test statistic (for nondirectional =/= alternative hypothesis tests such as the Scenario Three example). In hypothesis testing, these probability values are called p-values. I should point out that I am following the p-value approach to hypothesis testing to focus on the approach most widely used in the literature, rather than the Critical Value approach provided in some texts (Anderson, Sweeney, Williams, pp. 334-337, Chapter 9).

Probability tables for finding p-values are built into Excel. For probabilities associated with Z test statistics (Z-Scores), select an active cell in an open worksheet, select Insert from the Standard Toolbar, then Function, Statistical, NORMSDIST, and then enter the Z-Score to get the cumulative probability up to the Z-Score. You may recall the NORMSDIST function from Module 1.3 Notes.

p-Values for Z Test Statistics

Scenario One
Eq. 1.5.7: =NORMSDIST(-5.5)
This equation is what you enter in an active cell on an Excel worksheet to get Probability (Z < -5.5) for this one-tail test. This is equivalent to stating Probability(Sample Mean < 21 given the true mean is 24). Excel returns 1.9E-08 in the active cell.

Interpretation: 1.9 E -08 is scientific notation, meaning move the decimal point eight digits to the left giving 0.000000019. This says the probability of getting a Z-Score of less than -5.5 is 0.000000019, a very small probability. Remember, the Z-value of -5.5 really represents the number of standard errors the sample mean of 21 is from the hypothesized mean of 24. Thus, the probability of us getting a sample mean of 21 is relatively low if the null hypothesis is true (population mean = 24); so the null hypothesis must be rejected in favor of the alternative based on evidence in this sample. We will put more precision in determining what is "relatively low" in Step 4.

Scenario Two

Eq. 1.5.8: =1 - NORMSDIST(0.18)
This equation is what you enter in an active cell of an Excel worksheet to get Probability(Z > 0.18) for this one tail test. This is equivalent to Probability (Sample Mean > 21 given the true mean is 20.9). Excel returns a p-value of 0.43 in the active cell. Note that since the NORMSDIST function returns a cumulative probability up to the Z-Score, to get the cumulative probability above the Z-Score we have to use =1 - NORMSDIST(0.18) for this upper-tail test since we are interested in probabilities above the Z-Score of 0.18.

Interpretation: The probability of obtaining a sample mean of 21 or more is relatively high if the null hypothesis is true (population mean = 20.9); so the null hypothesis cannot be rejected based on the evidence of this sample. As with Scenario One, we will put more precision in determining what is "relatively high" in Step 4.

Scenario Three

Eq. 1.5.9: =2 * (1-NORMSDIST(ABS(1.82)))
This equation is what you enter in an active cell of an Excel worksheet to get Probability(Z > 1.82 or Z < -1.82) for this two-tail test. This is equivalent to Probability (Sample Mean > 21 or < 19 given the true mean is 20). Excel returns a p-value of 0.0688 in the active cell.

Interpretation: The probability of obtaining a sample mean of 21 or more is 3.44% if the null hypothesis is true (population mean = 20). But since we are doing a two-tail test, we have to multiply 3.44% times 2 since we could just as likely get a sample mean 1.82 standard errors to the left of the hypothesized mean. Note that I used the absolute value function nested within the NORMSDIST function to give you a formula that would work for two-tail tests no matter if Z came out to be positive or negative. Further note, to determine if 6.88% is relatively high or low, we need the precision to be presented in Step 4. Before doing this, we need to compute the p-values for the t statistics.
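For readers working outside of Excel, the same three Z p-values can be sketched in Python. This is only an illustration: it assumes the NormalDist class from Python's standard statistics module as a stand-in for NORMSDIST, and the variable names are my own.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1, like NORMSDIST

# Scenario One (lower-tail test): Probability(Z < -5.5)
p_lower = standard_normal.cdf(-5.5)

# Scenario Two (upper-tail test): Probability(Z > 0.18)
p_upper = 1 - standard_normal.cdf(0.18)

# Scenario Three (two-tail test): Probability(Z > 1.82 or Z < -1.82)
p_two_tail = 2 * (1 - standard_normal.cdf(abs(1.82)))

print(f"{p_lower:.1e}  {p_upper:.2f}  {p_two_tail:.4f}")
```

The three results should agree with the Excel values above: about 1.9E-08, 0.43 and 0.0688.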

P-values for t Test Statistics
To get probability p-values for the t test statistic from the t distribution, we use the TDIST function of Microsoft Excel. Select an active cell for the p-value, and then select Insert from the Standard Toolbar, Function, Statistical, and TDIST. Note that the TDIST function requires the absolute value of the t statistic we computed in Step 2, the degrees of freedom which is sample size minus one, and whether the test is one- or two-tails.

Scenario One
Eq. 1.5.10: =TDIST(5.5, 30-1,1)
When =TDIST(5.5,30-1,1) is entered in an active cell to get the p-value associated with the t statistic, Excel returns 3.16E-06, or 0.00000316. This probability would be interpreted similar to the p-value for the Z test statistic interpreted in Eq. 1.5.7. Note that the t value is always entered as a positive number in the TDIST function.

Scenario Two

Eq. 1.5.11: =TDIST(0.18, 30-1,1)
When =TDIST(0.18,30-1,1) is entered in the active cell, Excel returns 0.43. This probability would be interpreted similar to the p-value for the Z test statistic interpreted in Eq. 1.5.8. Note that you do not have to enter =1 - before the function as was done in Eq. 1.5.8 since the t Table in Excel was only constructed for tail probabilities.

Scenario Three

Eq. 1.5.12: = TDIST(1.82,30-1,2)
When =TDIST(1.82,30-1,2) is entered in an active cell, Excel returns 0.079. This p-value would be interpreted similar to the p-value for the Z test statistic interpreted in Eq. 1.5.9. Note that you do not have to multiply the p-value by 2 since for the t distribution, the number of tails for the test statistic is part of the function.
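Python's standard library has no direct equivalent of TDIST, so the sketch below approximates the tail area by numerically integrating the t density with Simpson's rule. The function names t_density and t_tail_probability are hypothetical, and the numerical integration is my own stand-in for what Excel does internally, not a description of it.

```python
import math

def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coefficient = math.gamma((df + 1) / 2) / (
        math.sqrt(df * math.pi) * math.gamma(df / 2)
    )
    return coefficient * (1 + x * x / df) ** (-(df + 1) / 2)

def t_tail_probability(t, df, tails, steps=20000):
    """Area beyond a positive t in one tail (tails=1) or both tails (tails=2)."""
    # Simpson's rule for the area between 0 and t; steps must be even.
    width = t / steps
    total = t_density(0, df) + t_density(t, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_density(i * width, df)
    area_zero_to_t = total * width / 3
    # Half of the distribution lies above zero, so the upper tail is 0.5 minus that area.
    return tails * (0.5 - area_zero_to_t)

df = 30 - 1
p_one = t_tail_probability(5.5, df, 1)     # Scenario One, one tail
p_two = t_tail_probability(0.18, df, 1)    # Scenario Two, one tail
p_three = t_tail_probability(1.82, df, 2)  # Scenario Three, two tails

print(f"{p_one:.2e}  {p_two:.2f}  {p_three:.3f}")
```

As with TDIST, the t value is passed as a positive number and the number of tails is an argument. The results should land close to the Excel values in Eq. 1.5.10 through 1.5.12.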

Have you noticed that the Z and the t values and probabilities are similar? They will be practically identical at really large sample sizes (above 120) and nearly identical at large sample sizes (30 or more). They will also be closer near the peak of the bell-shaped distribution, where probability values are closest to 0.50. Note that in Scenario Two, the p-values were identical (to two decimal places) at 0.43.

Step 4: Determine the Level of Statistical Significance

In the above equations, I have provided practical interpretations of low or high p-values associated with the Z or t test statistics. When the p-value was low, we rejected the null hypothesis in favor of the alternative. In hypothesis testing, this would indicate that the analysis is statistically significant. Scientific convention has established that in order to declare the result of a hypothesis test statistically significant, there can be no more than a 5% likelihood that the difference is due to chance (D. Sheskin, 1997). The 5% threshold is referred to as the level of significance. Knowing the level of significance for a study, we can now present a simple decision rule for rejecting or failing to reject the null hypothesis.

When the p-value is < 0.05, reject the null hypothesis. With such a low p-value, there is little likelihood that the observed difference between the sample mean and hypothesized mean is due to chance - it must be due to some program, process change, intervention or other effect.

When the p-value is > 0.05, fail to reject the null hypothesis. With such a high p-value, the observed difference between the sample mean and the hypothesized mean is small enough that it could easily be due to the chance involved in sampling error.

While that is the basics, let's examine the alpha level of significance in some more detail. Since we are working with a sample we can make two errors in hypothesis testing:

Type I Error: Rejecting a true null hypothesis. In hypothesis testing, the probability of making a type one error is labeled alpha, the level of significance.

Type II Error: Failing to reject a false null hypothesis. The probability of making a type two error is labeled beta.

The complementary probabilities are:

Confidence Coefficient: Failing to reject a true null hypothesis. This probability is labeled (1 - alpha). We already saw this in Module 1.4 Notes - it is the basis of the confidence interval. An alpha level of significance of 0.05 provides a 95% confidence coefficient.

Power: Rejecting a false null hypothesis. This probability is labeled (1 - beta).

The interested reader is referred to the Anderson, Sweeney and Williams optional reference for additional details. For our application, remember the simple decision rule. When the p-value is < alpha = 0.05, reject the null hypothesis; when the p-value is > alpha = 0.05, fail to reject the null hypothesis.
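The simple decision rule can be written as a few lines of Python. This is only a sketch: the function name conclude is my own, and the rule above does not say what to do when the p-value exactly equals alpha, so treating that case as "fail to reject" is an assumption.

```python
def conclude(p_value, alpha=0.05):
    """Apply the decision rule: reject the null hypothesis when p-value < alpha."""
    if p_value < alpha:
        return "reject the null hypothesis"
    # Assumption of this sketch: a p-value exactly equal to alpha is not rejected.
    return "fail to reject the null hypothesis"

# Z test p-values computed in Step 3 for the three scenarios.
z_p_values = {
    "Scenario One": 0.000000019,
    "Scenario Two": 0.43,
    "Scenario Three": 0.0688,
}

decisions = {name: conclude(p) for name, p in z_p_values.items()}

for name, decision in decisions.items():
    print(f"{name}: {decision}")
```

Only Scenario One is rejected at alpha = 0.05. Calling conclude(0.0688, alpha=0.10) would reject in Scenario Three, which shows why alpha has to be fixed before the data are examined.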

I close this step by saying that when a researcher believes that alpha = 0.05 is too high, they may elect to employ a 1 % level of significance, or even lower in some cases of medical research. The lower the level of significance, the less likely one would be to reject the null hypothesis and conclude that the research project is successful. While 0.05 is common in business applications, it is a matter of judgment. When the consequences of making a Type I error are really much more severe than the consequences associated with a Type II error, then researchers switch to the more conservative alpha = 0.01. This increases the beta probability which in turn lowers the power of the test so researchers recognize the tradeoffs. We will adopt the tradition of using 5% levels of significance for hypothesis testing.
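To make the tradeoff between alpha and power concrete, here is a rough Python sketch for the lower-tail Z test of Scenario One (hypothesized mean 24, population standard deviation 3, sample size 30). The true mean of 23 used below is a purely hypothetical value chosen for illustration, and the function name power_lower_tail is my own.

```python
import math
from statistics import NormalDist

def power_lower_tail(true_mean, hypothesized_mean, sd, n, alpha):
    """Probability of rejecting Ho in a lower-tail Z test when the mean is true_mean."""
    standard_error = sd / math.sqrt(n)
    # Reject when Z < critical Z, i.e. when the sample mean falls below this cutoff.
    z_critical = NormalDist().inv_cdf(alpha)
    cutoff = hypothesized_mean + z_critical * standard_error
    # Sample means are centered on the true mean, not on the hypothesized mean.
    return NormalDist(true_mean, standard_error).cdf(cutoff)

power_05 = power_lower_tail(23, 24, 3, 30, alpha=0.05)
power_01 = power_lower_tail(23, 24, 3, 30, alpha=0.01)

print(f"power at alpha 0.05: {power_05:.2f}")
print(f"power at alpha 0.01: {power_01:.2f}")
```

Under these assumptions the power drops from roughly 0.57 to roughly 0.31 when alpha is tightened from 0.05 to 0.01, so beta rises from about 0.43 to about 0.69.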

Step 5: Making the Hypothesis Test Conclusion

The final step puts it all together with a three part conclusion:

1. Compare the p-value to alpha.

2. Based on the comparison, state whether to reject or fail to reject the null hypothesis.

3. Express the statistical decision in terms of the particular situation or scenario.

Here is the application of the three-part hypothesis test conclusion to our scenarios.

Scenario One
Z test: Since the p-value of 0.000000019 is < alpha, reject the null hypothesis, and conclude the population mean is less than 24 days.

t test: Since the p-value of 0.00000316 is < alpha, reject the null hypothesis, and conclude the population mean is less than 24 days.

Scenario Two

The Z and t tests had the same p-value: Since the p-value of 0.43 is > alpha, fail to reject the null hypothesis, and conclude that the population mean is equal to 20.9. To take into account the possibility of a Type II error, statisticians prefer this statement: there is no evidence that the population mean cycle time is different from 20.9. I don't think the precise wording is as important as the care needed in conducting the analysis.

Scenario Three

Z test: Since the p-value of 0.0688 is > alpha, fail to reject the null hypothesis, and conclude that the population mean is equal to 20 (again, there is no evidence that the average cycle time is different from 20).

t test: Since the p-value of 0.079 is > alpha, fail to reject the null hypothesis, and conclude that the population mean is equal to 20.

I like the three-part conclusion since it satisfies the statistician with "good science practice," and the business person since the conclusion is also in "English." When one reads Research Level I publications, you often simply see p < 0.05 for the conclusion. That is short hand for: since the p-value is less than alpha of 0.05, reject the null hypothesis in favor of the alternative, and conclude ....

A Note on Comparing the Confidence Interval to the Two-Tail Hypothesis Test

Recall in Module 1.4 that the 95% confidence interval for the population mean came out to be 21 +/- 1.1 or 19.9 to 22.1. In the two-tail hypothesis test of Scenario Three, the hypothesized mean of 20.0 falls within the range of 19.9 to 22.1. Since this range includes 20.0, we cannot refute the statement that the population mean is equal to 20. When the hypothesized mean falls outside the confidence interval, the p-value of the hypothesis test will be less than the significance level of 0.05 and we will reject the null hypothesis. For example, suppose the null hypothesis is that the true population mean is 18. This value falls outside the confidence interval range of 19.9 to 22.1, so we reject the null hypothesis and conclude that the true population mean is not equal to 18. The p-value for the hypothesis test will be < 0.05 in this example.
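The agreement between the confidence interval and the two-tail test can be checked with a short Python sketch. It assumes the interval is built with the Z distribution (sample mean 21, standard deviation 3, sample size 30); the variable and function names are my own.

```python
import math
from statistics import NormalDist

sample_mean, sd, n, alpha = 21, 3, 30, 0.05
standard_error = sd / math.sqrt(n)

# 95% confidence interval: sample mean +/- critical Z * standard error
z_critical = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96
margin = z_critical * standard_error
lower, upper = sample_mean - margin, sample_mean + margin

def two_tail_p(hypothesized_mean):
    """Two-tail p-value of the Z test for a given hypothesized mean."""
    z = (sample_mean - hypothesized_mean) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

print(f"95% interval: {lower:.1f} to {upper:.1f}")
for mu0 in (20, 18):
    inside = lower <= mu0 <= upper
    print(f"Ho mean {mu0}: inside interval = {inside}, p-value = {two_tail_p(mu0):.2g}")
```

The hypothesized mean of 20 falls inside the interval and its p-value is above 0.05, while 18 falls outside and its p-value is far below 0.05. The p-value for 20 comes out slightly below the 0.0688 shown earlier because this sketch uses the unrounded Z rather than 1.82.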

Ethical Issues

Remember that we are making inferences based on a sample, and it is assumed that the sample is unbiased without measurement error. Further, when we report the findings of a hypothesis test, we need to be as complete as possible so that our study can be replicated if need be.

I just heard a news report that the famous "Mozart Effect" study done in 1993 is being disputed. That study presented the hypothesis that classical music in the background would improve student problem-solving performance on certain categories of problems involving temporal and spatial dimensions. It has led to many extensions (playing classical music to babies to make them "smarter," etc.). This year, researchers at several universities tried to replicate the results and could not (they failed to reject the null hypothesis of no difference in performance). The original researcher claimed, in a news report the week of August 23, 1999, that the replications did not follow the original data collection method. That researcher is on somewhat shaky ground however, since the original study involved a convenience sample of upper division college students. There is nothing wrong with using convenience samples, but one's conclusions cannot be extended beyond that "population", and certainly not to infants.

The other ethical issue involves data snooping. One cannot look at the data, test statistic values and related p-values and then decide to use a one- or two-tail test. Recall in Scenario Three that the two-tail p-value was 0.0688, and we failed to reject the null hypothesis at alpha of 0.05. But, if we used a one-tail test, we would have rejected the null hypothesis at alpha level of 0.05 since the p-value was 0.0344.

Good science includes establishing your hypothesis, setting the level of significance, and collecting the data before the p-values are compared to alpha and the conclusion is reached.

References:

Anderson, D., Sweeney, D., & Williams, T. (2001). Contemporary Business Statistics with Microsoft Excel. Cincinnati, OH: South-Western, Chapter 9 (except Section 9.6).

Sheskin, D. (1997). Handbook of Parametric and Non parametric Statistical Procedures. Boca Raton, FL: CRC Press LLC.