"Testing Hypothesis" |
Module 1.4 covered one of the two classic
methods of inferential statistics - confidence interval
estimation. That method is most frequently used to answer
exploratory research questions, such as "what is the average
cycle time," or "what is the average profit contribution?"
Explanatory research questions are frequently answered through
the other method of inferential statistics - the testing of
hypotheses, or hypothesis-testing. We start with a belief, claim,
prediction, or assertion (hypothesis) about the parameter of interest
(in this module, the population mean, which we estimate with the sample mean). Then we
gather a sample, compute sample statistics, and make a conclusion
about that hypothesis. Since we are working with a sample, we have to
add an appropriate measure of reliability. The following notes cover
a five-step methodology for hypothesis-testing.
Step 1: State the Null and Alternative Hypotheses
To illustrate statistical statements of hypotheses, I'll present
three hypothesis test scenarios. In Scenario One, suppose someone
believes the true average cycle time is less than 24 days. Recall
that cycle time in this illustration is the time between when a company places an order for material and when the material is received. But cycle time is a variable closely monitored in many
activities - the time to pay an account, the time to process a
customer service request, the time to hang a bottle of blood for a
blood transfusion, and so forth.
Back to this scenario: the belief or prediction that true average
cycle time is less than 24 days is generally based upon someone's
knowledge of the underlying process. Perhaps they were involved in an effort to improve the cycle time process. Last year, the
average cycle time may have been 24 days, improvements were made, and
this year they expect cycle time to improve (< 24 days, on
average). This belief, or research hypothesis, is generally what the
analyst tries to prove or support by gathering evidence. In
statistics, it is called the alternative hypothesis, also
known as the research hypothesis (symbol Ha
or you will also see H1 in some texts
and journals). The hypothesis that complements the alternative is
called the null hypothesis (symbol H0), or
hypothesis of equality. The statistical hypothesis statements are
written as follows:
Scenario One
H0: Population Mean = 24 (this is the null hypothesis)
Ha: Population Mean < 24 (this is the alternative hypothesis)
For Scenario Two, suppose someone believes the true average cycle time to be greater than 20.9 days (the alternative hypothesis). Perhaps the scenario is that last year the company was using vendors who shipped less-than-truckload, and the average cycle time experience was 20.9 days (the null hypothesis). This year they switched to vendors who ship by truckload (cheaper, but it takes longer), thus they predict the cycle time will go up compared to last year. The null and alternative hypothesis statements are written:
Scenario Two
H0: Population Mean = 20.9 (this is the null hypothesis)
Ha: Population Mean > 20.9 (this is the alternative hypothesis)
For Scenario Three, now suppose last year the average cycle time was 20 days and vendors were replaced, so that a change in cycle time is expected but no one knows whether the change will increase or decrease cycle time. The alternative hypothesis would be that the cycle time is not equal to 20. The null hypothesis is that the cycle time is equal to 20. The statements for this test would be written:
Scenario Three
H0: Population Mean = 20 (this is the null hypothesis)
Ha: Population Mean =/= 20 (=/= is the symbol for not equal)
Note carefully that each scenario involved two statistical hypothesis statements. The first two scenarios involved directional hypothesis tests, or one-tailed tests. Specifically, Scenario One is a lower-tail test: to support the alternative hypothesis, we would have to find sample means much lower than the hypothesized mean. Scenario Two is specifically called an upper-tail directional test: to support the alternative hypothesis, we would have to find sample means much higher than the hypothesized mean. The third scenario involved a nondirectional, or two-tailed, test. In all tests, the null hypothesis:
1. Contains the equal sign; thus it is sometimes referred to as the hypothesis of no difference or no effect. Some texts write the null hypothesis as >= for an alternative written as <, and as <= for an alternative written as >, so that the null hypothesis always covers the direction opposite the alternative. Classical hypothesis testing simply puts the = sign in the null, which makes it easier to see that the null is always the hypothesis of equality.
2. Is stated in specific terms regarding what the true value of the population parameter (in this case, the population mean) is predicted to be (24, 20.9 and 20 were the values for these three separate scenarios).
3. Is the hypothesis to be tested. We either reject or fail to reject the null. If we reject the null hypothesis, we do so in favor of the alternative because the evidence we have gathered supports the alternative. If we fail to reject the null hypothesis, we have insufficient evidence to support the alternative. Thus the null hypothesis "presumes innocence until proven guilty."
In all statements of hypothesis tests, the alternative hypothesis:
1. Does not contain the equal sign.
2. Is the conclusion supported (must be true) when the null hypothesis is rejected (proven to be false).
Please note that researchers and business data analysts would test only one set of statistical hypothesis statements to answer a specific research question with a sample of data. A typical scenario might be Scenario One: last year the mean cycle time was 24 days, a continuous improvement program was then initiated, and they want to see if cycle time decreases because of the continuous improvement. I presented the two other scenarios for illustration purposes.
How do we know whether to fail to reject the null hypothesis, or to reject the null in favor of the alternative? We
gather a sample set of data from the population of interest, find the
sample statistic that best estimates the population parameter under
investigation, find the probability of getting the sample statistic
if the null hypothesis is true, and make a conclusion based on the
probability. The concept is simple: in scenario one, if our sample
mean comes out to be 7 we would say, "there is no way we could get a
sample mean equal to 7 if the true population mean was equal to 24,
so reject the null in favor of the alternative." But what if the sample mean came out to be 23.999? There is a fairly high probability of getting a sample mean of 23.999, if the true population mean was in fact 24, just by chance alone. In this case, we would fail to reject the null hypothesis. In other words, we have not gathered enough evidence to conclude that the continuous improvement program worked - the sample mean differs from the hypothesized population mean by an amount easily explained by sampling error.
While many hypothesis tests are supported by observation such as
above, we obviously need more precision in making the decision to
reject or fail to reject the null hypothesis. That precision comes in
steps 2 through 5.
Step 2: Determine and Compute
the Test Statistic
The general form of the hypothesis test statistic is shown in Equation 1.5.1:
Eq. 1.5.1: Test Statistic = (Estimator - Hypothesized Value of the Parameter) / Standard Error of the Estimator
There are two test statistics for testing a population mean: the Z and the t. The Z test of hypothesis for a population mean is used when the population standard deviation is known:
Eq. 1.5.2: Z = (Sample Mean - Hypothesized Mean) / [Population Standard Deviation / Sq. Rt.(n)]
The assumptions for the Z test are:
1. The population standard deviation is known.
2. The numerical data are independently and randomly drawn from a population known to be normally distributed.
3. If the population is not normally distributed, the sampling distribution of the sample mean can still be approximated by the normal distribution as long as the sample size is large (n > 30).
Suppose we want to test the following hypotheses and know that the population standard deviation is 3:
Scenario One
H0: Population Mean = 24 (this is the null hypothesis)
Ha: Population Mean < 24 (this is the alternative hypothesis)
We would gather a sample, compute the sample mean and then solve for Z using Equation 1.5.2. Let's say the sample mean is 21, and the sample size is 30. Then:
Eq. 1.5.3: Z = (21 - 24) / [3 / Sq. Rt.(30)]
= -3 / (3 / 5.5)
= -3 / 0.55
= -5.5
The interpretation: the sample mean of 21 is 5.5 standard errors below the hypothesized mean of 24 (21 is quite far from 24 in terms of standard errors, and in the direction of the alternative hypothesis, casting doubt on the truth of the null hypothesis, as we will see in Steps 3 - 5).
This formula can easily be constructed in an active cell on an Excel worksheet. The cell formula would be: =(21-24)/(3/SQRT(30))
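If you want to check the arithmetic outside Excel, here is a minimal Python sketch of the same Z calculation (it uses only the standard library; the variable names are mine, not part of the module):

from math import sqrt

sample_mean = 21        # mean of the sample we gathered
hypothesized_mean = 24  # value stated in the null hypothesis
sigma = 3               # known population standard deviation
n = 30                  # sample size

standard_error = sigma / sqrt(n)                       # about 0.548
z = (sample_mean - hypothesized_mean) / standard_error
print(round(z, 2))      # about -5.48; the notes round Sq. Rt.(30) to 5.5 and report -5.5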
The other test statistic is the t. The t test is used when the
population standard deviation is unknown and must be estimated by the
sample standard deviation. Because the population standard
deviation is generally unknown, this is the more common test
statistic. The formula for the t statistic is:
Eq. 1.5.4: t = (Sample Mean - Hypothesized Mean) / [Sample Standard Deviation / Sq. Rt.(n)]
The assumptions for the t test:
1. The population standard deviation is unknown and is estimated by the sample standard deviation.
2. Numerical data is independently and randomly drawn from a normal distribution.
3. If the population is not normal, but not very skewed and the sample size is large (> 30), the t distribution provides a good approximation to the sampling distribution of the sample mean.
Note the only difference in this formula and Eq.
1.5.2 is that we use the sample standard deviation, s, rather than
the population standard deviation. If the sample mean is 21, the
sample standard deviation is 3, the sample size is 30, and the
hypothesized value of the population mean is 24, the t statistic has a value of -5.5, the same as the Z result in Eq. 1.5.3. Any difference between the Z and the t will appear when we compute probabilities in Step 3, although with large sample sizes the Z and the t are nearly identical, as was noted in the Module 1.4 Notes.
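Here is the matching Python sketch for the t statistic (again standard library only; s is the sample standard deviation given in the text, and the names are mine):

from math import sqrt

sample_mean = 21
hypothesized_mean = 24
s = 3                    # sample standard deviation, estimating the unknown sigma
n = 30

t_stat = (sample_mean - hypothesized_mean) / (s / sqrt(n))
print(round(t_stat, 2))  # about -5.48, the same as Z here because s happens to equal sigma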
Before we compute the probabilities, let's compute the values of the
test statistics for Scenarios Two and Three. For each scenario, we
will assume that the population standard deviation and the sample
standard deviation are the same (3), the sample size is 30, and the
sample mean is 21. Since the population and sample standard
deviations are assumed equal, the Z and the t values will be equal.
Scenario Two
Eq. 1.5.5: Z = t = (21 - 20.9) / [3 / Sq. Rt.(30)]
= 0.1 / (3 / 5.5)
= 0.1 / 0.55
= 0.18
The interpretation: the sample mean of 21 is just 0.18 standard errors above the hypothesized mean of 20.9 (a sample mean of 21 is a perfectly reasonable outcome if the null hypothesis is indeed true, as we will see in Steps 3 - 5).
Scenario Three
Eq. 1.5.6: Z = t = (21 - 20) / [3 / Sq. Rt.(30)]
= 1.0 / (3 / 5.5)
= 1.0 / 0.55
= 1.82
The interpretation: the sample mean of 21 is 1.82 standard errors above the hypothesized mean of 20 (since this isn't a clear case of rejecting the null hypothesis as in Scenario One, or of failing to reject the null hypothesis as in Scenario Two, we need the precision of Step 3 to make the decision). Please note that since Scenario Three is a two-tailed test, we have to consider the possibility of getting a Z or a t as extreme as 1.82 in either direction, that is, 1.82 or -1.82.
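All three test statistics can be produced in one pass with a short Python loop (a sketch under the same assumptions used throughout: sample mean 21, standard deviation 3, n = 30):

from math import sqrt

sample_mean, sd, n = 21, 3, 30
standard_error = sd / sqrt(n)

# hypothesized means for Scenarios One, Two and Three
for scenario, hypothesized_mean in [("One", 24), ("Two", 20.9), ("Three", 20)]:
    stat = (sample_mean - hypothesized_mean) / standard_error
    print(scenario, round(stat, 2))  # about -5.48, 0.18 and 1.83 (the notes round to -5.5, 0.18 and 1.82)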
Step 3: Find the Probability of the Test Statistic (p-Value)
At this point, we want to know the probability of obtaining a test statistic at least as small as the calculated statistic (for < directional alternative hypothesis tests such as the Scenario One example); the probability of obtaining a test statistic at least as large as the calculated statistic (for > directional alternative hypothesis tests such as the Scenario Two example); or the probability of obtaining a test statistic at least as large or as small as the calculated test statistic (for nondirectional =/= alternative hypothesis tests such as the Scenario Three example). In hypothesis testing, these probability values are called p-values. I should point out that I am following the p-value approach to hypothesis testing to focus on the approach most widely used in the literature, rather than the Critical Value approach provided in some texts (Anderson, Sweeney & Williams, pp. 334-337, Chapter 9).
Probability tables for finding p-values are built into Excel. For
probabilities associated with Z test statistics (Z-Scores), select an
active cell in an open worksheet, select Insert from the
Standard Toolbar, then Function, Statistical, NORMSDIST, and
then enter the Z-Score to get the cumulative probability up to the
Z-Score. You may recall the NORMSDIST function from Module 1.3 Notes.
p-Values for Z Test Statistics
Scenario One
Eq. 1.5.7: =NORMSDIST(-5.5)
This equation is what you enter in an active cell on an Excel worksheet to get Probability(Z < -5.5) for this one-tail test. This is equivalent to stating Probability(Sample Mean < 21 given the true mean is 24). Excel returns 1.9E-08 in the active cell.
Interpretation: 1.9E-08 is scientific notation, meaning move the decimal point eight digits to the left, giving 0.000000019. This says the probability of getting a Z-Score of less than -5.5 is 0.000000019, a very small probability. Remember, the Z-value of -5.5 really represents the number of standard errors the sample mean of 21 is below the hypothesized mean of 24. Thus, the probability of getting a sample mean as low as 21 is relatively low if the null hypothesis is true (population mean = 24); so the null hypothesis must be rejected in favor of the alternative based on the evidence in this sample. We will put more precision in determining what is "relatively low" in Step 4.
Scenario Two
Eq. 1.5.8: =1 - NORMSDIST(0.18)
This equation is what you enter in an active cell of an Excel worksheet to get Probability(Z > 0.18) for this one-tail test. This is equivalent to Probability(Sample Mean > 21 given the true mean is 20.9). Excel returns a p-value of 0.43 in the active cell. Note that since the NORMSDIST function returns the cumulative probability up to the Z-Score, we have to use =1 - NORMSDIST(0.18) for this upper-tail test because we are interested in the probability above the Z-Score of 0.18.
Interpretation: The probability of obtaining a sample mean as large as 21 is relatively high if the null hypothesis is true (population mean = 20.9); so the null hypothesis cannot be rejected based on the evidence in this sample. As with Scenario One, we will put more precision in determining what is "relatively high" in Step 4.
Scenario Three
Eq. 1.5.9: =2*(1-NORMSDIST(ABS(1.82)))
This equation is what you enter in an active cell of an Excel worksheet to get Probability(Z > 1.82 or Z < -1.82) for this two-tail test. This is equivalent to Probability(Sample Mean > 21 or < 19 given the true mean is 20). Excel returns a p-value of 0.0688 in the active cell.
Interpretation: The probability of obtaining a sample mean as large as 21 is 3.44% if the null hypothesis is true (population mean = 20). But since we are doing a two-tail test, we have to multiply 3.44% by 2, since we could just as likely have obtained a sample mean 1.82 standard errors below the hypothesized mean. Note that I used the absolute value function nested within the NORMSDIST function to give you a formula that works for two-tail tests whether Z comes out positive or negative. Further note, to determine whether 6.88% is relatively high or low, we need the precision to be presented in Step 4. Before doing this, we also need to compute the p-values for the t statistics, which follow the quick Z-based check below.
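Here is a short Python sketch that reproduces the three Z-based p-values (it assumes the scipy package is available and plugs in the rounded test statistics from Step 2 so the numbers match the notes):

from scipy.stats import norm

p_one   = norm.cdf(-5.5)            # lower-tail test, Ha: mean < 24   -> about 1.9e-08
p_two   = 1 - norm.cdf(0.18)        # upper-tail test, Ha: mean > 20.9 -> about 0.43
p_three = 2 * (1 - norm.cdf(1.82))  # two-tail test,  Ha: mean =/= 20  -> about 0.0688
print(p_one, p_two, p_three)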
p-Values for t Test Statistics
To get probability p-values for the t test statistic from the t
distribution, we use the TDIST function of Microsoft Excel. Select an
active cell for the p-value, and then select Insert from the
Standard Toolbar, Function, Statistical, and
TDIST. Note that the TDIST function requires the absolute value of the t statistic we computed in Step 2, the degrees of freedom (the sample size minus one), and whether the test is one- or two-tailed.
Scenario One
Eq. 1.5.10: =TDIST(5.5,30-1,1)
When =TDIST(5.5,30-1,1) is entered in an active cell, Excel returns the p-value associated with the t statistic: 3.16E-06, or 0.00000316. This probability would be interpreted similarly to the p-value for the Z test statistic in Eq. 1.5.7. Note that the t value is always entered as a positive number in the TDIST function.
Scenario Two
Eq. 1.5.11: =TDIST(0.18,30-1,1)
When =TDIST(0.18,30-1,1) is entered in the active cell, Excel returns 0.43. This probability would be interpreted similarly to the p-value for the Z test statistic in Eq. 1.5.8. Note that you do not have to enter =1 - before the function as was done in Eq. 1.5.8, since the TDIST function returns tail probabilities directly.
Scenario Three
Eq. 1.5.12: =TDIST(1.82,30-1,2)
When =TDIST(1.82,30-1,2) is entered in an active cell, Excel returns 0.079. This p-value would be interpreted similarly to the p-value for the Z test statistic in Eq. 1.5.9. Note that you do not have to multiply the p-value by 2, since for the t distribution the number of tails for the test is part of the function.
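The same three p-values can be reproduced with the t distribution in scipy (a sketch under the same scipy assumption; the degrees of freedom are 29 because the sample size is 30):

from scipy.stats import t

df = 30 - 1                    # degrees of freedom
p_one   = t.sf(5.5, df)        # one-tail, like =TDIST(5.5,29,1)  -> about 3.2e-06
p_two   = t.sf(0.18, df)       # one-tail, like =TDIST(0.18,29,1) -> about 0.43
p_three = 2 * t.sf(1.82, df)   # two-tail, like =TDIST(1.82,29,2) -> about 0.079
print(p_one, p_two, p_three)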
Have you noticed that the Z and the t values and probabilities are
similar? They will be identical at really large sample sizes (above
120) and nearly identical at large sample sizes (30 or more). They
will also be closer near the peak of the bell-shaped distribution, where probability values are close to 0.50. Note that in Scenario Two, the p-values were identical at 0.43 (when rounded to two decimals).
Step 4: Determine the Level of
Statistical Significance
In the above equations, I have provided
practical interpretations of low or high p-values associated with the
Z or t test statistics. When the p-value was low, we rejected the
null hypothesis in favor of the alternative. In hypothesis testing,
this would indicate that the analysis is statistically
significant. Scientific convention has established that in order
to declare the result of a hypothesis test statistically significant,
there can be no more than a 5% likelihood that the difference is due
to chance (D. Sheskin, 1997). The 5% threshold is referred to as the
level of significance. Knowing the level of significance for a
study, we can now present a simple decision rule for rejecting or
failing to reject the null hypothesis.
When the p-value is < 0.05, reject the null hypothesis. With such a low p-value, there is little likelihood that the observed difference between the sample mean and the hypothesized mean is due to chance - it must be due to some program, process change, intervention or other effect.
When the p-value is > 0.05, fail to reject the null hypothesis. The p-value indicates a high probability that the observed difference between the sample mean and the hypothesized mean is due to the chance involved in sampling error.
Those are the basics; now let's examine the alpha level of significance in some more detail. Since we are working with a sample, we can make two types of errors in hypothesis testing:
Type I Error: Rejecting a true null hypothesis. In hypothesis testing, the probability of making a Type I error is labeled alpha, the level of significance.
Type II Error: Failing to reject a false null hypothesis. The probability of making a Type II error is labeled beta.
The complementary probabilities are:
Confidence Coefficient: Failing to reject a true null hypothesis. This probability is labeled (1 - alpha). We already saw this in Module 1.4 Notes - it is the basis of the confidence interval. An alpha level of significance of 0.05 provides a 95% confidence coefficient.
Power: Rejecting a false null hypothesis. This probability is labeled (1 - beta).
The interested reader is referred to the
Anderson, Sweeney and Williams optional reference for additional
details. For our application, remember the simple decision rule. When
the p-value is < alpha = 0.05, reject the null hypothesis; when
the p-value is > alpha = 0.05, fail to reject the null
hypothesis.
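As a tiny illustration of this decision rule in code (a sketch only; the function name is mine, not a standard one):

def decide(p_value, alpha=0.05):
    # simple p-value decision rule from Step 4
    if p_value < alpha:
        return "reject the null hypothesis in favor of the alternative"
    return "fail to reject the null hypothesis"

print(decide(1.9e-08))  # Scenario One   -> reject
print(decide(0.43))     # Scenario Two   -> fail to reject
print(decide(0.0688))   # Scenario Three -> fail to reject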
I close this step by noting that when a researcher believes alpha = 0.05 is too high, they may elect to employ a 1% level of significance, or an even lower level in some cases of medical research. The lower the level of significance, the less likely one is to reject the null hypothesis and conclude that the research effort was successful. While 0.05 is common in business applications, the choice is a matter of judgment. When the consequences of making a Type I error are much more severe than the consequences of a Type II error, researchers switch to the more conservative alpha = 0.01. This increases the beta probability, which in turn lowers the power of the test, so researchers must recognize the tradeoff. We will
adopt the tradition of using 5% levels of significance for hypothesis
testing.
Step 5: Making the Hypothesis
Test Conclusion
The final step puts it all together with a
three-part conclusion:
1. Compare the p-value to alpha.
2. Based on the comparison, state whether to reject or fail to reject the null hypothesis.
3. Express the statistical decision in terms of the particular situation or scenario.
Here is the application of the three-part hypothesis test conclusion to our scenarios.
Scenario One
Z test: Since the p-value of 0.000000019 is < alpha, reject the null hypothesis, and conclude the population mean is less than 24 days.
t test: Since the p-value of 0.00000316 is < alpha, reject the null hypothesis, and conclude the population mean is less than 24 days.
Scenario Two
The Z and t tests had the same p-value: Since the p-value of 0.43 is > alpha, fail to reject the null hypothesis, and conclude that the population mean is equal to 20.9. To take into account the possibility of a Type II error, statisticians prefer this statement: there is no evidence that the population mean cycle time is different from 20.9. I don't think the precise wording is as important as the care needed in conducting the analysis.
Scenario Three
Z test: Since the p-value of 0.0688 is > alpha, fail to reject the null hypothesis, and conclude that the population mean is equal to 20 (again, there is no evidence that the average cycle time is different from 20).
t test: Since the p-value of 0.079 is > alpha, fail to reject the null hypothesis, and conclude that the population mean is equal to 20.
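To tie Steps 2 through 5 together for Scenario Three in one place, here is a compact Python sketch (scipy assumed; the numbers are the ones used throughout these notes):

from math import sqrt
from scipy.stats import t

sample_mean, hypothesized_mean, s, n, alpha = 21, 20, 3, 30, 0.05

t_stat = (sample_mean - hypothesized_mean) / (s / sqrt(n))  # Step 2: about 1.83
p_value = 2 * t.sf(abs(t_stat), n - 1)                      # Step 3: about 0.078 (0.079 with the rounded 1.82)
if p_value < alpha:                                         # Steps 4 and 5
    print("Reject the null: the mean cycle time differs from", hypothesized_mean)
else:
    print("Fail to reject the null: no evidence the mean cycle time differs from", hypothesized_mean)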
I like the three-part conclusion since it
satisfies the statistician with "good science practice," and the
business person since the conclusion is also in "English." When you read Research Level I publications, you often simply see p < 0.05 for the conclusion. That is shorthand for: since the p-value is less
than alpha of 0.05, reject the null hypothesis in favor of the
alternative, and conclude ....
A Note on Comparing the
Confidence Interval to the Two-Tail Hypothesis Test
Recall in Module 1.4 that the 95%
confidence interval for the population mean came out to be 21 ± 1.1, or 19.9 to 22.1. In the two-tail hypothesis test of
Scenario Three, the hypothesized mean of 20.0 falls within the range
of 19.9 to 22.1. Since this range includes 20.0, we cannot refute the
statement that the population mean is equal to 20. When the
hypothesized mean falls outside the confidence interval, the p-value
of the hypothesis test will be less than the significance level of
0.05 and we will reject the null hypothesis. For example, suppose the
null hypothesis is that the true population mean is 18. This value
falls outside the confidence interval range of 19.9 to 22.1, so we
reject the null hypothesis and conclude that the true population mean
is not equal to 18. The p-value for the hypothesis test will be <
0.05 in this example.
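A short Python sketch of this confidence-interval check (scipy assumed; the margin works out to about 1.07 here, close to the 1.1 reported in Module 1.4):

from math import sqrt
from scipy.stats import norm

sample_mean, sd, n = 21, 3, 30
standard_error = sd / sqrt(n)
margin = norm.ppf(0.975) * standard_error    # roughly 1.07
low, high = sample_mean - margin, sample_mean + margin
print(round(low, 1), round(high, 1))         # about 19.9 and 22.1

# A hypothesized mean of 20 falls inside the interval -> fail to reject at alpha = 0.05.
# A hypothesized mean of 18 falls outside the interval, and its two-tail p-value is below 0.05:
z_18 = (sample_mean - 18) / standard_error
print(2 * (1 - norm.cdf(z_18)))              # well below 0.05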
Ethical Issues
Remember that we are making inferences
based on a sample, and it is assumed that the sample is unbiased and free of measurement error. Further, when we report the findings of a
hypothesis test, we need to be as complete as possible so that our
study can be replicated if need be.
I just heard a news report that the famous "Mozart Effect" study done
in 1993 is being disputed. That study presented the hypothesis that
classical music in the background would improve student
problem-solving performance on certain categories of problems
involving temporal and spatial dimensions. It has led to many
extensions (playing classical music to babies to make them "smarter,"
etc.). This year, researchers at several universities tried to
replicate the results and could not (they failed to reject the null
hypothesis of no difference in performance). The original researcher claimed, in a news report the week of August 23, 1999, that the replications did not follow the original data collection method. That
researcher is on somewhat shaky ground however, since the original
study involved a convenience sample of upper division college
students. There is nothing wrong with using convenience samples but
one's conclusions cannot be made beyond that "population". Certainly
not to infants.
The other ethical issue involves data snooping. One cannot look at
the data, test statistic values and related p-values and then
decide to use a one- or two-tail test. Recall in Scenario Three that
the two-tail p-value was 0.0688, and we failed to reject the null
hypothesis at an alpha of 0.05. But if we had used a one-tail test, we would have rejected the null hypothesis at the 0.05 level, since the one-tail p-value was 0.0344.
Good science includes establishing your hypothesis, setting the level
of significance, and collecting the data before the p-values
are compared to alpha and the conclusion is reached.
References:
Anderson, D., Sweeney, D., & Williams, T. (2001). Contemporary Business Statistics with Microsoft Excel. Cincinnati, OH: South-Western, Chapter 9 (except Section 9.6).
Sheskin, D. (1997). Handbook of Parametric and Nonparametric Statistical Procedures. Boca Raton, FL: CRC Press LLC.