"Confidence Interval Estimation and Hypothesis Testing for a Proportion" 
Module Notes 5.1 covered the basics of
descriptive statistics for analysis of categorical data.
Pause and Reflect
We describe data for categorical variables by counting the observations of interest within the class or event divisions of the categorical variable. We can convert the count into a probability (also called long-term relative frequency, proportion or percent) by dividing the number of observations in the event of interest by the total observations from all the events in the sample space. When we are interested in studying the relationship between two categorical variables, we can examine joint and marginal events, as well as joint, marginal and conditional probabilities.
When the data collected for a categorical
variable represents a sample, from a survey for example, we are often
interested in making an inference from the sample to the population.
Recall that the tools of inferential statistics are confidence
intervals and tests of hypothesis.
One of the most popular applications of confidence intervals for the
population proportion is the manner in which public opinion polls are
reported. Often following a major political news story, the media
sponsors public opinion polls via the telephone. Typically 1,000 to
2,000 registered voters are called across the country and asked
questions like "do you agree or disagree with _____ on ______." At
the 11 p.m. news, we then hear something like "55% of registered voters
in the US agree with ___ on ____." The really good polling
organizations then follow the statement with, "the margin of error is
3%." Guess what: that's a confidence interval for a proportion!
Confidence Interval Estimation
for the Proportion
We are interested in using a sample
proportion to estimate the population proportion. We do this the same
way we build any other confidence interval.
Pause and Reflect
A confidence interval for the true population parameter being estimated is the sample estimate of that parameter plus/minus the sampling error. The sampling error is a function of the confidence level, the variation in the data, and the sample size.
The population parameters of interest up to now
were the mean, median and variance of numerical variables, and the
slope and dependent variable of regression relationships. The
population parameter of interest to us now is the population
proportion, p. The sample statistic used to estimate p is the sample
proportion, p_{s}. The sample proportion is determined by
dividing the count of observations in the event class of interest by
the total observations in all event classes in the sample space.
For example, suppose we surveyed 1,000 shoppers at Kmart and asked
them to rate their shopping experience as excellent, good or poor.
Here, there are three event classes in the categorical variable
called "rating." Further suppose that of the 1,000 shoppers, 272
rated the experience as excellent.
Equation 5.2.1 shows the calculation for the sample
proportion:
Eq. 5.2.1: p_{s} = X / n = 272 / 1,000 = 0.272 or 27.2%
where X = the count of observations within the event class of interest and n = the sample size.
Technically, when a categorical variable of interest has only two outcomes, such as success or failure, male or female, excellent rating or not excellent rating, that categorical variable follows a distribution called the binomial distribution. Any sampling distribution used in estimating a population parameter has a mean and a standard error. For the sampling distribution of a proportion based on the binomial distribution, the mean is estimated by p_{s} and the standard error is:
Eq. 5.2.2: Standard Error of p_{s} = Sq Rt [ p_{s} * (1 - p_{s}) / n ] = Sq Rt [ 0.272 * (1 - 0.272) / 1,000 ] = 0.014
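For readers who like to check the arithmetic in code, here is a minimal Python sketch of the sample proportion and its standard error, using the Kmart figures from these notes. The function name `standard_error` is just my own label for illustration.

```python
import math

def standard_error(p_s: float, n: int) -> float:
    """Estimated standard error of a sample proportion: sqrt(p_s * (1 - p_s) / n)."""
    return math.sqrt(p_s * (1 - p_s) / n)

successes, n = 272, 1000   # Kmart example: 272 "excellent" ratings out of 1,000 shoppers
p_s = successes / n        # sample proportion
se = standard_error(p_s, n)
print(f"p_s = {p_s:.3f}, standard error = {se:.3f}")  # p_s = 0.272, standard error = 0.014
```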
As long as the sample size is assumed to be
sufficiently large, the binomial distribution is approximated by
the normal distribution which means that we can use the Z statistic
to form the sampling error portion of the confidence interval. The
test for "large" is that both np_{s} and
n(1 - p_{s}) are at least 5. For this sample, we meet
the "large" test since np_{s} = 1,000 * 0.272 = 272 and
n(1 - p_{s}) = 1,000 * (1 - 0.272) = 728; 272 and 728 are both
greater than 5.
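The "large" test is easy to automate. The sketch below is one way to do it; it works from the counts directly, since np_{s} is simply the number of successes and n(1 - p_{s}) is the number of failures. The helper name `is_large_sample` is hypothetical.

```python
def is_large_sample(successes: int, n: int, minimum: int = 5) -> bool:
    """True when both n*p_s (successes) and n*(1 - p_s) (failures) are at least `minimum`."""
    failures = n - successes
    return successes >= minimum and failures >= minimum

print(is_large_sample(272, 1000))  # True: 272 and 728 both exceed 5
print(is_large_sample(3, 1000))    # False: only 3 successes
```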
Now we can put together the confidence interval for the population
proportion. The 95% confidence interval for the proportion is:
Eq. 5.2.3: 95% Confidence Interval for the True Population Proportion:
= p_{s} ± Z * Std Error of p_{s}
= p_{s} ± 1.96 * Sq Rt [ p_{s} * (1 - p_{s}) / n ]
= 0.272 ± 1.96 * Sq Rt [ 0.272 * (1 - 0.272) / 1,000 ]
= 0.272 ± 1.96 * 0.014
= 0.272 ± 0.028
or 0.244 < p < 0.300
The interpretation: we are 95% confident that
between 24.4% and 30.0% of this population rate the shopping
experience at Kmart as excellent. Although the math is fairly easy
for a confidence interval for a proportion, we can set up an Excel
Worksheet template as shown in Worksheet 5.2.1:

Row  Col I                  Col J   Col K
2    Confidence Interval for Proportion
3    n                      1000
4    Successes              272
5    Failures               728     = J3 - J4
6    Confidence Level       0.95
7
8    Sample Proportion      0.272   = J4 / J3
9    Z                      1.96    = NORMSINV(0.5+(J6/2))
10   Standard Error         0.014   = SQRT((J8*(1-J8))/J3)
11   Half-Width of C.I.     0.028   = J9 * J10
12   Lower Limit of C.I.    0.244   = J8 - J11
13   Upper Limit of C.I.    0.300   = J8 + J11
Worksheet 5.2.1
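The same template can be sketched in Python. This is only an illustration, not a replacement for the worksheet: it assumes Python 3.8 or later, where `statistics.NormalDist().inv_cdf` plays the role of Excel's NORMSINV, and the function name `proportion_ci` is my own.

```python
import math
from statistics import NormalDist

def proportion_ci(successes: int, n: int, confidence: float = 0.95):
    """Large-sample (normal approximation) confidence interval for a proportion."""
    p_s = successes / n                                # sample proportion
    z = NormalDist().inv_cdf(0.5 + confidence / 2)     # same idea as NORMSINV(0.5 + level/2)
    half_width = z * math.sqrt(p_s * (1 - p_s) / n)    # Z * standard error
    return p_s - half_width, p_s + half_width

lower, upper = proportion_ci(272, 1000)
print(f"{lower:.3f} < p < {upper:.3f}")  # 0.244 < p < 0.300
```

Changing the `confidence` argument (say, to 0.90 or 0.99) shows how the interval narrows or widens with the confidence level.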
Do you remember how the analyst can make
the sampling error portion of the confidence interval tighter
(smaller)? That's right: increase the sample size. What happens when
we use a sample size of 10,000?
Eq. 5.2.4: 95% Confidence Interval for the True Population Proportion:
= p_{s} ± Z * Std Error of p_{s}
= p_{s} ± 1.96 * Sq Rt [ p_{s} * (1 - p_{s}) / n ]
= 0.272 ± 1.96 * Sq Rt [ 0.272 * (1 - 0.272) / 10,000 ]
= 0.272 ± 1.96 * 0.0044
= 0.272 ± 0.009
or 0.263 < p < 0.281
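To see the effect of sample size for yourself, here is a small Python sketch that recomputes the sampling error for both sample sizes. It assumes the same sample proportion of 0.272 in each case, and the name `margin_of_error` is illustrative.

```python
import math

def margin_of_error(p_s: float, n: int, z: float = 1.96) -> float:
    """Half-width of the confidence interval: Z times the standard error."""
    return z * math.sqrt(p_s * (1 - p_s) / n)

for n in (1_000, 10_000):
    print(n, round(margin_of_error(0.272, n), 3))
# 1000 0.028
# 10000 0.009
```

Because n sits under a square root, a tenfold increase in sample size shrinks the sampling error by a factor of only about 3.16 (the square root of 10), not by a factor of 10.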
We have certainly come closer to capturing the
true population proportion. Of course we have: we sampled 10,000
people! That can be expensive. So, most of the time, you hear polling
companies say the sampling error (they actually say margin of error
most of the time) is ± 3%. How large a sample do you
need to get a margin of error of 3%?
Computational Formula for the Sample Size for Estimating the
Proportion
To find this out, we solve the following sample size
formula:
Eq. 5.2.5: n = [ Z^{2} * p * (1 - p) ] / e^{2}
where e = the desired sampling error,
and p = true population proportion.
Often (most of the time) we do not know the true population proportion. If we did, we would not need to find the confidence interval! So, a conservative approach is to let p = 0.50, which results in the largest value for n. If we want to be 95% confident, and want a sampling error of 3%, the formula becomes:
Eq. 5.2.6: n = [ 1.96^{2} * 0.50 * (1 - 0.50) ] / 0.03^{2} = 1,067.
As I stated earlier, samples of about this size are what polling
companies most often use. Of course, to get 1,067 responses,
you have to call more than 1,067 citizens to account for the
nonresponses.
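Here is a minimal Python sketch of the sample size calculation, assuming the conservative p = 0.50 and Z = 1.96 used above. The function name is my own. Note that the raw answer is about 1,067.1; these notes report 1,067, while the common convention of rounding up to the next whole respondent gives 1,068.

```python
import math

def sample_size_for_proportion(e: float, z: float = 1.96, p: float = 0.50) -> float:
    """Raw (unrounded) sample size: Z^2 * p * (1 - p) / e^2."""
    return z**2 * p * (1 - p) / e**2

n_raw = sample_size_for_proportion(0.03)
print(round(n_raw, 1))    # 1067.1
print(math.ceil(n_raw))   # 1068 when rounding up to a whole respondent
```

Trying other values of p (for example 0.30) confirms that p = 0.50 gives the largest n, which is why it is the conservative choice.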
That is how one estimates a population proportion using a confidence
interval. Let me close by noting what to do if you have a finite
population.
Finite Population Correction Factor
Most populations we deal with are really finite, not infinite.
Technically speaking, the confidence interval formula for estimating
the population proportion applies to infinite populations. However,
in reality, the formula works just fine as long as the population is
not too small compared to the sample size. The rule of thumb is that
as long as the sample size is less than 5% of the population size
(Sincich, p. 412), there is no need to apply the finite population
correction factor (Levine, 2000).
For example, suppose the Kmart store used in these notes is located
in Cape Coral, with an approximate population of 100,000. If the
sample size is 1,000, then the ratio of sample size to population is
1,000/100,000 or 1% which passes the above rule of thumb. On the
other hand, if you planned to survey the 100 winter residents of
Upper Captiva on some issue, and your sample size was 50, then the
ratio of sample size to population size is 50% and the finite
population correction factor should be applied.
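The 5% rule of thumb can be written as a one-line check. The sketch below is illustrative only; the function name `needs_fpc` and the default threshold argument are my own choices based on the rule as stated above.

```python
def needs_fpc(n: int, N: int, threshold: float = 0.05) -> bool:
    """True when the sample is at least `threshold` (5%) of the population."""
    return n / N >= threshold

print(needs_fpc(1_000, 100_000))  # False: Cape Coral, ratio is 1%
print(needs_fpc(50, 100))         # True: Upper Captiva, ratio is 50%
```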
Any time we have to adjust for a finite population, the confidence
interval formula becomes:
Eq. 5.2.7: 95% Confidence Interval for the True Population Proportion with Finite Population Correction Factor:
= p_{s} ± Z * Std Error of p_{s} * Sq Rt [ (N - n) / (N - 1) ]
= p_{s} ± 1.96 * Sq Rt [ p_{s} * (1 - p_{s}) / n ] * Sq Rt [ (N - n) / (N - 1) ]
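As a sketch of how the correction factor changes the interval, the Python below applies it to the Upper Captiva setting (N = 100, n = 50). The notes do not give survey results for that example, so the count of 20 "yes" answers is a made-up, purely hypothetical number, and the function name is my own.

```python
import math

def proportion_ci_fpc(successes: int, n: int, N: int, z: float = 1.96):
    """Confidence interval for a proportion with the finite population correction."""
    p_s = successes / n
    std_error = math.sqrt(p_s * (1 - p_s) / n)
    fpc = math.sqrt((N - n) / (N - 1))      # shrinks toward 0 as n approaches N
    half_width = z * std_error * fpc
    return p_s - half_width, p_s + half_width

# Hypothetical data: 20 of the 50 sampled residents answer "yes".
lower, upper = proportion_ci_fpc(successes=20, n=50, N=100)
print(f"{lower:.3f} < p < {upper:.3f}")
```

The correction always makes the interval narrower than the uncorrected one, and if you survey the entire population (n = N) the half-width drops to zero, as it should.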
The sample size formula becomes:
Eq. 5.2.8: n = n_{o} * N / [ n_{o} + (N - 1) ]
where n_{o} is the sample size determined from Eq. 5.2.5
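A short Python sketch of the adjusted sample size follows, assuming the same inputs as before (95% confidence, p = 0.50, sampling error of 3%) and the two population sizes mentioned in these notes. The function name is illustrative, and the results are rounded up to whole respondents.

```python
import math

def finite_sample_size(n_0: float, N: int) -> float:
    """Adjust the infinite-population sample size n_0 for a population of size N."""
    return n_0 * N / (n_0 + (N - 1))

n_0 = 1.96**2 * 0.50 * (1 - 0.50) / 0.03**2   # unadjusted size, about 1,067

print(math.ceil(finite_sample_size(n_0, 100_000)))  # 1056 for Cape Coral
print(math.ceil(finite_sample_size(n_0, 100)))      # 92 for Upper Captiva
```

For the large population the adjustment barely matters, while for the 100 winter residents it tells us we would need to reach nearly everyone.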
The other tool of inferential statistics is the test of hypothesis,
this time for the true population proportion.
Test of Hypothesis for the
Proportion
Suppose we wanted to test if the true
population proportion of Kmart shoppers rating the shopping
experience at Kmart as excellent was equal to 32% versus not equal to
32%. We would proceed as with any hypothesis test, starting with the
hypotheses:
H_{0}: p = 0.32 (null hypothesis)
H_{a}: p ≠ 0.32 (alternative hypothesis)
We would gather our data, compute the sample statistics, and then the test statistic. Using the above example, the sample proportion for a sample size of 1,000 shoppers came out to be 0.272 or 27.2%. The test statistic for large sample test of hypothesis for the population proportion is the Z:
Eq. 5.2.9: Z = (p_{s} - p) / Sq Rt [ p * (1 - p) / n ], where p is the hypothesized proportion.
Z = (0.272 - 0.32) / Sq Rt [ 0.32 * (1 - 0.32) / 1,000 ] = -3.254
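The test statistic can be checked with a few lines of Python. This is a minimal sketch using the numbers from these notes; the function name `z_statistic` is my own. Notice that the standard error here uses the hypothesized proportion, not the sample proportion.

```python
import math

def z_statistic(p_s: float, p_0: float, n: int) -> float:
    """Large-sample Z test statistic for a proportion, with p_0 the hypothesized value."""
    std_error = math.sqrt(p_0 * (1 - p_0) / n)   # standard error under the null hypothesis
    return (p_s - p_0) / std_error

z = z_statistic(p_s=0.272, p_0=0.32, n=1000)
print(round(z, 3))  # -3.254
```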
The probability of getting a sample
proportion at least as far from the hypothesized 32% as our 27.2%,
assuming the null hypothesis is true, is the p-value. The lower-tail
probability is obtained by the Excel function =NORMSDIST(-3.254). This
function returns 0.0006. Since this is a two-tailed
test, and since the =NORMSDIST function returns a lower-tail
probability, we have to multiply that value by 2 to get the full
p-value. This gives 0.0011.

Row  Col R                  Col S    Col T
2    Hypothesis Test for the Proportion
3    n                      1000
4    Successes              272
5    Sample Proportion      0.272    = S4 / S3
6    Null Hypothesis p =    0.32
7    Standard Error         0.015    = SQRT((S6*(1-S6))/S3)
8    Alpha                  0.050
9    Z Test Statistic       -3.254   = (S5 - S6) / S7
10   p-value (1-tail)       0.0006   = (1-NORMSDIST(ABS(S9)))
11   p-value (2-tail)       0.0011   = 2 * S10
The conclusion is that since the p-value is less than the specified
alpha of 0.05, we reject the null hypothesis and conclude
that the population proportion is not equal to 32%. From the sample
proportion, we would further conclude that the proportion is less
than 32%, and from the confidence interval computed in Equation
5.2.3, we are confident that the true proportion lies between 24.4%
and 30.0%. Note that the confidence interval, which excludes 32%,
confirms the result of the test of hypothesis, as we would expect
with a not-equal, two-tailed alternative hypothesis at the same alpha.
Worksheet 5.2.2 provides a template for test of hypothesis for a
proportion.
Worksheet 5.2.2
This template is designed so that the user
need only insert information in cells S3, S4, S6 and S8. The
computation formulas in the remaining cells in Column S are shown as
notes in Column T.
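For comparison with the worksheet, here is a Python sketch of the same two-tailed test. It assumes Python 3.8 or later, where `statistics.NormalDist().cdf` stands in for Excel's NORMSDIST; the function name and its return layout are my own choices, not part of the template.

```python
import math
from statistics import NormalDist

def two_tailed_proportion_test(successes: int, n: int, p_0: float, alpha: float = 0.05):
    """Return (Z, two-tailed p-value, reject?) for H0: p = p_0 versus Ha: p != p_0."""
    p_s = successes / n
    std_error = math.sqrt(p_0 * (1 - p_0) / n)   # uses the hypothesized proportion
    z = (p_s - p_0) / std_error
    one_tail = 1 - NormalDist().cdf(abs(z))      # like 1 - NORMSDIST(ABS(Z))
    p_value = 2 * one_tail
    return z, p_value, p_value < alpha

z, p_value, reject = two_tailed_proportion_test(272, 1000, 0.32)
print(f"Z = {z:.3f}, p-value = {p_value:.4f}, reject H0: {reject}")
# Z = -3.254, p-value = 0.0011, reject H0: True
```

As with the worksheet, only the sample size, the number of successes, the hypothesized proportion and alpha need to be supplied.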
Material covered in this module note can be used to answer questions
one and two of the Airline Satisfaction Survey Assignment in Main
Module 5 Overview.
Our last Module Note 5.3 covers methods of comparing multiple samples
of categorical data.
References:
Anderson, D., Sweeney, D., &
Williams, T. (2001). Contemporary Business Statistics with Microsoft
Excel. Cincinnati, OH: South-Western, Chapter 7 (Section 7.6) and
Chapter 8 (Section 8.4).
Levine, D., Berenson, M., & Stephan,
D. (1999). Statistics for Managers Using Microsoft Excel (2nd
ed.). Upper Saddle River, NJ: Prentice-Hall, Chapters
7 and 8.
Mason, R., Lind, D., & Marchal, W. (1999). Statistical
Techniques in Business and Economics (10th ed.).
Boston: Irwin McGraw-Hill, Chapters 8 and 9.
Sincich, T. (1992). Business
Statistics by Example (4th ed.). New York: Dellen/Macmillan
Publishing Company.


