Chapter 5
The results of a single study are seldom definitive because of the inevitable presence of chance and the possibility of improper study design.
The perfect study has never been done. Each study is performed at a distinct time and place, with a unique group of subjects, by investigators using a particular study design. A variety of methods is used to carry out the study and to analyze its results. Chance always plays a part as well.
This is why different studies sometimes produce contradictory results. This is also why researchers almost never accept the results of a single study as definitive; the cumulative weight of evidence from several studies is necessary to draw sound conclusions.
How well a study was performed can be assessed, at least in part, by examining how the researchers sought to prevent and account for the common causes of distortion.
In laboratory studies, the most common cause of distortion is chance, including the effects of sample size. For epidemiological studies, bias and confounding, as well as chance, are important factors to consider.
Observed differences between groups of experimental animals or between populations of humans may be due to chance rather than due to real differences.
All experimental and epidemiological studies are based on samples from larger populations. An epidemiological study that compares the incidence of lung cancer among smokers and nonsmokers, for example, uses a group of smokers selected from the population of all smokers, and a group of nonsmokers selected from the population of all nonsmokers.
The use of samples inevitably introduces uncertainty into the results. Any sample will, by chance, differ at least a little from its parent population. The smaller the sample and the more diverse the parent population, the more likely the sample will differ from the parent population.
In any study comparing two groups, at least some of the difference between groups is due to this sampling effect.
Scientists use mathematical tests, based on the science of statistics, to estimate the size of the sampling effect and, therefore, the amount of uncertainty associated with the study's findings. The results of these statistical tests are commonly expressed as p values and confidence intervals (see following text).
Example of chance: Assume two populations each have a 10% incidence of a particular disease. A random sample from one could, by chance, end up with 15% sick people and a sample from the other, 9% sick people. To an observer, it might appear that one population had a higher incidence of disease, while no difference actually existed.
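The sampling effect is easy to demonstrate. The short Python sketch below draws repeated samples of 100 people from two hypothetical populations that each have exactly a 10% disease rate; the samples routinely differ from each other, and from the true 10%, purely by chance.

```python
# A minimal sketch of sampling variation. Both populations have exactly a
# 10% disease rate, yet individual samples differ by chance alone. The
# sample size and number of trials are arbitrary illustrative choices.
import random

random.seed(1)
TRUE_RATE = 0.10
SAMPLE_SIZE = 100  # with 100 people per sample, the count equals the percentage

for trial in range(1, 6):
    sample_a = sum(random.random() < TRUE_RATE for _ in range(SAMPLE_SIZE))
    sample_b = sum(random.random() < TRUE_RATE for _ in range(SAMPLE_SIZE))
    print(f"Trial {trial}: sample from A has {sample_a}% sick, "
          f"sample from B has {sample_b}% sick")
```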
P values indicate the probability that observed differences are due to chance rather than reflecting true differences.
A p value (probability value) shows the probability that the differences observed between two samples are due to chance variation in the samples rather than true differences in the parent populations.
P values range from 0 to 1. The closer the p value is to zero, the greater the likelihood that the difference between two samples reflects a real difference between the parent populations.
Although p values are not presented with percentage signs, you can think of them as representing a 0% to 100% probability that the observed difference is due to chance.
Example of p values: A p value of 0.001 (0.1%) means that only one time in a thousand would a difference as large as the one observed arise by chance alone. A p value of 0.05 (5%) means that such a difference would arise by chance 5 times in 100. In other words, the smaller the p value, the less plausible chance is as an explanation and the more confident researchers can be that the difference is real.
A p value of 0.05 or smaller is usually the criterion used to indicate whether a study's findings are statistically significant.
Traditionally, a p value of 0.05 (5%) or less is accepted as evidence that two populations are really different. A p value this small or smaller is taken to mean that the difference between the populations is statistically significant.
A p value of 0.05 represents a point on a continuum, not a scientific dividing line between "true" and "not true." A finding that has a p value of 0.06 (not statistically significant) could still reflect a true difference in the populations; a finding with a p value of 0.04 (statistically significant) could still be due to chance.
A scientist's conclusion that a difference between two samples reflects a true difference in the parent populations is as much a matter of judgment as of numbers.
Example of testing for chance using p values: In a clinical test of steroid injections for low back pain, doctors in Quebec City reported that 42% of 49 patients who received steroid injections experienced relief, compared to 33% of 48 patients who received injections of a harmless salt solution, a difference of 9 percentage points. Although it appears that steroids worked better than the placebo, a statistical test of the results yielded a p value of 0.50; that is, 50 times out of 100, samples of this size would be expected to show at least the observed difference in pain relief, even if there were no difference in effectiveness between the steroids and the placebo. In other words, the results were not statistically significant, so this study offers no support for the use of steroid injections in treating low back pain. It is still possible that steroids are effective in relieving pain but that the sample size was too small to demonstrate the effect.
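For those curious about the arithmetic behind such a result, the Python sketch below applies Fisher's exact test to counts reconstructed from the reported percentages (21 of 49 and 16 of 48). The original investigators may well have used a different test, so the p value here only illustrates the idea of a non-significant result rather than reproducing the reported 0.50 exactly.

```python
# Hedged illustration: the counts are reconstructed from the reported
# percentages (42% of 49, 33% of 48), and the published study may have used
# a different statistical test, so this p value is illustrative only.
from scipy.stats import fisher_exact

steroid = [21, 49 - 21]   # [relieved, not relieved] in the steroid group
placebo = [16, 48 - 16]   # [relieved, not relieved] in the saline group

odds_ratio, p_value = fisher_exact([steroid, placebo])
print(f"p = {p_value:.2f}")  # well above 0.05, so not statistically significant
```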
Confidence intervals provide a margin of error for the study's conclusions.
Estimates of risk based on limited data cannot perfectly reflect the real world; they almost certainly will be off by a little bit, and possibly by a great deal. Pollsters acknowledge this uncertainty by providing margins of error for their polls. Laboratory scientists and epidemiologists call their margins of error confidence intervals.
When calculating risk ratios (see page 53), scientists use their data to calculate margins of error as well. Most of these margins of error are reported as 95% confidence intervals, meaning that there is a 95% probability that the true risk lies between the low and high ends of the interval.
Example of testing for chance using confidence intervals: Chlorine in drinking water prevents epidemics of cholera and other water-borne illnesses, but a number of studies have suggested that chlorination also increases the risk of certain kinds of cancer. However, the studies' findings have been inconsistent, ranging from no added risk of bladder cancer to a doubling of the risk. A group of researchers recently combined the results of 10 studies and reported that drinking chlorinated water is associated with a 38% increased risk of bladder cancer. But the 95% confidence interval was 1.01 to 1.87. In other words, the "best guess" is that the increased risk is 38%, but it could be as low as 1% or as high as 87%. And there is a 5% chance (five times out of 100) that a 95% confidence interval does not include the true risk at all.
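The arithmetic behind such an interval follows a standard pattern: take the logarithm of the risk ratio, add and subtract about two standard errors, and convert back. The Python sketch below applies that formula to purely hypothetical counts chosen to produce a similar-looking result; it does not use the pooled data from the chlorination studies.

```python
# A minimal sketch of a 95% confidence interval for a risk ratio, using the
# standard log-risk-ratio formula. All counts are hypothetical and chosen
# only to yield a ratio and interval similar in shape to the example above.
import math

exposed_cases, exposed_total = 83, 1000      # hypothetical exposed group
unexposed_cases, unexposed_total = 60, 1000  # hypothetical unexposed group

risk_ratio = (exposed_cases / exposed_total) / (unexposed_cases / unexposed_total)

# standard error of the natural log of the risk ratio
se_log_rr = math.sqrt(
    1 / exposed_cases - 1 / exposed_total
    + 1 / unexposed_cases - 1 / unexposed_total
)

low = math.exp(math.log(risk_ratio) - 1.96 * se_log_rr)
high = math.exp(math.log(risk_ratio) + 1.96 * se_log_rr)
print(f"risk ratio {risk_ratio:.2f}, 95% CI {low:.2f} to {high:.2f}")
```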
Confidence intervals entirely below 1 indicate the suspect agent protects against disease; intervals that include 1 indicate no detectable effect; intervals entirely above 1 indicate an increased risk.
Since a risk ratio of 1 means no additional risk of disease from the exposure (see Causation Criteria), a confidence interval that includes 1 (for example, 0.7-1.3) means that the data is not good enough to determine whether there is an increase, decrease, or no change in risk.
If, however, the confidence interval includes only numbers higher than 1 (for example, 1.1-1.5), a higher risk is likely. In this case, the risk is likely increased by 10% to 50%.
On the other hand, if the confidence interval is entirely below 1 (for example, 0.6-0.9), this suggests that the risk to the exposed population is only 60% to 90% of what it would have been without the exposure.
Example of confidence intervals: The chart below summarizes the results of several studies that examined the effect of vitamin B taken during pregnancy on the incidence of spinal cord defects in the baby. All of the studies showed a risk ratio below 1. However, the confidence interval in Study 1 ranged above 1, showing that the uncertainty in that study was too great to support a conclusion. In contrast, the confidence intervals of the other three studies included only values below 1, indicating that vitamin B reduced the risk. Taking all the studies together, it can be concluded that vitamin B ingestion by pregnant women is likely to reduce the incidence of spinal cord defects in their offspring.
[Chart: Effect of B vitamins taken during pregnancy on spinal cord defects in the baby. Risk ratios (vertical lines) with 95% confidence intervals (horizontal lines) for each study.]
Confidence intervals provide more information than p values by indicating a range of risk associated with an exposure.
Confidence intervals are said to be wide if they contain a large range of values (e.g., 0.2 to 7.5), and narrow if they contain a small range of values (e.g., 2.5 to 3.0). The wider the interval, the less you can rely on the study.
In epidemiological studies, the width of the confidence interval is related to the sample size of the study. The larger the sample size, the narrower the confidence interval and the more precise the estimate of risk.
The width of the confidence interval is also related to the inherent variability of the factor being measured. The less the inherent variability, the narrower the confidence interval. (For example, body weight is highly variable, so a study of the effects of some factor on body weight would have a wider confidence interval than a similar study of the effects on head circumference, which is much less variable.)
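The effect of sample size on precision can be seen with the ordinary formula for a confidence interval around a proportion. In the Python sketch below, the 10% disease rate and the sample sizes are arbitrary; the point is simply that the interval narrows, roughly with the square root of the sample size, as more subjects are added.

```python
# Illustration of how confidence intervals narrow as sample size grows.
# The 10% rate and the sample sizes are arbitrary; the interval is the
# simple normal-approximation ("Wald") interval for a proportion.
import math

rate = 0.10
for n in (100, 1_000, 10_000):
    se = math.sqrt(rate * (1 - rate) / n)
    low, high = rate - 1.96 * se, rate + 1.96 * se
    print(f"n = {n:>6}: 95% CI {low:.3f} to {high:.3f} (width {high - low:.3f})")
```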
P values and confidence intervals both are attempts to express the uncertainty associated with scientific studies. Most epidemiologists prefer confidence intervals because they provide more information than p values about the range of risk associated with an exposure.
Risk managers usually report the upper confidence limit, which may not be the most likely estimate of risk.
In carcinogenesis bioassays, the results of administering a limited number of high doses to rodents are extrapolated downward to the lower doses that humans may encounter in the real environment. This process may cause considerable uncertainty about the risk at low doses. Confidence intervals describe the range of possible risk.
In practice, risk managers try to minimize the possibility of adverse impacts on human health. To build in a margin of safety, they generally report only the 95% upper confidence limit of the risk. This may erroneously give the impression that the 95% upper bound is the most likely estimate of the risk.
Example of using confidence intervals in carcinogenesis bioassays: There is concern about contaminants such as PCBs in fish, because they have been shown to cause cancer in laboratory animals. Risk managers wish to determine the risk of cancer for individuals who eat fish containing particular levels of PCBs. Extrapolating from laboratory animal studies, and calculating the 95% upper confidence limit, they may report that eating certain fish so many times a week will lead to an additional risk of cancer of 1 in 10,000. Because this risk number represents only the 95% upper confidence limit, it is likely that the true risk is much lower.
The larger the sample, the more information it contains.
At least two possible explanations exist for results that indicate the substance under study has no effect on disease: either the substance truly has no effect, or the study was too small to detect an effect that actually exists.
The probability of getting a statistically significant result (a p value of 0.05 or less) is closely tied to the size of the sample used in the study. The larger the sample size, the more information it contains, and the greater its ability to find an effect if one exists.
How large a sample is large enough? That depends chiefly on how rare the disease is and how large an effect the study is trying to detect.
The rarer the effect and/or the smaller the effect, the larger the sample needed to detect that effect. Small studies can reliably find only big risks.
A study's ability to find an effect is called its power. Formulas are available to calculate a study's power to find a specific-sized risk at different sample sizes. Typically, researchers would like a sample size large enough to have an 80% chance (80% power) of finding a doubling of the effect rate in the group exposed to the suspect agent.
(Ideally, researchers would like studies powerful enough to find any size risk. But such studies would be too expensive, take too long, and require too many participants to be feasible.)
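The sample-size arithmetic behind such decisions can be sketched with the standard formula for comparing two proportions. In the Python example below, the 1% baseline disease rate is an assumed, illustrative figure; the calculation shows roughly how many subjects per group would be needed for an 80% chance of detecting a doubling of that rate at the usual 0.05 significance level.

```python
# A minimal sketch of the standard sample-size formula for comparing two
# proportions: n per group = (z_alpha/2 + z_beta)^2 *
# (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2. The 1% baseline rate is assumed.
from scipy.stats import norm

p1 = 0.01               # assumed disease rate in the unexposed group
p2 = 0.02               # a doubling of that rate in the exposed group
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)   # about 1.96 for a two-sided 0.05 test
z_beta = norm.ppf(power)            # about 0.84 for 80% power

n_per_group = ((z_alpha + z_beta) ** 2
               * (p1 * (1 - p1) + p2 * (1 - p2))
               / (p1 - p2) ** 2)
print(f"about {round(n_per_group):,} subjects per group")  # roughly 2,300
```

Halving the baseline rate, or looking for a smaller increase than a doubling, drives the required number sharply upward, which is why studies of rare diseases or modest risks must be very large.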
Example of study size needed: Scientists believe that a high fat diet increases the risk of breast cancer. In order to have an 80% chance of detecting a 50% drop in breast cancer rates among women who halve their fat intake, researchers would need to enroll at least 30,000 women in a study and follow them for 10 years. Even then, there is a 20% chance (or 1 in 5) that the study would fail to detect the decline even if it occurred!
Even the findings from a powerful study can be wrong if researchers made mistakes in selecting or classifying subjects.
An epidemiological study can be very large, very powerful, and very wrong.
Imbalances in the way researchers choose people for a study or systematic mistakes in the way they classify people as sick or well, exposed or unexposed, can produce a false relationship between an exposure and a disease. This distortion is called bias.
Inevitably, some people in a study will be put in the wrong category of exposure or disease because of clerical errors, mistakes in the design or execution of the study, or an imperfect test for disease or exposure (no test is perfect). The question is: was the study designed or conducted in a way likely to produce a lot of these mistakes?
People's recollections are influenced by whether they are sick and looking for a cause, or healthy and unconcerned.
It is well-known that health status distorts memory of past events. People with disease are more likely to "remember" that they were exposed to a suspect substance whether they were or not; people who are healthy are more likely to forget past exposures. In a case-control study, this recall bias may produce a false association between an exposure and a disease.
Example of recall bias: Following widespread publicity about groundwater contamination in Santa Clara County, California, a case-control study found that women who recalled drinking tap water during their pregnancies had four times the risk of miscarriage as women who recalled drinking only bottled water. Later studies suggested that recall bias could account for much, if not all, of this association.
Sometimes, researchers will let their desire to see a certain result color their interpretation of the data.
If the study is not a blind study, problems may arise from investigators' and interviewers' unconscious desires to see what they want to see. Doctors who think an experimental drug is effective may be more likely to see improvement in patients taking the drug; interviewers who believe a chemical causes disease may question cases more closely than controls about possible exposures.
Example of a classification error: In 1986, researchers at the University of California generated great excitement with a report from a clinical trial. They reported that patients with Alzheimer's disease who took the experimental drug Tacrine showed dramatic improvement in their mental function compared to Alzheimer's patients who took an inactive placebo. The report was considered especially promising because the study was supposedly "double blind": neither researchers nor patients knew who was taking the real drug, so the researchers' evaluations of the patients' mental function could not be colored by their enthusiasm for the drug. A subsequent investigation by the Food and Drug Administration revealed that the researchers may indeed have known who was getting the real drug. In October 1992, a much larger, more rigorously controlled double blind study involving more than a dozen medical centers concluded that Tacrine did improve mental function, but so slightly that it was not noticeable to the evaluating doctors, and could be detected only by a battery of cognitive tests.
Selection bias, the tendency for study subjects to be different from the population, is a major concern in epidemiological studies.
A military spokesman reported in 1992 that mail to the Joint Chiefs of Staff was running 4 to 1 against admitting gay people to the military. Most people would recognize that this mail survey may not represent the entire American public, because people who strongly opposed gays in the military were probably more moved to volunteer their opinions at that time, when the military was proposing to allow gays in the ranks, than were people who had other opinions or no opinion.
Selection bias occurs whenever the method of choosing participants results in a study group that differs from the parent population in ways that affect the study's conclusions. A major concern in epidemiological studies, selection bias can take many forms.
Example of selection bias: Following publicity about fears of an increased risk of leukemia among soldiers who had been deliberately exposed to radiation during the Army's 1957 Smoky atomic bomb test, the federal Centers for Disease Control (CDC) conducted a study that did indeed find an association between leukemia and participation in Smoky. The CDC investigators tried to trace all the Smoky participants, but had better luck finding those who had developed cancer, since many of these ill soldiers contacted the CDC on their own initiative. This form of selection bias, called volunteer bias, could be responsible for the apparent association.
Selection is seldom perfect, so the degree of selection bias is a relevant question to ask about any epidemiological study.
To avoid selection bias, careful researchers take pains to assure that study subjects represent all who could have participated, and that comparison groups are selected the same way. But the selection method is seldom perfect. A major question in any epidemiological study is how much the results are skewed by selection bias.
All studies have some bias. The question is whether investigators took care to reduce bias as much as possible.
A good study will foresee and try to control bias. Researchers might, for example, keep interviewers and evaluators unaware of which subjects are exposed or treated, check reported exposures against records rather than relying on memory alone, and select comparison groups by the same method used to select the study group.
All studies have some bias. The question is whether investigators took care to reduce bias as much as possible, and whether the findings could be easily explained by bias.
Confounders are unmeasured characteristics that affect the study's outcome.
Every morning at 10:00 a woman walks to the bus stop. At 10:01, a bus drives up and the woman gets on. A naive observer might conclude that the woman was responsible for the bus's arrival. In fact, both the woman's and the bus driver's behavior are driven by a hidden, third factor: the bus schedule.
The bus schedule is a confounder, producing an apparent cause-effect relationship between the woman's arrival and the bus's appearance.
Confounders are common in the study of disease. Confounders can create the appearance of a cause-effect relationship where none exists, hide a relationship that does exist, or distort the apparent size of an effect.
Confounding occurs when a characteristic not considered by the researchers is in fact associated with both the disease and the suspected disease-causing agent.
Example of a confounder: In the 1970s, a group opposed to fluoridation of water in the U.S. reported a dramatic increase in cancer death rates in 10 U.S. cities that had switched to fluoridated water. But other investigators noted that the populations of these cities had changed dramatically in the same period, with growing proportions of elderly people and black people, groups at higher risk of cancer. After taking into account the confounding effects of age and race, the investigators found that these cities' cancer death rates actually had dropped since the introduction of fluoridated water.
Age, race, sex, income, and cigarette smoking are common confounders.
Age, sex, race, income, and cigarette smoking are among the most common confounders, because they affect the risk of many diseases and also are closely linked with other exposures that might be investigated as causes of disease. If a researcher does not at least take these factors into account when looking at causes of disease, the study is suspect.
Example of confounders: A comparison of cancer rates in two communities could be misleading if one community's residents are, on average, older, poorer, or heavier smokers than the other's; any of those differences, rather than the exposure under study, could account for a difference in cancer rates.
A major concern is whether researchers have overlooked a confounder and therefore have not controlled for it.
Epidemiologists use a number of techniques to try to remove the distorting effect of confounders. They can restrict study subjects to one age group, sex, or race. Or they can use statistical techniques that show the effect on disease of the agent of interest with all known confounding factors held constant. The problem with these techniques is that they can control only for confounders the investigators have identified. A major concern is whether the investigators have overlooked a confounder and have not controlled for it.
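One of those statistical techniques, stratification, can be illustrated with invented numbers. In the Python sketch below, an "exposed" city simply has an older population than an "unexposed" city. The crude death rates differ sharply, but within each age group the rates are identical, so the apparent effect of the exposure disappears once age is held constant. All figures are hypothetical.

```python
# Hedged illustration of confounding by age, using invented numbers.
# Within each age group the disease rate is identical in both cities, yet
# the crude (unstratified) rates differ because the exposed city is older.
rates = {"young": 1.0, "old": 10.0}   # deaths per 1,000 per year, same in both cities

populations = {
    "exposed city":   {"young": 70_000, "old": 30_000},  # older population
    "unexposed city": {"young": 90_000, "old": 10_000},
}

for city, groups in populations.items():
    deaths = sum(rates[age] * people / 1_000 for age, people in groups.items())
    total_people = sum(groups.values())
    print(f"{city}: crude rate {1_000 * deaths / total_people:.1f} per 1,000")

# Comparing like with like (stratifying by age) shows no difference at all:
for age, rate in rates.items():
    print(f"{age}: {rate:.1f} per 1,000 in both cities")
```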