Chapter 5
The results of a single study are seldom definitive because of the inevitable presence of chance and the possibility of improper study design.
The perfect study has never been done. Each study is performed at a distinct time and place, with a unique group of subjects, by investigators using a particular study design. A variety of methods is used to carry out the study and to analyze its results. Chance always plays a part as well.
This is why different studies sometimes produce contradictory results. This is also why researchers almost never accept the results of a single study as definitive; the cumulative weight of evidence from several studies is necessary to draw sound conclusions.
How well a study was performed can be assessed, at least in part, by examining how the researchers sought to prevent and account for the common causes of distortion.
In laboratory studies, the most common cause of distortion is chance, including the effects of sample size. For epidemiological studies, bias and confounding, as well as chance, are important factors to consider.
Observed differences between groups of experimental animals or between populations of humans may be due to chance rather than due to real differences.
All experimental and epidemiological studies are based on samples from larger populations. An epidemiological study that compares the incidence of lung cancer among smokers and nonsmokers, for example, uses a group of smokers selected from the population of all smokers, and a group of nonsmokers selected from the population of all nonsmokers.
The use of samples inevitably introduces uncertainty into the results. Any sample will, by chance, differ at least a little from its parent population. The smaller the sample and the more diverse the parent population, the more likely the sample will differ from the parent population.
In any study comparing two groups, at least some of the difference between groups is due to this sampling effect.
Scientists use mathematical tests, based on the science of statistics, to estimate the size of the sampling effect and, therefore, the amount of uncertainty associated with the study's findings. The results of these statistical tests are commonly expressed as p values and confidence intervals (see following text).
Example of chance: Assume two populations each have a 10% incidence of a particular disease. A random sample from one could, by chance, end up with 15% sick people and a sample from the other, 9% sick people. To an observer, it might appear that one population had a higher incidence of disease, while no difference actually existed.
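The sampling effect is easy to demonstrate. The short Python sketch below draws repeated samples of 100 people from two hypothetical populations that each have exactly a 10% disease rate; the samples routinely differ from each other, and from the true 10%, purely by chance.

```python
# A minimal sketch of sampling variation. Both populations have exactly a
# 10% disease rate, yet individual samples differ by chance alone. The
# sample size and number of trials are arbitrary illustrative choices.
import random

random.seed(1)
TRUE_RATE = 0.10
SAMPLE_SIZE = 100  # with 100 people per sample, the count equals the percentage

for trial in range(1, 6):
    sample_a = sum(random.random() < TRUE_RATE for _ in range(SAMPLE_SIZE))
    sample_b = sum(random.random() < TRUE_RATE for _ in range(SAMPLE_SIZE))
    print(f"Trial {trial}: sample from A has {sample_a}% sick, "
          f"sample from B has {sample_b}% sick")
```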
P values indicate the probability that observed differences are due to chance rather than reflecting true differences.
A p value (probability value) shows the probability that the differences observed between two samples are due to chance variation in the samples rather than true differences in the parent populations.
P values range from 0 to 1. The closer the p value is to zero, the greater the likelihood that the difference between two samples reflects a real difference between the parent populations.
Although p values are not presented with percentage signs, you can think of them as representing a 0% to 100% probability that the observed difference is due to chance.
Example of p values: A p value of 0.001 (0.1%) means that only one time in a thousand would a difference as large as the one observed arise by chance alone. A p value of 0.05 (5%) means that such a difference would arise by chance 5 times in 100. In other words, the smaller the p value, the less plausible chance is as an explanation and the more confident researchers can be that the difference is real.
A p value of 0.05 or smaller is usually the criterion used to indicate whether a study's findings are statistically significant.
Traditionally, a p value of 0.05 (5%) or less is accepted as evidence that two populations are really different. A p value this small or smaller is taken to mean that the difference between the populations is statistically significant.
A p value of 0.05 represents a point on a continuum, not a scientific dividing line between "true" and "not true." A finding that has a p value of 0.06 (not statistically significant) could still reflect a true difference in the populations; a finding with a p value of 0.04 (statistically significant) could still be due to chance.
A scientist's conclusion that a difference between two samples reflects a true difference in the parent populations is as much a matter of judgment as of numbers.
Example of testing for chance using p values: In a clinical test of steroid injections for low back pain, doctors in Quebec City reported that 42% of 49 patients who received steroid injections experienced relief, compared to 33% of 48 patients who received injections of a harmless salt solution, a difference of 9 percentage points. Although it appears that steroids worked better than the placebo, a statistical test of the results yielded a p value of 0.50; that is, 50 times out of 100, samples of this size would be expected to show at least the observed difference in pain relief, even if there were no difference in effectiveness between the steroids and the placebo. In other words, the results were not statistically significant, so this study offers no support for the use of steroid injections in treating low back pain. It is still possible that steroids are effective in relieving pain but that the sample size was too small to demonstrate the effect.
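For those curious about the arithmetic behind such a result, the Python sketch below applies Fisher's exact test to counts reconstructed from the reported percentages (21 of 49 and 16 of 48). The original investigators may well have used a different test, so the p value here only illustrates the idea of a non-significant result rather than reproducing the reported 0.50 exactly.

```python
# Hedged illustration: the counts are reconstructed from the reported
# percentages (42% of 49, 33% of 48), and the published study may have used
# a different statistical test, so this p value is illustrative only.
from scipy.stats import fisher_exact

steroid = [21, 49 - 21]   # [relieved, not relieved] in the steroid group
placebo = [16, 48 - 16]   # [relieved, not relieved] in the saline group

odds_ratio, p_value = fisher_exact([steroid, placebo])
print(f"p = {p_value:.2f}")  # well above 0.05, so not statistically significant
```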
Confidence intervals provide a margin of error for the study's conclusions.
Estimates of risk based on limited data cannot perfectly reflect the real world; they almost certainly will be off by a little bit, and possibly by a great deal. Pollsters acknowledge this uncertainty by providing margins of error for their polls. Laboratory scientists and epidemiologists call their margins of error confidence intervals.
When calculating risk ratios (see page 53), scientists use their data to calculate margins of error as well. Most of these margins of error are reported as 95% confidence intervals, meaning that there is a 95% probability that the true risk lies between the low and high ends of the interval.
Example of testing for chance using confidence intervals: Chlorine in drinking water prevents epidemics of cholera and other water-borne illnesses, but a number of studies have suggested that chlorination also increases the risk of certain kinds of cancer. However, the studies' findings have been inconsistent, ranging from no added risk of bladder cancer to a doubling of the risk. A group of researchers recently combined the results of 10 studies and reported that drinking chlorinated water is associated with a 38% increased risk of bladder cancer. But the 95% confidence interval was 1.01 to 1.87. In other words, the "best guess" is that the increased risk is 38%, but it could be as low as 1% or as high as 87%. And there is a 5% chance (five times out of 100) that a 95% confidence interval does not include the true risk at all.
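The arithmetic behind such an interval follows a standard pattern: take the logarithm of the risk ratio, add and subtract about two standard errors, and convert back. The Python sketch below applies that formula to purely hypothetical counts chosen to produce a similar-looking result; it does not use the pooled data from the chlorination studies.

```python
# A minimal sketch of a 95% confidence interval for a risk ratio, using the
# standard log-risk-ratio formula. All counts are hypothetical and chosen
# only to yield a ratio and interval similar in shape to the example above.
import math

exposed_cases, exposed_total = 83, 1000      # hypothetical exposed group
unexposed_cases, unexposed_total = 60, 1000  # hypothetical unexposed group

risk_ratio = (exposed_cases / exposed_total) / (unexposed_cases / unexposed_total)

# standard error of the natural log of the risk ratio
se_log_rr = math.sqrt(
    1 / exposed_cases - 1 / exposed_total
    + 1 / unexposed_cases - 1 / unexposed_total
)

low = math.exp(math.log(risk_ratio) - 1.96 * se_log_rr)
high = math.exp(math.log(risk_ratio) + 1.96 * se_log_rr)
print(f"risk ratio {risk_ratio:.2f}, 95% CI {low:.2f} to {high:.2f}")
```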
Confidence intervals entirely below 1 indicate the suspect agent protects against disease; intervals that include 1 indicate no detectable effect; intervals entirely above 1 indicate an increased risk.
Since a risk ratio of 1 means no additional risk of disease from the exposure (see Causation Criteria), a confidence interval that includes 1 (for example, 0.7-1.3) means that the data is not good enough to determine whether there is an increase, decrease, or no change in risk.
If, however, the confidence interval includes only numbers higher than 1 (for example, 1.1-1.5), a higher risk is likely. In this case, the risk is likely increased by 10% to 50%.
On the other hand, if the confidence interval is entirely below 1 (for example, 0.6-0.9), this suggests that the risk to the exposed population is only 60% to 90% of what it would have been without the exposure.
Example of confidence intervals: The chart below summarizes the results of several studies that examined the effect of vitamin B taken during pregnancy on the incidence of spinal cord defects in the baby. All of the studies showed a risk ratio below 1. However, the confidence interval in Study 1 ranged above 1, showing that the uncertainty in that study was too great to support a conclusion. In contrast, the confidence intervals of the other three studies included only values below 1, indicating that vitamin B reduced the risk. Taking all the studies together, it can be concluded that vitamin B ingestion by pregnant women is likely to reduce the incidence of spinal cord defects in their offspring.
[Chart: Effect of B vitamins taken during pregnancy on spinal cord defects in the baby. Risk ratios (vertical lines) with 95% confidence intervals (horizontal lines) for each study.]
Confidence intervals provide more information than p values by indicating a range of risk associated with an exposure.
Confidence intervals are said to be wide if they contain a large range of values (e.g., 0.2 to 7.5), and narrow if they contain a small range of values (e.g., 2.5 to 3.0). The wider the interval, the less you can rely on the study.
In epidemiological studies, the width of the confidence interval is related to the sample size of the study. The larger the sample size, the narrower the confidence interval and the more precise the estimate of risk.
The width of the confidence interval is also related to the inherent variability of the factor being measured. The less the inherent variability, the narrower the confidence interval. (For example, body weight is highly variable, so a study of the effects of some factor on body weight would have a wider confidence interval than a similar study of the effects on head circumference, which is much less variable.)
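The effect of sample size on precision can be seen with the ordinary formula for a confidence interval around a proportion. In the Python sketch below, the 10% disease rate and the sample sizes are arbitrary; the point is simply that the interval narrows, roughly with the square root of the sample size, as more subjects are added.

```python
# Illustration of how confidence intervals narrow as sample size grows.
# The 10% rate and the sample sizes are arbitrary; the interval is the
# simple normal-approximation ("Wald") interval for a proportion.
import math

rate = 0.10
for n in (100, 1_000, 10_000):
    se = math.sqrt(rate * (1 - rate) / n)
    low, high = rate - 1.96 * se, rate + 1.96 * se
    print(f"n = {n:>6}: 95% CI {low:.3f} to {high:.3f} (width {high - low:.3f})")
```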
P values and confidence intervals both are attempts to express the uncertainty associated with scientific studies. Most epidemiologists prefer confidence intervals because they provide more information than p values about the range of risk associated with an exposure.
Risk managers usually report the upper confidence limit, which may not be the most likely estimate of risk.
In carcinogenesis bioassays, the results of administering a limited number of high doses to rodents are extrapolated downward to the lower doses that humans may encounter in the real environment. This process may cause considerable uncertainty about the risk at low doses. Confidence intervals describe the range of possible risk.
In practice, risk managers try to minimize the possibility of adverse impacts on human health. To build in a margin of safety, they generally report only the 95% upper confidence limit of the risk. This may erroneously give the impression that the 95% upper bound is the most likely estimate of the risk.
Example of using confidence intervals in carcinogenesis bioassays: There is concern about contaminants such as PCBs in fish, because they have been shown to cause cancer in laboratory animals. Risk managers wish to determine the risk of cancer for individuals who eat fish containing particular levels of PCBs. Extrapolating from laboratory animal studies, and calculating the 95% upper confidence limit, they may report that eating certain fish so many times a week will lead to an additional risk of cancer of 1 in 10,000. Because this risk number represents only the 95% upper confidence limit, it is likely that the true risk is much lower.
The larger the sample, the more information it contains.
At least two possible explanations exist for results that indicate the substance under study has no effect on disease: either the substance truly has no effect, or the study was too small to detect an effect that actually exists.
The probability of getting a statistically significant result (a p value of 0.05 or less) is closely tied to the size of the sample used in the study. The larger the sample size, the more information it contains, and the greater its ability to find an effect if one exists.
How large a sample is large enough? That depends chiefly on how rare the disease is and how large an effect the study is trying to detect.
The rarer the effect and/or the smaller the effect, the larger the sample needed to detect that effect. Small studies can reliably find only big risks.
A study's ability to find an effect is called its power. Formulas are available to calculate a study's power to find a specific-sized risk at different sample sizes. Typically, researchers would like a sample size large enough to have an 80% chance (80% power) of finding a doubling of the effect rate in the group exposed to the suspect agent.
(Ideally, researchers would like studies powerful enough to find any size risk. But such studies would be too expensive, take too long, and require too many participants to be feasible.)
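The sample-size arithmetic behind such decisions can be sketched with the standard formula for comparing two proportions. In the Python example below, the 1% baseline disease rate is an assumed, illustrative figure; the calculation shows roughly how many subjects per group would be needed for an 80% chance of detecting a doubling of that rate at the usual 0.05 significance level.

```python
# A minimal sketch of the standard sample-size formula for comparing two
# proportions: n per group = (z_alpha/2 + z_beta)^2 *
# (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2. The 1% baseline rate is assumed.
from scipy.stats import norm

p1 = 0.01               # assumed disease rate in the unexposed group
p2 = 0.02               # a doubling of that rate in the exposed group
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)   # about 1.96 for a two-sided 0.05 test
z_beta = norm.ppf(power)            # about 0.84 for 80% power

n_per_group = ((z_alpha + z_beta) ** 2
               * (p1 * (1 - p1) + p2 * (1 - p2))
               / (p1 - p2) ** 2)
print(f"about {round(n_per_group):,} subjects per group")  # roughly 2,300
```

Halving the baseline rate, or looking for a smaller increase than a doubling, drives the required number sharply upward, which is why studies of rare diseases or modest risks must be very large.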
Example of study size needed: Scientists believe that a high fat diet increases the risk of breast cancer. In order to have an 80% chance of detecting a 50% drop in breast cancer rates among women who halve their fat intake, researchers would need to enroll at least 30,000 women in a study and follow them for 10 years. Even then, there is a 20% chance (or 1 in 5) that the study would fail to detect the decline even if it occurred!
Even the findings from a powerful study can be wrong if researchers made mistakes in selecting or classifying subjects.
An epidemiological study can be very large, very powerful, and very wrong.
Imbalances in the way researchers choose people for a study or systematic mistakes in the way they classify people as sick or well, exposed or unexposed, can produce a false relationship between an exposure and a disease. This distortion is called bias.
Inevitably, some people in a study will be put in the wrong category of exposure or disease because of clerical errors, mistakes in the design or execution of the study, or an imperfect test for disease or exposure (no test is perfect). The question is: was the study designed or conducted in a way likely to produce a lot of these mistakes?
People's recollections are influenced by whether they are sick and looking for a cause, or healthy and unconcerned.
It is well-known that health status distorts memory of past events. People with disease are more likely to "remember" that they were exposed to a suspect substance whether they were or not; people who are healthy are more likely to forget past exposures. In a case-control study, this recall bias may produce a false association between an exposure and a disease.
Example of recall bias: Following widespread publicity about groundwater contamination in Santa Clara County, California, a case-control study found that women who recalled drinking tap water during their pregnancies had four times the risk of miscarriage as women who recalled drinking only bottled water. Later studies suggested that recall bias could account for much, if not all, of this association.
Sometimes, researchers will let their desire to see a certain result color their interpretation of the data.
If the study is not a blind study, problems may arise from investigators' and interviewers' unconscious desires to see what they want to see. Doctors who think an experimental drug is effective may be more likely to see improvement in patients taking the drug; interviewers who believe a chemical causes disease may question cases more closely than controls about possible exposures.
Example of a classification error: In 1986, researchers at the University of California generated great excitement with a report from a clinical trial. They reported that patients with Alzheimer's disease who took the experimental drug Tacrine showed dramatic improvement in their mental function compared to Alzheimer's patients who took an inactive placebo. The report was considered especially promising because the study was supposedly "double blind": neither researchers nor patients knew who was taking the real drug, so the researchers' evaluations of the patients' mental function could not be colored by their enthusiasm for the drug. A subsequent investigation by the Food and Drug Administration revealed that the researchers may indeed have known who was getting the real drug. In October 1992, a much larger, more rigorously controlled double blind study involving more than a dozen medical centers concluded that Tacrine did improve mental function, but so slightly that it was not noticeable to the evaluating doctors, and could be detected only by a battery of cognitive tests.
Selection bias, the tendency for study subjects to be different from the population, is a major concern in epidemiological studies.
A military spokesman reported in 1992 that mail to the Joint Chiefs of Staff was running 4 to 1 against admitting gay people to the military. Most people would recognize that this mail survey may not represent the entire American public, because people who strongly opposed gays in the military were probably more moved to volunteer their opinions at that time, when the military was proposing to allow gays in the ranks, than were people who had other opinions or no opinion.
Selection bias occurs whenever the method of choosing participants results in a study group that differs from the parent population in ways that affect the study's conclusions. A major concern in epidemiological studies, selection bias can take many forms.
Example of selection bias: Following publicity about fears of an increased risk of leukemia among soldiers who had been deliberately exposed to radiation during the Army's 1957 Smoky atomic bomb test, the federal Centers for Disease Control (CDC) conducted a study that did indeed find an association between leukemia and participation in Smoky. The CDC investigators tried to trace all the Smoky participants, but had better luck finding those who had developed cancer, since many of these ill soldiers contacted the CDC on their own initiative. This form of selection bias, called volunteer bias, could be responsible for the apparent association.
Selection is seldom perfect, so the degree of selection bias is a relevant question to ask about any epidemiological study.
To avoid selection bias, careful researchers take pains to assure that study subjects represent all who could have participated, and that comparison groups are selected the same way. But the selection method is seldom perfect. A major question in any epidemiological study is how much the results are skewed by selection bias.
All studies have some bias. The question is whether investigators took care to reduce bias as much as possible.
A good study will foresee and try to control bias. Researchers might, for example, keep interviewers and evaluators unaware of which subjects are exposed or treated, check reported exposures against records rather than relying on memory alone, and select comparison groups by the same method used to select the study group.
All studies have some bias. The question is whether investigators took care to reduce bias as much as possible, and whether the findings could be easily explained by bias.
Confounders are unmeasured characteristics that affect the study's outcome.
Every morning at 10:00 a woman walks to the bus stop. At 10:01, a bus drives up and the woman gets on. A naive observer might conclude that the woman was responsible for the bus's arrival. In fact, both the woman's and the bus driver's behavior are driven by a hidden, third factor: the bus schedule.
The bus schedule is a confounder, producing an apparent cause-effect relationship between the woman's arrival and the bus's appearance.
Confounders are common in the study of disease. Confounders can create the appearance of a cause-effect relationship where none exists, hide a relationship that does exist, or distort the apparent size of an effect.
Confounding occurs when a characteristic not considered by the researchers is in fact associated with both the disease and the suspected disease-causing agent.
Example of a confounder: In the 1970s, a group opposed to fluoridation of water in the U.S. reported a dramatic increase in cancer death rates in 10 U.S. cities that had switched to fluoridated water. But other investigators noted that the populations of these cities had changed dramatically in the same period, with growing proportions of elderly people and black people, groups at higher risk of cancer. After taking into account the confounding effects of age and race, the investigators found that these cities' cancer death rates actually had dropped since the introduction of fluoridated water.
Age, race, sex, income, and cigarette smoking are common confounders.
Age, sex, race, income, and cigarette smoking are among the most common confounders, because they affect the risk of many diseases and also are closely linked with other exposures that might be investigated as causes of disease. If a researcher does not at least take these factors into account when looking at causes of disease, the study is suspect.
Example of confounders: A comparison of cancer rates in two communities could be misleading if one community's residents are, on average, older, poorer, or heavier smokers than the other's; any of those differences, rather than the exposure under study, could account for a difference in cancer rates.
A major concern is whether researchers have overlooked a confounder and therefore have not controlled for it.
Epidemiologists use a number of techniques to try to remove the distorting effect of confounders. They can restrict study subjects to one age group, sex, or race. Or they can use statistical techniques that show the effect on disease of the agent of interest with all known confounding factors held constant. The problem with these techniques is that they can control only for confounders the investigators have identified. A major concern is whether the investigators have overlooked a confounder and have not controlled for it.
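One of those statistical techniques, stratification, can be illustrated with invented numbers. In the Python sketch below, an "exposed" city simply has an older population than an "unexposed" city. The crude death rates differ sharply, but within each age group the rates are identical, so the apparent effect of the exposure disappears once age is held constant. All figures are hypothetical.

```python
# Hedged illustration of confounding by age, using invented numbers.
# Within each age group the disease rate is identical in both cities, yet
# the crude (unstratified) rates differ because the exposed city is older.
rates = {"young": 1.0, "old": 10.0}   # deaths per 1,000 per year, same in both cities

populations = {
    "exposed city":   {"young": 70_000, "old": 30_000},  # older population
    "unexposed city": {"young": 90_000, "old": 10_000},
}

for city, groups in populations.items():
    deaths = sum(rates[age] * people / 1_000 for age, people in groups.items())
    total_people = sum(groups.values())
    print(f"{city}: crude rate {1_000 * deaths / total_people:.1f} per 1,000")

# Comparing like with like (stratifying by age) shows no difference at all:
for age, rate in rates.items():
    print(f"{age}: {rate:.1f} per 1,000 in both cities")
```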