Module Five
Reliability
Module 5 Notes 

 

 
    Overview  
     
      Assessment results provide only a limited measure of performance, obtained at a specific point in time. 
       
      Reliability refers to the consistency of measurement  

      The measurement results must be shown to be consistent over: 
       

      • Different occasions (testing situations) / times (repeat testing occurs within 10 days),  
      • Different raters (in performance assessments), or  
      • Different samples of the same performance / content domain  
       

      Demonstrating this consistency is what gives us confidence in the results.  

 
    Sources of Measurement Error  
     
        Time between the repeated assessments -- when the time between the first and second administration of the same test is relatively short, error may come from fluctuations in memory, attention, effort, fatigue, emotional strain, or guessing; when the time between administrations is relatively long, error may also come from intervening learning experiences, changes in health, forgetting, or less comparable testing conditions 

        Different Raters -- when less than perfect agreement occurs 

        Different Samples of a Particular Content Domain -- when one contains a greater number of tasks that are more familiar to the examinees than the other  


    Characteristics of Reliability 
     
      1. Reliability refers to the results (scores) of a test, not the test itself  
      2. An estimate of reliability refers to a particular type of consistency  
       
      • Consistency over time or  
      • Consistency among raters or 
      • Consistency of performance across tasks 

      3. Must have reliability to have validity 
       
      • A test can measure the same attribute with perfect consistency and not be valid. 

 
    Example 

    A test could be designed to measure knowledge of Greek letters and, as such, obtain high reliability.  
    If the results of this test were then used to determine a person's IQ, the test would be invalid for that purpose but would remain reliable. If this same test were not reliable, then even as a measure of knowledge of Greek letters it would not be valid, because there would be no confidence in the test scores and therefore no certainty about a person's knowledge of Greek letters.

 
         
      4. Reliability is primarily statistical 
     
      When estimating reliability, we are trying to determine how much measurement error is present; the less error, the more reliable the instrument. 
 
    NOTE: Reliability, as will be shown later, is calculated from test results and is dependent on the group of people taking the test. Therefore, reliability is always estimated; we cannot know definitively what the reliability is. It may change, and probably will, when a new group of people takes the test. 
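
      To make point 4 concrete, here is a minimal Python sketch (not part of the original notes) of the classical-test-theory idea that reliability is the proportion of observed-score variance that is not measurement error. The true scores, the size of the error, and the group of examinees are all simulated assumptions, so the numbers are illustrative only.

      # Minimal sketch: reliability as the share of score variance that is
      # not measurement error.  All values here are simulated assumptions.
      import numpy as np

      rng = np.random.default_rng(0)

      n_students = 200
      true_sd, error_sd = 10, 5          # assumed spread of true scores and of measurement error
      true_scores = rng.normal(loc=70, scale=true_sd, size=n_students)

      # Two administrations of the "same" test: same true score, independent error.
      form_1 = true_scores + rng.normal(0, error_sd, n_students)
      form_2 = true_scores + rng.normal(0, error_sd, n_students)

      # Theoretical reliability = true-score variance / observed-score variance.
      theoretical = true_sd**2 / (true_sd**2 + error_sd**2)

      # Empirical estimate for THIS group: correlation between the two administrations.
      estimated = np.corrcoef(form_1, form_2)[0, 1]

      print(f"theoretical reliability ~ {theoretical:.2f}")
      print(f"estimate from this group ~ {estimated:.2f}")

      Re-running the sketch with a different random seed (a different "group") yields a slightly different estimate, which is exactly why reliability is always an estimate rather than a fixed property of the test.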
 
    Methods of Testing Reliability 
      
      1. Test-Retest -- no time lapse; estimate of stability 

      2. Test-Retest -- with time lapse; estimate of stability 

      3. Parallel forms --  estimate of equivalence 

      4. Test-Retest with Parallel Forms --  estimate of stability and equivalence 

      5. Split-Half -- estimate of internal consistency; used when the content on the test is heterogeneous (a computational sketch of estimates 5-7 follows this list) 

      6. KR-20 or Coefficient Alpha -- KR-20 is used for items scored right or wrong; Alpha is used when items might receive partial credit; estimate of internal consistency; used when the content on the test is homogeneous 

      7. Inter-Rater Reliability -- estimate of consistency across raters 
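
      The sketch below is a hedged illustration (not a procedure prescribed by these notes) of how three of the estimates above can be computed: KR-20 / coefficient alpha, an odd-even split-half correlation stepped up with the Spearman-Brown correction, and a simple percent-agreement index for inter-rater consistency. The item responses and ratings are invented for the example.

      # Hedged sketch of three reliability estimates, using invented data.
      import numpy as np

      # scores[person, item]: 1 = right, 0 = wrong (suitable for KR-20 / alpha)
      scores = np.array([
          [1, 1, 1, 0, 1, 1],
          [1, 0, 1, 1, 1, 0],
          [0, 0, 1, 0, 0, 1],
          [1, 1, 1, 1, 1, 1],
          [0, 1, 0, 0, 1, 0],
          [1, 1, 0, 1, 1, 1],
      ])

      def coefficient_alpha(x):
          """Cronbach's alpha; equals KR-20 when items are scored 0/1."""
          k = x.shape[1]
          item_vars = x.var(axis=0, ddof=1).sum()
          total_var = x.sum(axis=1).var(ddof=1)
          return (k / (k - 1)) * (1 - item_vars / total_var)

      def split_half(x):
          """Odd-even split-half correlation, stepped up with Spearman-Brown."""
          odd = x[:, 0::2].sum(axis=1)
          even = x[:, 1::2].sum(axis=1)
          r = np.corrcoef(odd, even)[0, 1]
          return 2 * r / (1 + r)

      def percent_agreement(rater_a, rater_b):
          """Simplest inter-rater index: proportion of identical ratings."""
          rater_a, rater_b = np.asarray(rater_a), np.asarray(rater_b)
          return (rater_a == rater_b).mean()

      print("KR-20 / alpha:", round(coefficient_alpha(scores), 2))
      print("split-half (Spearman-Brown):", round(split_half(scores), 2))
      print("inter-rater agreement:", percent_agreement([3, 2, 4, 4], [3, 3, 4, 4]))

      Coefficient alpha reduces to KR-20 when every item is scored 0 or 1, which is why a single function serves for both estimates here.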

       
        Reliability estimates range from 0 to 1. There are no negative reliability estimates. 
       
        How high should reliability be? It depends on how much error we are willing to tolerate.  

      Assessments used to classify students for special education placement, or for other sorting procedures in which the student's future is seriously affected, are called high-stakes assessments. For high-stakes assessments we tolerate very little error, so reliability estimates for these tests should be .9 or higher. 

      The assessments produced by test publishers that measure personality, social adjustment, vocational interests, aptitude, and achievement are usually classified as medium-stakes tests, and we will tolerate more error. Reliability estimates for these assessments should be .8 or higher. 
      Classroom assessments are generally considered low-stakes, because no single classroom test will determine retention or promotion for a course or grade. These tests need to have reliability estimates of .5 or better. 

        Any reliability estimate below .5 is evidence that the test's results cannot be trusted and should not be used for any purpose. 
       

        Remember: Validity coefficients, which are correlation coefficients, were acceptable as low as .3. Reliability coefficients below .5, however, cannot be tolerated. 

     

    Factors Influencing Reliability Measures  
     
      1. The content being measured; the more heterogeneous the content the lower the reliability will be 
      2. The spread of test scores; the narrower the spread, the lower the reliability will be 
      3. The objectivity of the scoring procedures; the more subjective the scoring procedures the lower the reliability will be 
      4. The time span between measures (tests); when a test is readministered to the same group, the longer the time span, the lower the reliability will be  
      5. The level of difficulty of tasks / items; when all items on one test are easy or all are hard or all are of moderate difficulty, the lower the reliability 
      6. The ability of the students being measured (tested); if they are all of the same ability (MH, gifted, or average), the lower the reliability 
      7. The number of tasks / items; the fewer the number, the lower the reliability (see the sketch after this list) 
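
      Factor 7 can be quantified with the Spearman-Brown prophecy formula. The short sketch below is an illustration added alongside these notes, not part of the original text; the starting reliability of .50 and the length factors are assumed example values.

      # Spearman-Brown prophecy: predicted reliability when a test is made
      # length_factor times as long with items of comparable quality.
      def spearman_brown(current_reliability: float, length_factor: float) -> float:
          r = current_reliability
          return (length_factor * r) / (1 + (length_factor - 1) * r)

      # An assumed classroom test with reliability .50 ...
      print(spearman_brown(0.50, 2))    # doubled in length  -> about .67
      print(spearman_brown(0.50, 3))    # tripled in length  -> .75
      print(spearman_brown(0.50, 0.5))  # halved in length   -> about .33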
       

 
    How to Increase Reliability 
     
      1. Make content homogeneous 
      2. Create a larger spread of test scores 
      3. Make scoring procedures as objective as possible 
      4. Shorten the time span between readministrations of the same test (less than 10 days) 
      5. Create items that vary in difficulty  
      6. Create a class that is heterogeneous in terms of ability 
      7. Create a large number of quality items 
       
Readings 
 
     
    Chapter 4: Reliability and Other Desired Characteristics 

    from Linn, R. L., & Gronlund, N. E. (1995). Measurement and assessment in teaching. Englewood Cliffs, NJ: Merrill.

 
 
 
Last updated January 1999 by CF&MD staff.  
Copyright 1999 Hewitt-Gervais & Baylen 
All rights reserved.
 
  
Florida Gulf Coast University  
College of Professional Studies  
School of Education