Module 3 Notes
Just a Reminder!
This may be one of the more difficult
modules for you. Do not give up! You will learn this, and the next two
modules will not be nearly as difficult.
In selecting a measurement instrument,
the following are considerations:
Validity:
Does the instrument measure what we think it measures?
Reliability:
Does the instrument measure the same way each time it is used?
Usability:
Is the instrument economical (in money and time) and easy to use (ease of
administration, scoring, interpretation, reporting, application)?
NOTE:
We will cover reliability and usability in future modules.
Validity is the adequacy and appropriateness
of the interpretations made from assessments
-
To what extent will the interpretation
of the scores be appropriate, meaningful, and useful for the intended application
of the results?
-
How well does it fulfill the function for
which it is being used?
-
What are the consequences of the particular
uses and interpretations that are made of the results?
We want evidence that the scores actually
reflect whatever we expect them to measure.
Nature
of Validity
Nature of Validity: Appropriateness
of the interpretation of the results
1. Validity is a matter of degree.
We judge the validity to be high, moderate, or low.
2. Validity is specific to some particular
use or interpretation - we can only answer the question of validity in relation
to a given specific task for a given population of examinees.
3. Validity is a unitary concept - the different kinds of evidence described
below all contribute to a single judgment about validity.
4. Validity is an overall evaluative
judgment.
Types
of Validity
There are five major types of validity.
Your text calls them "Major Considerations in Assessment Validation". These
five types are: content, test-criterion, construct, consequence, and face.
1. Content Validity
-
Performance on a "universe" of items
-
First, you create or locate your list of
behavioral objectives.
-
Second, you set the test next to the objectives
and begin matching them: this test question measures this objective, and
so forth. If the test items can all be linked to a behavioral objective,
this is the first evidence that the validity is high. If there are test
items that cannot be linked to behavioral objectives on your list, or if
there are objectives without a test item matched to them, then the validity
may be moderate or low.
-
Third, count the number of items measuring
each objective. The objectives of greatest importance should have more
items than those of lesser importance. Another rule of thumb: the minimum
number of test items is three per objective. If each objective is measured
by at least three test items, then this is further evidence that the validity is
high.
Questions to Ask:
a) Does the test content parallel the
curricular objectives in content and process?
b) Are the test and curricular emphases
in proper balance?
c) Is the test free from prerequisites
that are irrelevant or incidental to the present measurement task?
d) Is there a logical process linking
Curriculum, Instruction, and Assessment?
Best Evidence -- the Table of Specifications.
This is what you are in the process of creating. You eventually will have
a list of objectives, the level of Bloom's taxonomy addressed by each objective,
the form and type of assessment, the importance of each objective, and
the items on the assessment that measure each objective.
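As a rough illustration of the item-counting step above, here is a minimal sketch in Python. The objective names, the item-to-objective links, and the printout are invented for the example; a real table of specifications would come from your own objectives and assessment.
```python
# Minimal sketch: counting test items per objective, with hypothetical data.
# Objective names and item-to-objective links are invented for illustration.

item_to_objective = {
    1: "define validity", 2: "define validity", 3: "define validity",
    4: "interpret a correlation", 5: "interpret a correlation",
    6: "interpret a correlation", 7: "interpret a correlation",
    8: "build a table of specifications",
}

objectives = {"define validity", "interpret a correlation",
              "build a table of specifications"}

counts = {obj: 0 for obj in objectives}
for obj in item_to_objective.values():
    if obj in counts:
        counts[obj] += 1

for obj, n in counts.items():
    # rule of thumb from the notes: at least three items per objective
    flag = "OK" if n >= 3 else "fewer than 3 items"
    print(f"{obj}: {n} item(s) -- {flag}")
```
In this made-up case, the third objective has only one item, which would count against the content validity of the test.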
2.
Test-Criterion Validity
-
Performance on some criterion
-
First, you find a test that measures the
same thing that your test measures.
-
Second, you give both tests to the same
group or similar groups of people.
-
Third, you correlate the scores on the two tests. This will
give you the concurrent test-criterion validity.
If instead, you hypothesize that your
test will predict future performance at some related activity:
-
First, you identify what behavior / performance
your test predicts (e.g., success in undergraduate programs).
-
Second, you give your test to a group of
people (in this example, high school students).
-
Third, after this group has completed their
first year of college, correlate the test scores with their grades. This
will give you the predictive test-criterion validity.
Best Evidence -- the correlation coefficient
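To make the "correlate the scores" step concrete, here is a minimal sketch in Python. The score values are invented for illustration; in practice you would use your own test scores and the criterion scores (for example, first-year grades).
```python
# Minimal sketch: computing a test-criterion correlation with hypothetical scores.
from statistics import correlation  # Python 3.10+

new_test  = [55, 62, 70, 74, 81, 90]          # scores on your test (invented)
criterion = [2.1, 2.4, 2.9, 3.0, 3.4, 3.8]    # e.g., first-year GPA (invented)

r = correlation(new_test, criterion)  # Pearson correlation coefficient
print(f"test-criterion correlation: r = {r:.2f}")
```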
Go to Exercise 1
What does a correlation look like?
Correlation coefficients range from
+1 to -1. The sign indicates the direction of the relationship. As was
stated earlier, if the scores are moving in the same direction, the regression
line will go from lower left to upper right, and it is a positive relationship;
hence the positive sign (+). If the scores are moving in opposite directions,
the regression line will go from upper left to lower right, and it is a
negative relationship; hence the negative sign (-).
The number indicates the strength of
the relationship. The closer the number is to 1, the stronger the relationship.
In the first example, because all points are on the line, the correlation
would be 1, indicating a perfect relationship. This rarely happens. In
the second example, the correlation would be above .9 because the points
are very close to the line. The further the points are from the line, the
weaker the relationship and the lower the number.
You will be asked to interpret correlations,
and the expectation is that you will address both the direction and the
strength of the relationship. A correct answer for the second example, assuming
I told you the correlation was -.95, would be: "As scores on the Perceived
Anxiety Scale go up, scores on the mid-term exam go down. This is a strong
relationship."
You must (1) use the names of the measures
given, (2) address the direction of the regression line, and (3) judge
the strength of the relationship. You can use the term weak for correlations
below .3, moderate for those falling between .3 and .8, and strong for
those above .8.
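The rule of thumb above can be written out as a small Python sketch. The thresholds follow the notes (weak below .3, moderate between .3 and .8, strong above .8); the function name and example value are my own.
```python
# Minimal sketch of the interpretation rule of thumb described above.
def describe_correlation(r: float) -> str:
    if r > 0:
        direction = "as one score goes up, the other goes up"
    else:
        direction = "as one score goes up, the other goes down"
    size = abs(r)
    if size > 0.8:
        strength = "strong"
    elif size >= 0.3:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{direction}; this is a {strength} relationship"

# e.g., the -.95 example from the notes
print(describe_correlation(-0.95))
```
Remember that a full answer on the exercises must also name the two measures involved, not just the direction and strength.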
Go to Exercise 3
3.
Construct Validity
Construct Validity is the degree to
which certain psychological traits or constructs are actually represented
by the test performance.
What is a construct? It is a psychological
trait that is NOT directly observable, but is believed to exist based on
observable behaviors made in response to the psychological trait.
-
First, define the domain or tasks to be
measured.
-
Second, analyze the mental processes required
by the tasks. For example, people with high anxiety will sweat more, will perform
less efficiently, etc. Then, decide how to measure sweat, performance efficiency,
etc.
-
Third, compare the scores of known groups.
For example, give the test to a group known to be high in anxiety and a
group known to be low in anxiety. Scores should differ greatly between the
groups.
-
Fourth, compare the scores before and after
"treatment". For example, give a treatment known to reduce anxiety and give the test
again. Scores should be lower for those undergoing treatment.
-
Fifth, correlate scores with other measures
that are supposed to measure the same thing.
You may realize that the first and second
steps listed here parallel those for content validity and the third through
fifth steps parallel those of test-criterion validity. Construct validity
does encompass both content and test-criterion validity.
Best Evidence -- Correlation Coefficient
and the Table of Specifications
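Here is a minimal sketch of the known-groups comparison (the third step above) in Python. The anxiety scores for the two groups are invented for illustration.
```python
# Minimal sketch of a known-groups comparison with hypothetical scores.
from statistics import mean

high_anxiety_group = [42, 45, 47, 50, 44]   # people known to be high in anxiety (invented)
low_anxiety_group  = [18, 22, 20, 25, 21]   # people known to be low in anxiety (invented)

diff = mean(high_anxiety_group) - mean(low_anxiety_group)
print(f"mean difference between known groups: {diff:.1f}")

# A large, consistent difference in the expected direction is evidence that the
# scores reflect the construct (here, anxiety) they are supposed to measure.
```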
4.
Consequences
What are the consequences of using
the test results?
High-stakes decisions vs. low-stakes
decisions: high-stakes decisions are those which directly impact the direction of
your future. For example, the tests used to identify students for special
education, college entrance exams, and the High School Competency Test (HSCT)
are considered high-stakes. If you do not pass these, the future of your
life is impacted directly. Classroom tests are NOT high stakes. We never
use the results from one classroom test to determine pass/fail or admittance
to special education.
Intended as well as unintended consequences
must be considered by teachers.
Teachers can judge this validity because
they have the following information. They:
a) know the learning objectives
b) know the learning experiences of their
students
c) have made observations of students
They should consider:
1) Do the tasks match important learning
objectives?
2) Do students study harder in preparation for the assessment?
3) Does the assessment artificially constrain the focus of students' study?
4) Does the assessment encourage or discourage exploration and creativity?
Best Evidence -- Teacher Judgment
5. Face
Validity
Does the test look like it measures
what it is supposed to measure?
Example
If you entered my class for a mid-term
exam and were asked to take out a blank sheet of paper and draw the
face of the person to the left of you, you would question the validity of the
test simply because it does not appear to reflect the content you have
been taught. This is most important
to the person taking the test. Most of the time, the test should have face
validity. There are instances where the test should not have face validity;
for example, when the test is used to determine whether or not you have
a split personality, are a kleptomaniac, or exhibit some other socially unacceptable
behavior.
NOTE:
When a test is evaluated for validity, you should remember that not all
tests will use all five types of validity.
Questions:
-
Which type of validity is most relevant
in the measurement of academic achievement?
-
Which type of validity is most relevant
in the measurement of future success in a related area?
-
Which type of validity is most relevant
in the measurement of a psychological trait?
Factors
in the Instrument Which Influence Validity
1. Unclear directions
2. Reading vocabulary / sentence structure
too difficult
3. Ambiguity
4. Inadequate time limits
5. Inappropriate level of difficulty
of items
6. Poorly constructed test items
7. Inappropriate items for outcomes
being measured
8. Test too short
9. Improper arrangement of items
10. Identifiable pattern of answers
Something to think
about . . .
Are you testing what you intended to test, or are you testing something
else?