Iowa Testing Programs - The University of Iowa College of Education

Interpreting Test Scores

This page describes which scores to use to accomplish each of several purposes and tells what the different types of scores mean.

Three of the fundamental purposes for testing are (1) to describe each student's developmental level within a test area, (2) to identify a student's areas of relative strength and weakness in subject areas, and (3) to monitor year-to-year growth in the basic skills. To accomplish any one of these purposes, it is important to select the type of score from among those reported that will permit the proper interpretation. Scores such as percentile ranks, grade equivalents, and standard scores differ from one another in the purposes they can serve, the precision with which they describe achievement, and the kind of information they provide. A closer look at these types of scores will help differentiate the functions they can serve and the meanings they can convey. Additional detail can be found in the ITED Interpretive Guide for Teachers and Counselors.

In Iowa, school districts can obtain scores that are reported using national norms or Iowa norms. On some reports, both kinds of scores are reported. The difference is simply in the group with which comparisons are made to obtain score meaning. A student's Iowa percentile rank (IPR) compares the student's score with those of others in his/her grade in Iowa. The student's national percentile rank (NPR) compares that same score with those of others in his/her grade in the nation. For other types of scores described below, there are both Iowa and national scores available to Iowa schools.

Types of Scores

Raw Score (RS)
The number of questions a student gets right on a test is the student's raw score (assuming each question is worth one point). By itself, a raw score has little or no meaning. The meaning depends on how many questions are on the test and how hard or easy the questions are. For example, if Kati got 10 right on both a math test and a science test, it would not be reasonable to conclude that her level of achievement in the two areas is the same. This illustrates why raw scores are usually converted to other types of scores for interpretation purposes.

Percent Correct (PC)
When the raw score is divided by the total number of questions and the result is multiplied by 100, the percent-correct score is obtained. Like raw scores, percent-correct scores have little meaning by themselves. They tell what percent of the questions a student got right on a test, but unless we know something about the overall difficulty of the test, this information is not very helpful. Percent-correct scores are sometimes incorrectly interpreted as percentile ranks, which are described below. The two are quite different.
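The arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the function name and the test lengths (20 and 40 questions) are assumptions made for the example, not values taken from any ITBS or ITED test.

```python
def percent_correct(raw_score, total_questions):
    """Return the percent-correct score for a raw score."""
    if total_questions <= 0:
        raise ValueError("total_questions must be positive")
    if not 0 <= raw_score <= total_questions:
        raise ValueError("raw_score must be between 0 and total_questions")
    return 100 * raw_score / total_questions


# Hypothetical test lengths: the same raw score of 10 yields
# different percent-correct scores on tests of different length.
math_pc = percent_correct(10, 20)
science_pc = percent_correct(10, 40)
print(math_pc, science_pc)
```

Even with the test length accounted for, neither number says how hard the questions were, which is why a percent-correct score still needs a reference point before it can be interpreted.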

Grade Equivalent (GE)
The grade equivalent is a number that describes a student's location on an achievement continuum. The continuum is a number line that describes the lowest level of knowledge or skill on one end (lowest numbers) and the highest level of development on the other end (highest numbers). The GE is a decimal number that describes performance in terms of grade level and months. For example, if a ninth-grade student obtains a GE of 10.4 on the Vocabulary test, his score is like the one a typical student finishing the fourth month of tenth grade would likely get on the Vocabulary test. The GE of a given raw score on any test indicates the grade level at which the typical student makes this raw score. The digits to the left of the decimal point represent the grade and those to the right represent the month within that grade.
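As a rough sketch of how the two parts of a GE are read, the following Python function splits a GE into its grade and month. The function name is hypothetical, and the sketch assumes a plain decimal GE such as 10.4; it does not handle capped values such as 13+.

```python
from decimal import Decimal


def split_grade_equivalent(ge):
    """Split a GE such as 10.4 into (grade, month within that grade)."""
    value = Decimal(str(ge))  # Decimal avoids binary floating-point rounding
    grade = int(value)
    month = int((value - grade) * 10)
    return grade, month


# The ninth-grader's Vocabulary GE from the example above.
grade, month = split_grade_equivalent("10.4")
print(f"Grade {grade}, month {month}")
```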

The median raw score of students tested in the spring of each grade is assigned a GE that represents the eighth month of that grade. Thus, the GE that corresponds to the median raw score of students tested in the spring of eighth grade is 8.8; for ninth grade that GE is 9.8, for tenth grade 10.8, and so on. By definition, the average yearly growth is 1.0 units (10 months). High-achieving students should be expected to gain more than 10 months in a year, while low-achieving students should be expected to gain less than 10 months in a year.
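Because one GE unit corresponds to 10 months, growth between two testings can be expressed in months. The sketch below assumes both GEs are plain decimals; the function name and the second pair of scores are made up for illustration.

```python
from decimal import Decimal


def ge_growth_in_months(earlier_ge, later_ge):
    """Return growth between two GEs in months (1.0 GE unit = 10 months)."""
    difference = Decimal(str(later_ge)) - Decimal(str(earlier_ge))
    return int(difference * 10)


# Median spring performance in eighth grade versus ninth grade:
# one GE unit, or 10 months, by definition.
typical_growth = ge_growth_in_months("8.8", "9.8")

# A hypothetical student who moved from 8.8 to 10.1 gained more than that.
student_growth = ge_growth_in_months("8.8", "10.1")
print(typical_growth, student_growth)
```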

GE scores are most useful to elementary educators. Because of the generally common curriculum at the elementary level, a score expressed in terms of grade level can meaningfully describe a student's developmental status and growth from one year to the next. As that common curriculum becomes more complex in the upper grades, these scores can provide important information because they refer to a continuum of typical achievement at each grade level. However, at the secondary level, GEs are not particularly useful for two major reasons. First, at this level, a common curriculum no longer exists. Second, GEs above grade 12 have been extrapolated and have no relationship to actual years. In fact, the maximum reported GE for Forms A and B is 13+. Given these limitations, most high schools will probably prefer to report ITED results in standard score units. The standard score scale used with ITED is discussed next.

Developmental Standard Score (SS)
The developmental standard score (SS) is another score used to describe the location of a student's test performance on an achievement continuum. The table below shows the national SSs that have been assigned to typical performance of grade groups on each test at grades 4-12 in the spring of the year.

Grade   4    5    6    7    8    9    10   11   12
SS      200  214  227  239  250  260  268  275  280

The scale shows that average annual growth decreases as students move up from one grade to the next. For example, the average growth from grade 8 to grade 9 is ten standard-score points, but from grade 11 to grade 12 the average is only five points. Because it is widely believed that the rate of growth in most achievement areas decreases as grade level increases, the developmental standard score scale seems to reflect typical student development more realistically than the GE scale, which, as noted above, presumes equal annual growth between any pair of grades.
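The shrinking year-to-year differences can be checked directly from the benchmark values. The sketch below copies the national SS values for grades 4-12 given in this section; the variable and function names are hypothetical.

```python
# National spring SS benchmarks for grades 4-12, as listed in this section.
NATIONAL_SS_BY_GRADE = {
    4: 200, 5: 214, 6: 227, 7: 239, 8: 250,
    9: 260, 10: 268, 11: 275, 12: 280,
}


def annual_growth(ss_by_grade):
    """Return the SS difference between each pair of adjacent grades."""
    grades = sorted(ss_by_grade)
    return {
        (lower, upper): ss_by_grade[upper] - ss_by_grade[lower]
        for lower, upper in zip(grades, grades[1:])
    }


growth = annual_growth(NATIONAL_SS_BY_GRADE)
for (lower, upper), points in growth.items():
    print(f"Grade {lower} to {upper}: {points} points")
```

The differences fall steadily from 14 points (grade 4 to grade 5) to 5 points (grade 11 to grade 12).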

The main disadvantage to using SSs in interpreting students' test performances is that the scores have no immediately apparent reference for their meaning. Unlike GEs, which incorporate grade level into the score, SSs use a somewhat arbitrary scale, and parents and students will no doubt need help in interpreting them. To interpret the SS, the values associated with typical performance in each grade must be used as benchmarks. Thus, an SS of 261 on the ITED science test means that the student's science performance is about like that shown by the typical ninth-grade student tested in the spring.
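One way to picture the benchmark lookup is to find the grade whose typical spring SS lies closest to a student's score. This is a simplified sketch, not the procedure used in ITED reports; the names are hypothetical, and a score exactly halfway between two benchmarks is assigned to the lower grade.

```python
# National spring SS benchmarks for grades 4-12, as listed in this section.
NATIONAL_SS_BY_GRADE = {
    4: 200, 5: 214, 6: 227, 7: 239, 8: 250,
    9: 260, 10: 268, 11: 275, 12: 280,
}


def nearest_benchmark_grade(ss, ss_by_grade=NATIONAL_SS_BY_GRADE):
    """Return the grade whose typical spring SS is closest to the given SS."""
    return min(ss_by_grade, key=lambda grade: abs(ss_by_grade[grade] - ss))


# The science score from the example above.
print(nearest_benchmark_grade(261))
```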

In summary, although SSs are more difficult to interpret than GEs, they are more appropriate for measuring individual growth at the secondary level. In addition, SSs can be averaged for making group comparisons and for monitoring the change of grade groups over time.

Percentile Rank (PR)
A student's percentile rank is a score that tells the percent of students in a particular group that got lower raw scores on a test than the student did. It shows the student's relative position or rank in a group of students who are in the same grade and who were tested at the same time of year (fall, midyear, or spring) as the student. Thus, for example, if Toni earned a percentile rank of 72 on the Science test, it means that she scored higher than 72 percent of the students in the group with which she is being compared. Of course, it also means that the remaining 28 percent of the group scored as high as or higher than Toni. Percentile ranks range from 1 to 99.
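The definition above can be sketched as a small Python function. This follows only the description given here (percent of the group with lower scores, reported from 1 to 99); published norms come from conversion tables, and the function name and the comparison group below are invented for illustration.

```python
def percentile_rank(score, group_scores):
    """Percent of the group scoring lower than `score`, kept within 1 to 99."""
    if not group_scores:
        raise ValueError("group_scores must not be empty")
    below = sum(1 for other in group_scores if other < score)
    rank = round(100 * below / len(group_scores))
    return max(1, min(99, rank))


# Hypothetical comparison group: 100 students with raw scores 1 through 100.
group = list(range(1, 101))
print(percentile_rank(73, group))   # 72 students scored lower
print(percentile_rank(100, group))  # the top score still reports as 99
```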

A student's percentile rank can vary depending on which group is used to determine the ranking. A student is simultaneously a member of many different groups: all students in her classroom, her building, her school district, her state, and the nation. Different sets of percentile ranks are available with the ITED to permit schools to make the most relevant comparisons involving their students.

Types of Score Interpretation

An achievement test is built to help determine how much skill or knowledge students have in a certain area. We use such tests to find out whether students know as much as we expect they should, or whether they know particular things we regard as important. By itself, the raw score from an achievement test does not indicate how much a student knows or how much skill she or he has. More information is needed to decide "how much." The test score must be compared or referenced to something in order to bring meaning to it. That "something" typically is (a) the scores other students have obtained on the test or (b) a series of detailed descriptions that tell what students at each score point know or which skills they have successfully demonstrated. These two ways of referencing a score to obtain meaning are commonly called norm-referenced and criterion-referenced score interpretations.

Norm-Referenced Interpretation
Standardized achievement batteries like the ITBS and ITED are designed mainly to provide for norm-referenced interpretations of the scores obtained from them. For this reason they are commonly called norm-referenced tests. However, the scores also permit criterion-referenced interpretations, as do the scores from most other tests. Thus, norm-referenced tests are devised to enhance norm-referenced interpretations, but they also permit criterion-referenced interpretation.

A norm-referenced interpretation involves comparing a student's score with the scores other students obtained on the same test. How much a student knows is determined by the student's standing or rank within the reference group. High standing is interpreted to mean the student knows a lot or is highly skilled, and low standing means the opposite. Obviously, the overall competence of the norm group affects the interpretation significantly. Ranking high in an unskilled group may represent lower absolute achievement than ranking low in an exceptionally high-performing group.

Most of the scores on ITBS and ITED score reports are based on norm-referencing, i.e., comparing with a norm group. In the case of percentile ranks, stanines, and normal curve equivalents, the comparison is with a single group of students in a certain grade who tested at a certain time of year. These are called status scores because they show a student's position or rank within a specified group. However, in the case of grade equivalents and developmental standard scores, the comparison is with a series of reference groups. For example, the performances of students from ninth grade, tenth grade, eleventh grade, and twelfth grade are linked together to form a developmental continuum. (In reality, the scale is formed with grade groups from kindergarten up through the end of high school.) These are called developmental scores because they show the students' positions on a developmental scale. Thus, status scores depend on a single group for making comparisons and developmental scores depend on multiple groups that can be linked to form a growth scale.

An achievement battery like the ITBS or ITED is a collection of tests in several subject areas, all of which have been standardized with the same group of students. That is, the norms for all tests have been obtained from a single group of students at each grade level. This unique aspect of the achievement battery makes it possible to use the scores to determine skill areas of relative strength and weakness for individual students or class groups, and to estimate year-to-year growth. The use of a battery of tests having a common norm group enables educators to make statements such as "Suzette is better in mathematics than in reading" or "Danan has shown less growth in language skills than the typical student in his grade." If norms were not available, there would be no basis for statements like these.

Norms also allow students to be compared with other students and schools to be compared with other schools. If making these comparisons were the sole reason for using a standardized achievement battery, then the time, effort, and cost associated with testing would have to be questioned. However, such comparisons do give educators the opportunity to look at the achievement levels of students in relation to a nationally representative student group. Thus, teachers and administrators get an "external" look at the performance of their students, one that is independent of the school's own assessments of student learning. As long as our population continues to be highly mobile and students compete nationally rather than locally for educational and economic opportunities, student and school comparisons with a national norm group should be of interest to students, parents, and educators.

Scores from a norm-referenced test do not tell what students know and what they do not know. They tell only how a given student's knowledge or skill compares with that of others in the norm group. Only after reviewing a detailed content outline of the test or inspecting the actual items is it possible to make interpretations about what a student knows. This caveat is not unique to norm-referenced interpretations, however. In order to use a test score to determine what a student knows, we must examine the test tasks presented to the student and then infer or generalize about what he or she knows.

Criterion-Referenced Interpretation
A criterion-referenced interpretation involves comparing a student's score with a subjective standard of performance rather than with the performance of a norm group. Deciding whether a student has mastered a skill or demonstrated minimum acceptable performance involves a criterion-referenced interpretation. Usually percent-correct scores are used and the teacher determines the score needed for mastery or for passing.

Even though the tests in the ITBS and ITED batteries were not developed primarily for criterion-referenced purposes, it is still appropriate to use the scores in those ways. Before doing so, however, the user must establish some performance standards (criterion levels) against which comparisons can be made. For example, how many math estimation questions does a student need to answer correctly before we regard his/her performance as acceptable or "proficient"? This can be decided by examining the test questions on estimation and making a judgment about how many the minimally prepared student should be able to get right. The percent of estimation questions identified in this way becomes the criterion score to which each student's percent-correct score should be compared.
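The comparison described above amounts to checking a percent-correct score against a locally chosen criterion. In the sketch below, the function name, the number of estimation questions (10), and the criterion (70 percent) are all assumptions chosen for illustration; an actual criterion would come from teacher judgment about the test questions.

```python
def meets_criterion(number_correct, number_of_questions, criterion_percent):
    """Return True if the percent-correct score reaches the criterion."""
    if number_of_questions <= 0:
        raise ValueError("number_of_questions must be positive")
    percent = 100 * number_correct / number_of_questions
    return percent >= criterion_percent


# Hypothetical skill with 10 estimation questions and a 70 percent criterion.
print(meets_criterion(7, 10, 70))  # 70 percent correct
print(meets_criterion(6, 10, 70))  # 60 percent correct
```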

When making a criterion-referenced interpretation, it is critical that the content area covered by the test -- the domain -- be described in detail. It is also important that the test questions for that domain cover the important areas of the domain. In addition, there should be enough questions on the topic to provide the students ample opportunity to show what they know and to minimize the influence of errors in their scores.

Most of the tests in batteries like the ITBS or ITED cover such a wide range of content or skills that good criterion-referenced interpretations are difficult to make with the test scores. For example, the Sources of Information test covers too many discrete topics to permit useful criterion-referenced interpretations with scores from the whole test. However, in most tests the separate skills are defined carefully and there are enough questions measuring them to make good criterion-referenced interpretations of the skill scores possible. There are exceptions: some skills, such as Government Sources, have only a small number of questions covering a broad topic, so there may be too few questions to make sound judgments about mastery.

The percent-correct score is the type used most widely for making criterion-referenced interpretations. Criterion scores that define various levels of performance on the tests are generally percent-correct scores arrived at through teacher analysis and judgment. Several score reports available from Iowa Testing Programs include percent-correct skill scores that can be used to make criterion-referenced interpretations: Student Skills Analysis, Building Skills Analysis, Group Item Analysis, Individual Performance Profile, and Group Performance Profile.

Interpreting Scores from Special Test Administrations
A testing accommodation is a change in the procedures for administering the test that is intended to neutralize, as much as possible, the effect of the student's disability on the assessment process. The intent is to remove the effect of the disability or disabilities, to the extent possible, so that the student is assessed on equal footing with all other students. In other words, the score reflects what the student knows, not merely what the student's disabilities allow him/her to show.

The expectation is that the accommodation will cancel the disadvantage associated with the student's disability. This is the basis for choosing the type and amount of accommodation to be given to a student. Sometimes the accommodation won't help quite enough, sometimes it might help a little too much, and sometimes it will be just right. We never can be sure, but we operate as though we have made a good judgment about how extensive a student's disability is and how much it will interfere with obtaining a good measure of what the student knows. Therefore, the use of an accommodation should help the student experience the same conditions as those in the norm group. Thus, the norms still offer a useful comparison; the scores can be interpreted in the same way as the scores of a student who needs no accommodations.

A test modification involves changing the assessment itself so that the tasks or questions presented are different from those used in the regular assessment. A Braille version of a test modifies the questions just as a translation to another language might. Helping students with word meanings, translating words to a native language, or eliminating parts of a test from scoring are further examples of modifications. These are not accommodations, and in such cases the published test norms are not appropriate to use. With modifications, the percentile ranks or grade equivalents should not be interpreted in the same way as they would be had no modifications been made.

Certain other kinds of changes in the tests or their presentation may result in measuring a different trait than was originally intended. For example, when a reading test is read to the student, we obtain a measure of how well the student listens rather than how well he/she reads. Or if the student is allowed to use a calculator on a math estimation test, we obtain a measure of computation ability with a calculator rather than a measure of the student's ability to do mental arithmetic. Obviously, in these situations there are no norms available and the scores are quite limited in value. Consequently, these particular changes should not be made.

Copyright © The University of Iowa College of Education