Demystifying Exam Scoring: Translating Raw Scores to Scaled Scores

Liberty Munson (Microsoft)

In response to my "Dissecting Score Reports" series, I've received a lot of questions and comments, and seen some confusion, about how questions are scored and how that translates to the score you receive. Here are the key points about question and exam scoring:

1) All questions are worth 1 point unless otherwise noted in the text of the question. We recently added some polytomously scored questions on several of our exams. What does "polytomously scored" mean? These are questions that are worth multiple points, and you can earn all, none, or some of those points. Usually a point is awarded for each action that you take. For example, if we ask you to match a word to its definition, you would get a point for each correct match. Currently, most polytomously scored items are really just several multiple choice questions that we've combined into a single question because this is a better experience for the test taker. Imagine if we had 4 word/definition matches that we wanted to test. It makes more sense to include all 4 in a single question rather than asking 4 separate questions to assess this knowledge. This is different from weighting. Weighting is an "all or none" proposition. You either get all the points for that question or none of them. We currently do not have any items like this on our exams.

2) Your final score is simply the sum of the number of points that you earned for each item. This is called a "raw score."

3) Your raw score is translated to a scaled score that ranges from 0-1000 using a simple mathematical conversion. How points are distributed across this range depends on where the passing score is set. Because we use 700 as our common passing score, the raw points below the passing score are equally distributed between 0 and 700, while the raw points at or above it are equally distributed from 700 to 1000.
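
Written out as a formula, the conversion in point 3 might look like the sketch below. The symbols are this example's own notation, not official terminology: r is the raw score, c is the raw passing (cut) score, m is the maximum raw score, and S(r) is the scaled score. Rounding down to a whole number is an assumption inferred from the sample scoring tables later in this post; the post does not state the rounding rule.

```latex
S(r) =
\begin{cases}
\left\lfloor \dfrac{700\, r}{c} \right\rfloor, & 0 \le r < c \\[1ex]
700 + \left\lfloor \dfrac{300\,(r - c)}{m - c} \right\rfloor, & c \le r \le m
\end{cases}
```

With this form, r = c always gives exactly 700 and r = m always gives exactly 1000, no matter where the cut score falls.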

Why do we scale scores? Passing scores are not arbitrarily set! (Remember 700 does NOT mean 70%!) We use input from subject matter experts who review the difficulty of the questions in the item pool in relation to the skills and abilities of the target audience and provide guidance on where the passing score should be set. As a result, the actual number of questions that you have to answer correctly to pass may vary from one attempt to another if the difficulty of the question set changes. In other words, if you see a more difficult set of questions, it's hardly fair to expect you to be able to answer the same percentage correct as someone who sees an easier set of questions. Because of this, if we simply reported percentages, you wouldn't be able to compare your scores across time because a higher percentage on an easier set of items doesn't mean that you are doing better on the exam than a lower percentage on a more difficult set of items. By the way, this is an industry standard/best practice. If you take an exam and they don't provide scaled scores, the first question you should ask is "how are they ensuring that each administration is psychometrically equivalent and equally difficult?"

This is why you can't interpret your score as a percent of questions answered correctly, and it should never be related to a "grade" that you might get in school.

A (simple?) example:

Imagine that I have a 25-point exam. My scores across these items look like this: 1,0,0,1,1,1,1,3,0,0,1,2,1,0,0,1,1 (yes, this is only 17 items, but there are some polytomously scored items on this exam; the question where I earned 3 points was worth 5, the next question where I earned 0 was worth 3, and the last polytomously scored question, where I earned 2, was worth 3 points; all other questions were worth 1 point, for 25 total points).

So, my raw score is 14 (add up the numbers). Imagine that the raw cut score that I needed to pass was 17. The score that I see on my score report is 576. The scoring table looks like this:

Raw Score   Scaled Score   Passing Status
25          1000           pass
24          962            pass
23          925            pass
22          887            pass
21          850            pass
20          812            pass
19          775            pass
18          737            pass
17          700            pass
16          658            fail
15          617            fail
14          576            fail
13          535            fail
12          494            fail
11          452            fail
10          411            fail
9           370            fail
8           329            fail
7           288            fail
6           247            fail
5           205            fail
4           164            fail
3           123            fail
2           82             fail
1           41             fail
0           0              fail

As you can see, the raw scores below the cut score (0-16) are evenly spaced between 0 and 700, and those from the cut score up (17-25) are evenly spaced between 700 and 1000.
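
For readers who want to check the arithmetic, here is a minimal Python sketch that reproduces the example above. The function name scale_score and the constants are made up for illustration, and dropping the fractional part with integer division is an assumption inferred from the table values; this is not the actual scoring code.

```python
# Sketch of the two-segment linear scaling described in the post.
# PASSING_SCALED, MAX_SCALED, and scale_score are hypothetical names.
PASSING_SCALED = 700
MAX_SCALED = 1000


def scale_score(raw: int, cut: int, max_raw: int) -> int:
    """Map a raw score to the 0-1000 scale so that `cut` lands on 700."""
    if not 0 < cut < max_raw:
        raise ValueError("cut must be between 1 and max_raw - 1")
    if not 0 <= raw <= max_raw:
        raise ValueError("raw must be between 0 and max_raw")
    if raw < cut:
        # Below the cut: spread raw points evenly over 0-700.
        return PASSING_SCALED * raw // cut
    # At or above the cut: spread raw points evenly over 700-1000.
    # Integer division drops the fraction (assumed from the table values).
    span = MAX_SCALED - PASSING_SCALED
    return PASSING_SCALED + span * (raw - cut) // (max_raw - cut)


# Points earned on each of the 17 items in the example.
earned = [1, 0, 0, 1, 1, 1, 1, 3, 0, 0, 1, 2, 1, 0, 0, 1, 1]
raw_score = sum(earned)
scaled = scale_score(raw_score, cut=17, max_raw=25)
status = "pass" if scaled >= PASSING_SCALED else "fail"
print(raw_score, scaled, status)  # 14 576 fail
```

Calling scale_score for every raw score from 0 to 25 with cut=17 should give back the same scaled values as the table.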

What happens if the passing score changes to 12?

Raw Score   Scaled Score   Passing Status
25          1000           pass
24          976            pass
23          953            pass
22          930            pass
21          907            pass
20          884            pass
19          861            pass
18          838            pass
17          815            pass
16          792            pass
15          769            pass
14          746            pass
13          723            pass
12          700            pass
11          641            fail
10          583            fail
9           525            fail
8           466            fail
7           408            fail
6           350            fail
5           291            fail
4           233            fail
3           175            fail
2           116            fail
1           58             fail
0           0              fail

Again, the raw points below the passing score are evenly distributed from 0-700, while those at the passing score and above are evenly distributed between 700-1000. Don't confuse this process with weighting. Each point earned on the exam is worth 1 point regardless of whether it's earned through a dichotomously scored item (correct or incorrect) or a polytomously scored item (multiple points possible), and those points are scaled through a mathematical conversion that allows for comparisons of your testing events across time. Even though it looks like points are given a weight in this process, they are NOT weighted. This is just math.
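
One way to see why it can look like weighting: the number of scaled points that one raw point is "worth" depends only on which side of the passing score it falls and where that passing score sits, not on which question earned it. The short Python sketch below computes those step sizes for the two cut scores used above; the function name points_per_raw_point is invented for this illustration.

```python
def points_per_raw_point(cut: int, max_raw: int) -> tuple:
    """Return scaled points per raw point as (below the cut, at or above the cut)."""
    below = 700 / cut                # 0-700 shared by the `cut` steps up to the cut
    above = 300 / (max_raw - cut)    # 700-1000 shared by the steps above the cut
    return below, above


for cut in (17, 12):
    below, above = points_per_raw_point(cut, max_raw=25)
    print(f"cut={cut}: {below:.2f} below, {above:.2f} above")
# cut=17: 41.18 below, 37.50 above
# cut=12: 58.33 below, 23.08 above
```

Every raw point on the same side of the passing score moves the scaled score by the same amount, which is the sense in which no individual question carries extra weight.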

So, clearer? Muddier? What other questions do you have about this?

  • 8b345bcf-dd06-4c3b-a4cc-d960299793c4

    So, the scores are not weighted, they are just distributed across a differing mathematical range depending on the passing score which is scaled based on the difficulty of the question set.  Hmm, I am pretty sure that is actually the definition of weighting.

  • Demyan1

    Hi Liberty,

    What is the probability of getting the same score twice? I shared my statistically puzzling recent experience with 070-467 in this post.

    Can you actually calculate the number (for 070-467) and, should it be sufficiently low, investigate if there could be an operational problem with the test scoring?

    Thank you.

  • Demyan1

    ... and no response.

  • Liberty Munson (Microsoft)

    Hi Dimitri,

    I thought I had responded to this. Getting the same score twice, although rare, is not unexpected. It actually demonstrates the reliability of the exam: all things being equal, a reliable exam is one on which someone earns the same score across repeated attempts.


  • Demyan1

    I think what you meant was that candidates with the same (unobserved) level of preparation should be getting similar scores. When it's literally the same candidate taking the test, an i.i.d. assumption, to use a statistical term, would be inappropriate, by virtue of the candidate *learning* between the attempts. (Most people would bet on the repeat test taker, not on the first-timer.)

    If the final-score scale is coarser than the raw-score scale - my reading of the passage starting with "Because we use 700 as our common passing score…" is that you discretize the 0-700 and 700-1000 raw-score ranges into N buckets, ending up with 2N possible final scores - one could improve the raw score but not "get across" to the higher final-score "notch", but it's difficult to judge the probability without knowing how coarse the final-score scale is.

  • Liberty Munson (Microsoft)

    Hi Dimitri,

    I don't completely understand your comment. The purpose of scaling scores is that you can actually determine how your performance has changed between attempts. Because the difficulty of the question set that you see in one attempt may vary from the next, reporting raw scores or percentages is meaningless in terms of your ability to see whether your performance has improved. A lower raw score or percentage on a more difficult set of questions might actually mean improved performance if the first set of questions was a lot easier. Scaling allows you to see improvements (or not) by putting all attempts on the same scale or metric.

    And, yes, you are technically correct about reliability. That's why I said "all things being equal, getting the same score actually suggests a reliable exam." We do know that candidates (well, most of them) likely study between attempts, so all things are not equal. As a result, getting the same score multiple times is not typical or common, but it does happen, usually because people focus on the sections of the exam where they were weakest in the previous attempt and spend less time, effort, or thought on those they did well on. The net effect is that they do slightly better in one area and slightly less well in another.

  • Demyan1

    My score on attempt 2 should be comparable with my score on attempt 1, or with candidate X's score on attempt Y - no questions about that, i.e. about the purpose of weighting.

    My broad point is that you would expect improvement between attempts. To take a stark example, if you had tests A and B, and for test A, 90% of candidates re-taking the test improved their score, and for test B, only 50% did, you would want to understand what's driving the difference. (My guess would be: ill-defined exam requirements for test B) ... BUT of course I cannot make this claim about 070-467 based on a single observation.

    On the discretization comment, an example would be: I could score 101 raw points on attempt 1 and 199 raw points on attempt 2, but if the final score were based on 100-point-wide raw-score buckets - say, 1 to 100, 101 to 200, etc. - my final score would not improve.    

    Thank you for looking into this, I can start preparing for attempt 3 (almost) without fearing another 574 :)