Saturday, June 16, 2007

THE FALLACY OF "HARD" TESTS

(Ithacaunleashed.blogspot.com)

A great deal of fuss is often made about failing the bar exam. The news a few weeks ago was that Governor Pataki's daughter passed the exam, but it is always mentioned that it was her second try. Similarly, John Kennedy, Jr. failed the New York bar exam twice before finally passing it on his third try.

As one who took several medical licensure and specialist exams, and the Virginia bar exam, passing all, I might be inclined to pat myself on the back, but my former background as a mathematician won’t let me do that. I do remember, however, some remarks from a noted orthopedic surgeon about his own specialty exam: “It was a hellishly hard test, and went on for hours,” he said, “but I’m really glad I passed the first time I took it. Only about 35 percent who took it passed the exam.”

He was describing, with only the slightest tinge of boastfulness, the qualifying exam for specialists in orthopedic surgery. Passing the exam entitled one to join the “college” of orthopedic surgeons, and list oneself as a specialist.

“Was it all multiple choice?” I asked. “And how did they grade it?” I was thinking of my own exams. “Did they count only the right answers?”

When he said Yes to all the questions, I did not have the heart to tell him what I knew as a mathematical certainty: that the exam was, like most graduate medical exams, and large parts of the bar exams in most states, virtually a complete fraud.

The reason these tests are fraudulent—and the harder they are, the more they are fraudulent—is that for an extremely difficult test graded in that way, guessing tends to count much more than knowledge.

A simple, if extreme, example shows why this is the case.

Suppose you and I take a test, and you know twice as much as I do. For simplicity (this is the extreme case) suppose the test consists of 100 questions, each True or False, and moreover (this is the key point), let us agree that the test will be graded by only counting the number right.

Naturally, both of us will guess at an answer for those questions that stump us.

Now suppose the test is very hard. As hard as it could be actually. Suppose the test is so hard that I, with lesser knowledge, can only answer one question based on actual knowledge. I answer that question, and guess at the other 99. You, who know twice as much as I, can answer two questions based on knowledge. So you guess at 98 answers.

As you can readily imagine, your edge over me is very slight. In fact, in repeated trials I would outscore you about 44 percent of the time, and tie you about 6 percent of the time, even though my knowledge is half of yours.
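For anyone who wants to check the odds rather than take my word for it, here is a minimal sketch in Python. It assumes exactly the setup above: 100 True-False questions, every unknown question answered by a fair coin flip, and number-right grading. The function names are made up for this illustration.

```python
from math import comb

def score_distribution(known, total=100, p_guess=0.5):
    """Map each possible number-right score to its probability.

    The score is the questions known for certain plus the lucky guesses,
    where each of the remaining questions is guessed independently.
    """
    guesses = total - known
    return {
        known + lucky: comb(guesses, lucky)
        * p_guess**lucky * (1 - p_guess) ** (guesses - lucky)
        for lucky in range(guesses + 1)
    }

def head_to_head(known_you, known_me, total=100, p_guess=0.5):
    """Return (P(you score higher), P(tie), P(I score higher))."""
    you = score_distribution(known_you, total, p_guess)
    me = score_distribution(known_me, total, p_guess)
    you_higher = sum(
        py * pm
        for sy, py in you.items()
        for sm, pm in me.items()
        if sy > sm
    )
    tie = sum(py * me.get(sy, 0.0) for sy, py in you.items())
    return you_higher, tie, 1.0 - you_higher - tie

you_higher, tie, me_higher = head_to_head(2, 1)
print(f"you higher: {you_higher:.3f}, tie: {tie:.3f}, me higher: {me_higher:.3f}")
# prints: you higher: 0.500, tie: 0.057, me higher: 0.443
```

Changing the two arguments of head_to_head (say, to 20 and 10, or 60 and 30) shows how quickly the better-informed candidate pulls ahead once the test stops being impossibly hard.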

I chose a True-False test for this example, but it doesn’t make any real difference were the test to be multiple choice with several choices in each question. The only thing that makes a difference is how hard the test is. Your advantage would grow substantially as the test was made easier.

As a further example, suppose the test were so easy, and you so well-versed in the subject, that you could get a perfect score, while I knew half as much. I would answer 50 questions based on knowledge, and guess at 50. In the long run, I would get half of those 50 correct, for a final score of 75. So you get 100, and I get 75, on the average.

Were the test to be multiple choice, with four choices for each question, and your knowledge again 100 percent and mine half that, I would then (guessing at 50) get a score of 50 + (1/4 times 50), or 62.5, on the average.
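The averages in these easy-test examples follow one simple rule: questions known, plus the guessed questions divided by the number of choices. Here is a small Python sketch of that rule, assuming blind guessing on every unknown question. The function name is only for illustration.

```python
def expected_number_right(known, total, choices):
    """Average number-right score: known answers plus lucky blind guesses."""
    guessed = total - known
    return known + guessed / choices

# Easy test: you know all 100 answers, I know 50.
print(expected_number_right(100, 100, 2))  # you, True-False
print(expected_number_right(50, 100, 2))   # me, True-False
print(expected_number_right(50, 100, 4))   # me, four choices
```

The three lines print 100.0, 75.0 and 62.5, matching the figures worked out by hand.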

These extreme cases demonstrate the point, that truly hard multiple choice tests, graded by counting only the number right and ignoring guessing, are fraudulent.

But suppose the grading attempts to adjust for guessing. There is no way of knowing what is in the mind of the test-taker, so the customary approach is to subtract, from the number correct, some fraction of the number wrong.

For True-False exams, for example, the number subtracted might be (Number Wrong ÷ 2). Let’s see how that would work out for the sample case above. You, answering two questions from knowledge and guessing at 98, would be likely, on the average, to get 49 wrong, and so have a final score of 2 + 49 - (49 ÷ 2), or 26.5, while I, again on the average, answering only 1 from knowledge and guessing at 99, would get a final score of 1 + (99 ÷ 2) - ((99 ÷ 2) ÷ 2), which comes out to be 25.75. That penalty is too weak: our scores still differ by less than a point. For a guess to be worth nothing on the average, a True-False exam has to subtract the full Number Wrong. Then your score is 2 + 49 - 49, or 2, and mine is 1 + 49.5 - 49.5, or 1, which matches the two-fold difference in our actual knowledge.
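Arithmetic of this kind is easy to get wrong by hand, so here is a short Python sketch for checking it. It assumes a 100-question True-False test, blind guessing on every unknown question, and a fixed penalty per wrong answer. The function name and the penalty parameter are illustrative only.

```python
def expected_true_false_score(known, total=100, penalty=0.5):
    """Average corrected score on a True-False test.

    Half of the guessed answers are right on average and half are wrong;
    each wrong answer costs `penalty` points.
    """
    guessed = total - known
    right = known + guessed / 2
    wrong = guessed / 2
    return right - penalty * wrong

for penalty in (0.5, 1.0):
    you = expected_true_false_score(2, penalty=penalty)
    me = expected_true_false_score(1, penalty=penalty)
    print(f"penalty {penalty}: you {you}, me {me}")
# prints: penalty 0.5: you 26.5, me 25.75
# prints: penalty 1.0: you 2.0, me 1.0
```

Only the full one-point penalty makes a blind guess worth zero on the average, which is why it is the one that recovers the two-to-one ratio.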

The situation is only a bit more complex for multiple choice tests with four or five choices, and you can readily calculate the variation between the knowledgeable you, and the ignorant me. As an old math teacher might say, we leave that for the reader to work out by himself or herself.
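For readers who would rather check than derive, here is a minimal Python sketch of the multiple choice case. It assumes a correction of (Number Wrong ÷ (choices - 1)), which is an assumption of this sketch and not something stated above, and blind guessing on every unknown question. The function name is made up for illustration.

```python
def expected_corrected_score(known, total, choices):
    """Average score when each wrong answer costs 1/(choices - 1) point."""
    guessed = total - known
    right = known + guessed / choices
    wrong = guessed * (choices - 1) / choices
    return right - wrong / (choices - 1)

for choices in (2, 4, 5):
    you = expected_corrected_score(2, 100, choices)
    me = expected_corrected_score(1, 100, choices)
    print(choices, round(you, 6), round(me, 6))
```

Under that correction the average score comes out equal to the number of questions actually known, whatever the number of choices, so the hard-test example again gives about 2 for you and 1 for me.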

71 comments:

Ashwin Dixit said...: Here are a few ways the reasoning could be inaccurate:

It is difficult to quantify human knowledge a priori. How much one knows is an unknown quantity. The test tries to take a first, approximate, measurement of this unknown quantity. Unmeasured, one person's knowledge can't be compared to another. The statement -- "I know twice as much as you about this subject." is absurd when applied to humans.

The statement -- "I know twice as much as you about this subject." is meaningful when applied to AI's. Cyborg A could have 512 Terabytes of data on a given subject, whereas Cyborg B has only 256 Terabytes.

Even assuming that test-taker A "knows twice as much" as test-taker B -- the rates of success for their wild guesses are unlikely to be equal. There is such a thing as an "educated guess". Test-taker A's guesses should outpace test-taker B's guesses as a non-linear function of their "knowledge differential".

I enjoyed the article, and the meticulous reasoning.

Cheers,

Ashwin.

http://www.livenudejournal.com/; June 16, 2007 at 4:39 PM
Unknown said...: Two things:
2 + 49 - (49 ÷ 2)=26.5, not 75.5
Also, they normally subtract so that the expected value of guessing is 0. For a true-false test, they would subtract the number of wrong answers. In that case, the results would be 2 and 1.; June 16, 2007 at 11:29 PM
Unknown said...: "I know twice as much as you about this subject" doesn't need to be easy to measure for the reasoning to hold. The post asserts for rhetorical purposes that these two people exist. If they did, a proper test would be able to rank them with a high probability of correctness. The "hard" test ranks them incorrectly 45% of the time, so it is a fraud.

The statement is also a shorthand for a more complex one. Something like "given a pool of questions with average hardness=0.5, I will be able to answer twice as many questions with certainty as you. Both our scores (only counting "certain" answers) are directly proportional to the hardness." That is an oversimplification of real people, but if the test can't work in the simplest situation, it can't be depended on to work in more complex ones, either.

I don't think "educated guesses" affect the outcome that much, either. x probability-y answers are equivalent to z certain answers. (The exact relationship between x, y, and z will depend on the odds of a successful uneducated guess (the number of choices, in other words) and the penalty for an incorrect answer.); June 16, 2007 at 11:55 PM
Chris said...: The other thing is that leaving a question blank incurs no penalty, which the article does not address.; June 17, 2007 at 12:01 AM
Unknown said...: Let's say that the threshold for pass/fail is 90% of the questions right. If I pass with a 90%, and you know half as much as me, you'd get a 45%. Even giving you some credit for guessing, there would still be a wide margin between us.

Your case holds where a multiple-choice exam has a low percentage correct to pass, but there are many cases where it wouldn't.; June 17, 2007 at 12:08 AM
Roshan George said...: In multiple choice tests that I've needed to take until college, it's usually this way:
3 for a correct answer, -1 for a wrong one. And that's for choosing one out of 4.; June 17, 2007 at 12:34 AM
Piepmatz said...: I've had my share of multiple choice tests in a couple of universities here in Germany - in every one, you got points deducted for wrong answers. You couldn't get a negative score for any question, but it was hard enough: Usually, there were four or five possible answers, and you never knew how many were correct. It is pretty unforgiving, but guessing takes you nowhere.; June 17, 2007 at 12:41 AM
Unknown said...: I agree with what you're saying in theory but I don't agree that the Bar exam (at least in NY) is a good example of a "fraudulent" exam (except in the sense that it measures a body of knowledge that most attorneys will never use). There are very very strong correlations between class rank in law school and Bar exam passage rates which should not be the case given your hypothesis.; June 17, 2007 at 1:00 AM
Unknown said...: On the true-false test where a guy gets 2 answers right and guesses at the other 98, getting 49 right and 49 wrong, with a 0.5 point penalty for each, his score is 26.5, not 75.5.; June 17, 2007 at 1:33 AM
MSM said...: The math in the last paragraph is unfortunately wrong.

You, answering two questions correctly and guessing at 98 would be likely, on the average, to get 49 wrong, and so have a final score of 2 + 49 - (49 ÷ 2), or 75.5

As John correctly states this is not 75.5 but 26.5.

while I, again on the average. answering only 1 correctly and guessing at 97, would get a final score of 1 + (97 ÷ 2) - ((97 ÷ 2) ÷ 2)), which comes out to be 25.25

Well first of all 100 - 1 is 99. so you would be guessing 99 and not 97. So the correct result is

1 + (99 ÷ 2) - ((99 ÷ 2) ÷ 2))

which is 25.75.; June 17, 2007 at 1:36 AM
Unknown said...: Linux, what you've said is fully consistent with the post. Your example of a 90% threshold works because it's closer to the "easy" extreme than the "hard" extreme - meaning that the minimally qualified candidate is able to answer more questions than not. It is the "hard" tests with no penalty for incorrect answers that the post claims are fraudulent.; June 17, 2007 at 1:37 AM
J.C. said...: These kinds of tests are a scam but not for the reason you're stating. They are designed this way because evaluators are too lazy (or lack the time) to actually do an interview or to grade a test based on short-answer or short-essay questions. Those are more comfortable tests for the candidates, and they take into account not only knowledge but also creativity and verbal skills. I'm about to finish my PhD and I have always despised (and actually flunked my share of) the multiple selection question based exams.; June 17, 2007 at 1:49 AM
doctorfrog said...: In my military training school I was doing very well and decided to test a rumor I had heard, so I deliberately answered every question incorrectly. Lo and behold, I was awarded a score of 100 for my efforts! :); June 17, 2007 at 2:13 AM
874 said...: The calculation at the end seems wrong. The knowledgeable exam taker would know the answer to two questions, and correctly guess at 49. He would get half of 49 subtracted, which is 24.5. He'd get a total score of 2+49-24.5=26.5.

The ignorant taker would know the answer to one question, and correctly guess at (on average) 49.5. He'd get half of that subtracted, which is 24.75. His total score would be 1+49.5-24.75=25.75.

This makes for a total difference between the half-as/twice-as knowledgable test takers of 0.75 points on a 100 point scale.; June 17, 2007 at 2:17 AM
Unknown said...: I am sorry but I think that you are oversimplifying. As the number of options increases beyond true/false the chance of guessing your way through it drops dramatically. For example if there are four options and you have to be correct on 16 out of 20 the chances of guessing your way through are 1:2600000. In the equivalent test with just two options your chances would still be slim: 1:170.

Multiple choice tests do make sense no matter how difficult they are, but care must of course be taken to ensure that you cannot pass simply by guessing; June 17, 2007 at 2:28 AM
Unknown said...: multiple choice tests are a lot more insidious than this.

my mother was a sucker for educational toys and stuff, so I got dumped in everything that could be bought for reasonable money. One thing was a puzzle set that could be assembled in various ways. You got a chit of multiple choice questions on a certain topic (Q-No/puzzlepiece-No --> 4 placement numbers of that piece on a tableau) and if all questions were correct the pieces would all fit. This trigger / response training trains you to recognize the pattern of the proper answer, it does not necessarily train you about the right knowledge ;-)

I am still very good at MC tests even if i haven't the faintest ideas of what it is the questions are about.; June 17, 2007 at 2:43 AM
Unknown said...: In Holland, the calculation of multiple choice scores begins with subtracting the guess score.

We calculate the average "guess score", which will be graded a 0. Every correct answer above the minimum guess score would give you points, totalling to 10 for all answers correct. No penalty for false answers, need 6.0 to pass.

In the case of only two options, getting 50 correct out of 100 means a 0, and 0.2 points for each correct answer above. 75 correct would be a 5.0, 80 is a 6.0 (which you need to pass).; June 17, 2007 at 2:54 AM
brendan said...: perhaps spend a minute proofing for math before making a math argument.; June 17, 2007 at 3:16 AM
Frank Ch. Eigler said...: A "hard" test is not necessarily difficult because the expected number of correct answers is a tiny fraction - below the random. It could be because the answers take stressful effort to find, but test takers still manage to get most of them. By basing this posting on the former silly definition of "hard", the whole claim falls apart in a straw man.; June 17, 2007 at 3:16 AM
Unknown said...: If you aren't already aware of this, your article is on slashdot: http://science.slashdot.org/article.pl?sid=07/06/16/2248238&threshold=5; June 17, 2007 at 3:18 AM
Unknown said...: The 45% example is quite flawed.
Actually, I wouldn't consider it that much a fraud if a person that knows 1 more (2-1=1 for the mathematically inclined ;) ) answer is only 5% more likely to pass.
Both a) know nearly the same, so of course the probability that they will pass is nearly the same (why not?!), and b) will fail anyway, so it is not even an interesting case (even if you considered it unfair).

More interesting versions of the question are "Among _all_ of those passing, how many persons with more knowledge failed, respectively" if you do a pass/fail test or "Among all those participating, how many persons with less knowledge did rank better, respectively", if you do a ranking test.

If you form the mathematical equations for the probabilities, you will see that with an increasing number of questions, the numbers will become smaller and smaller.
I'm not sure how big they are for 100 questions, but I don't really want to calculate this now ;); June 17, 2007 at 3:29 AM
Unknown said...: Uh nonsense 5% more likely, of course it is twice as likely ;); June 17, 2007 at 3:34 AM
catprompt said...: what exactly do you mean by 'guess'?
a dart thrown at the phone book, or using some criteria to eliminate choices?

your model presumes that you either textbook know the answer or you don't have any idea, when in fact certainty of an answer is likely to be less well defined than that.

when i was in helicopter school (army) we were encouraged to look up answers in the relevant tech manuals, based on two ideas.

the first idea is that many people will try to end-run cheating rules, which if done successfully, make an impression on the cheater that seems to actually help them remember the answers.

the second, which really does make sense, is that knowing where to find the answer is only one step away from having the complete answer. that's a concept applicable to civilian knowledge as well.; June 17, 2007 at 3:56 AM
Unknown said...: I am sorry, but as a psychometrician (i.e. someone who writes multiple choice tests and interprets the results), I have to simply chime in with this:

We know. That's why we don't just count correct answers.

Any major test (GRE, LSAT, TOEFL, TOEIC, etc.) uses some kind of item response theory (IRT) to determine the score. This means that the final score is actually the person's ability, given their performance on the items, which are weighted differently (to put it VERY simply) according to people's performance on them. It doesn't matter what easy-to-read numbers the test gives you as your score; your REAL score is a number between 0 and 1. Sometimes that number is rescaled to the actual number of items that were on the instrument to give people the illusion of a classical MC test.

Another point is this: Remember when you took your SAT (I think it was)? They told you not to guess if you weren't sure about that answer, right? The reason for that is that with a really well-worn and robust test, the developers have been able to figure out who picks which distractors, and can therefore derive further meaning from whatever option you choose. So instead of a simple binary item (right or wrong), they can create a partial-credit item. Say "A" is the right answer, but people who are pretty smart seem to pick "B" a lot. So maybe the stats will assign a value of 0.5 for that one. Maybe "C" is just a throwaway distractor and doesn't mean anything other than you missed the question. But what if "D" turns out to really distract total morons? The stats might end up assigning a NEGATIVE value if you pick that. So read the test specifications before you take a big test. If they say not to guess, that's why. What you don't know can actually hurt your score more than just skipping it.

Look into the Rasch model and multi-parameter IRT. It's late and I actually need to develop some questions tonight (no kidding!), so I leave it to you and Wikipedia.

So to sum up: Basically, you are right about the problems with MC tests, but wrong about how much this affects people's lives.; June 17, 2007 at 4:06 AM
Unknown said...: I think you have an error in your math.

Suppose you have a multiple choice test with n choices, and let k, g, r and w be the numbers of known, guessed, right and wrong answers.

On average we have
r = k + g/n and w = (n-1)g/n

which gives
k = r - g/n = r - w/(n-1)

So for 2 choices you get k = r - w, and not k = r - w/2 as you write.; June 17, 2007 at 4:22 AM
Mal said...: This is a very insightful article once you make an effort to actually understand what the author tries to say: that making MC tests harder and harder renders them more and more worthless. The key is to accept the author's definition of "hard" and understand that he's making up an extreme case which need not apply to every MC test that is difficult.

Another unfortunate truth is that no test is ever fair. Oral and written exams (those where you write full-text answers) are not completely fair either.; June 17, 2007 at 4:24 AM
Mark Craig said...: With intelligence tests and also, I'll wager, the sort of qualifying exams to which you refer, the raw numeric score is never meaningful in and of itself; it is only meaningful when it is scaled relative to the scores of some or all of those who have previously taken those exams. In the case of intelligence tests, it's not the numeric score that demonstrates your abilities, it's the "percentile" into which that score falls on the human Bell Curve. It's your performance relative to the aggregate of all other humans that is meaningful, not the absolute raw score.; June 17, 2007 at 4:32 AM
Richard Strong Bowen said...: Your choice of the two only answering 1 or 2 questions correctly strongly skews this. On a 100 point, 4-choice exam where Alice knows 20 and Bob knows 10 problems, with no guessing penalty, run 1000 times, Alice gets a higher score 757 times and Bob 207. So, you've chosen your numbers to make the problem look bigger than it is.; June 17, 2007 at 4:37 AM
Unknown said...: what a laughable post!!; June 17, 2007 at 5:41 AM
Valentyn Bykov said...: I think the best solution to this problem was done in my math exam. There were always 5 answers to every question and you get 4 points for answering correctly and 1 point for not answering. So if I'm guessing, I will on average get 4 * 0.2 (the probability of a successful guess) = 0.8 points. This is 0.2 points less than if I would choose not to answer at all.

I like this system and I think it does a pretty good job eliminating guessing...; June 17, 2007 at 5:52 AM
Najiib Azad said...: They test luck too... since luck is useful in the medical and law field. Almost every biologist I've talked to believes in luck.; June 17, 2007 at 6:28 AM
Algebraic said...: You're confusing two things: percentage of people who pass the test, and percentage of correct answers required to pass the test.

To make very few people pass, you set a score cutoff at the mean plus a standard deviation or two. Note that the mean can be almost anything, and defines the proportion of correct answers required to pass. Except at the extreme boundaries, the mean and the variance need not correlate at all.

Your "fallacy" is fallacious unless you can show that these MC tests have a low *correct response rate*; the *pass rate* is irrelevant to your argument.

linux said something like this before, but it needs to be said again.; June 17, 2007 at 6:30 AM
Chris said...: Hogwash; June 17, 2007 at 6:59 AM
Unknown said...: Algebraic said it more succinctly than I could have, but here was my response:

http://science.slashdot.org/comments.pl?sid=238713&cid=19540721; June 17, 2007 at 7:31 AM
Robbie said...: As someone who is trained in psychometrics I'd like to add to the psychometrician's comment above that the primary measure of the effectiveness of a test is its predictive validity. If the analysis of the performance on these exams effectively sort out those who will do well in the career from those who will not then it is not fraudulent and is indeed valid.; June 17, 2007 at 7:40 AM
Justin said...: You're also treating the test as if the questions are of uniform difficulty. I don't have time to do any sample calculations, but that will pretty obviously change the scenarios you were envisioning (so that differences in knowledge have a more substantial effect on the outcomes).; June 17, 2007 at 8:07 AM
Unknown said...: If we think of human knowledge on a particular subject as a one-sided continuum from 0 to infinity, and represent individual knowledge using a negative exponential probability function where the integral of said function represents the probability of knowing the answer to question in a particular difficulty range, and adjust said function so that someone with "twice as much knowledge" has twice as much area under their knowledge function, then we get a much different story.

For problems within any given difficulty region, there exists a ratio of the probability that the smarter person knows the answer to the probability that the dumber person knows the answer. What we see is that this ratio grows as the difficulty increases. Thus, if you ask very easy questions ("what color is the sky?") the probability ratio is nearly 1:1 - that is, both people will likely get the answer, and no difference will be shown. However, if a difficult question is asked ("what wavelengths bend around earth's atmosphere to create the blue sky") the ratio is much different. Asking difficult questions yields a much greater probability of showing difference than asking simple questions.; June 17, 2007 at 8:53 AM
Unknown said...: A test is an attempt to estimate someone's ability, and a true-false (or multiple choice) test is an attempt to create an objective test, which can be shown not to have test-taker bias. The answers are graded simply on the basis of whether one has correctly selected the pre-chosen "right" answer for that particular question. A true test to determine someone's actual ability would be an essay, where the person can describe their answer and thus someone can get a "correct" response even if they might not answer a fixed-choice test correctly. The way a question is written can have an effect on the answer given; the biases in creating surveys have well established this point. If the question is poorly written or the way the answer is structured is ambiguous, it is entirely possible that one can know the correct answer but give an incorrect response.

The problem with an essay-style question is that grading the test brings in the biases of the person who does the grading. It's also more labor intensive than a fixed-choice test (which can be administered either wholly or partially by machine, or the test can be graded by anyone with the answer sheet and the test-taker's responses); grading an essay means the person doing the grading has to actually know the subject.

If the intent is to have, again, objective tests that won't have biases or allow automated testing, one answer is to have more comprehensive testing on a wider scale where the answers (and responses to the answers) are clear and unambiguous so that the questions can be answered by someone who knows the particular subject, but is otherwise of normal intelligence. This also means that the people who write the tests (and the responses) need to not only know the subject, but need to be extremely proficient in the English Language.
--
Paul Robinson — My Blog
"The lessons of history teach us - if they teach us anything - that nobody learns the lessons that history teaches us."; June 17, 2007 at 9:12 AM
Bryan Seigneur said...: Al Feldzaman said:

You, answering two questions correctly and guessing at 98 would be likely, on the average, to get 49 wrong, and so have a final score of 2 + 49 - (49 ÷ 2), or 75.5,

Wait a minute, you have to explain this some more. 2 + 49 - (49/2) = 26.5.

while I, again on the average. answering only 1 correctly and guessing at 97, would get a final score of 1 + (97 ÷ 2) - ((97 ÷ 2) ÷ 2)), which comes out to be 25.25.

Huh? Interpolating from above, I thought the equation would be 1 + 48.5 - (48.5/2) = 25.25. That supports your thesis, but you wrote it totally wrong.; June 17, 2007 at 10:00 AM
BetterSecurityTools said...: One important point not mentioned so far is the presence of multiple-choice questions that include answers "none of these" and/or "all of these".

It is far too easy to find some logical or semantic problem with each of the answers, making "none of these" the most-correct answer.

How many MC questions and presented answers can withstand thorough scrutiny to insure the lack of any logical or semantic shortcomings?

Any MC exam with "none" or "all" as choices is more than fraudulent -- it penalizes the test taker who understands logic and contextual expression better than the person who wrote the exam.

My personal experience is that exams explicitly screened for these defects are by far the most frequent.; June 17, 2007 at 11:57 AM
tmc said...: Wow, Al. Doctor, specialist, lawyer, and you still trip over simple arithmetic. You're a real Renaissance twit.

But that could happen to anybody. A couple significant things point to a complete lack of (or failure of, anyway) mathematical training, though. One, you'd naturally reduce to the trivial case, "so hard we know 0 answers."

Second: you've spent considerable effort to show rewarding random guessing is bad test design. Most high school freshmen would find that obvious.

Don't take this personally. If you aren't tarred and feathered, it makes real mathematicians look bad.; June 17, 2007 at 1:37 PM
frnksntn said...: If you only know enough to be able to answer two percent of the test, you do not deserve to pass it.

If the number of questions one person can answer more than another is only about 1%, that is not a substantial difference, and the victor will be understandably random.

Is this why you are a FORMER mathematician?; June 17, 2007 at 3:17 PM
Unknown said...: Dear Al,

It is true that too-hard exams are problematic, and that passing them is influenced more by chance than by ability or knowledge. But the reason is not multiple choice, or counting only the right answers. The reason is that with such a hard test, a weaker student could by chance know more questions in the test than a stronger student, because even a strong student can only expect to answer about one question in the exam. The solution for this is to use exams where the good students can answer many questions, say 50% of the questions. I'll post a worked-out example in the next comment.; June 17, 2007 at 4:29 PM
Unknown said...: Here is a worked-out example why hard tests are bad.

Let's imagine an ideal situation: when I see a multiple choice question, I know for sure whether I know the answer or not. In case I don't know the answer, the multiple answers don't give me any clue, so I cannot do better than guess. Now suppose that wrong answers are heavily penalized, so it's not worth guessing. Under these assumptions, everyone will only answer the questions they really know, and won't guess at all. We will now see that even in this case chance plays a role.

Now suppose there are 100 questions on the test, but they are chosen from a very large pool of questions. Consider two people - one has a chance of 1 in a 100 to answer a random question in the pool, while the other has a 1 in 200 chance.

If the 100 questions in the actual test are chosen randomly in the pool, then the probability that the 1 in 200 guy will do at least as well as the 1 in 100 guy is more than 25%.; June 17, 2007 at 4:30 PM
Jeremy said...: (Formatting was screwed up, so I'm reposting.)

1) You need to read up on 3-parameter logistic scoring models. They do, in fact, estimate the effect of guessing (called pseudochance) on final scores.

2) Many medical exams, particularly the simulated steps exams, are scored on a Bayesian inference network, which do not suffer from your simpleton self-reasoned arguments.

3) You, a non-specialist, stating these “certainties” is like me telling my oncologist how I *know* I have severely advanced lymphoma. I may be right, but if I am, it’s a lucky guess.; June 17, 2007 at 5:40 PM
BJ said...: I write multiple choice questions for tertiary entrance for a living. All questions in high-stakes tests are trialled before being used as scoring items. This is the case with most companies that develop tests. We get statistics that correlate the abilities of the people who chose each option in every question with how they went over all. Where there is a positive correlation with, say, the 25% who chose the 'wrong' option and the top 25% of the candidates over all, that question is dropped. Where there is a negative correlation between the 40% who chose the 'right' option with the top 40% over all, that question is dropped.

While it is true that tests developed without trialling and statistical checking are (usually) flaky, most professional tests have their results heavily scrutinised, both before the question becomes a scoring item, and after.

If a question is too hard, then what happens is that roughly equal proportions of the cohort choose each option, and there is no correlation between any of those population fractions and that percentile at the top of the range. Even if the question is a good one, if it is too hard for the population then the stats produced for it demonstrate that most are guessing, therefore the question does not rank accurately, therefore the question is discarded and does not make it out of the trialling phase.

The stats for all questions keep being monitored even if they make it through trialling.

Regardless of the fact that the questions themselves are developed by a team of experts in each relevant field, argued over for a period of a year prior to even being trialled (are all the false options truly false, is the correct option exactly right?); regardless of this or how sure we may be that the question is a good one, if the stats show us that it is not working for the purpose it was designed for (to accurately and reliably rank a given population according to their ability level on a given variable) then the question is discarded.

I can't paste in an example of the stats we get without risking my job, but a team of specialist statisticians and mathematicians provides us with an incredible amount of detail on the results of every single option - questions that are too hard to be accurate in discriminating do not get used.

Bottom line: while you are correct about the problems with questions that are too hard, professionally developed tests are written by people who are not unaware of this problem.; June 17, 2007 at 6:00 PM
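The kind of check BJ describes can be sketched roughly as follows: correlate "picked this option" with each candidate's overall score, then flag an item when the keyed answer fails to attract strong candidates or a distractor does attract them. The data, the `KEY` value, and the drop rules below are hypothetical illustrations, not BJ's employer's actual statistics or thresholds.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def option_discrimination(choices, totals, option):
    """Correlate 'picked this option' (0/1) with each candidate's overall score."""
    picked = [1 if c == option else 0 for c in choices]
    return pearson(picked, totals)

# Hypothetical trial data: each candidate's choice on one item and overall score.
choices = ["A", "A", "B", "A", "C", "B", "A", "D"]
totals = [92, 88, 41, 79, 35, 50, 85, 30]
KEY = "A"  # the intended correct option (assumed)

for opt in "ABCD":
    r = option_discrimination(choices, totals, opt)
    verdict = "ok"
    if opt == KEY and r <= 0:
        verdict = "drop: key does not attract strong candidates"
    elif opt != KEY and r > 0:
        verdict = "drop: distractor attracts strong candidates"
    print(opt, round(r, 2), verdict)
```

A question that is too hard would show correlations near zero for every option, which is the signal BJ says gets it discarded during trialling.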
chispita said...: I study an Engineering degree in Spain, and the way we deal with multiple choice is slightly different. First of all, there are no exams that are all multiple choice (OK, there is one, but it's a special case and I've never understood the way they grade it as it is waaaay too complex). What we usually get is that the multiple choice (if there is one) is just part of the exam, that counts somewhere between 10-40%.

The thing is, as engineering is mostly solving problems, it is in the multiple choice that we get tested on theoretical stuff, so it is more complicated (at least for me, as I'd rather do two pages of calculations than memorize stuff). Also, we do get negative scores. For each question in the multiple choice part getting it right is +1, and getting it wrong is either -1 or -0.5 (depends on the teachers). So guessing is not a good idea unless you have four options, know what the question is about, have been able to eliminate two of them, and just can't decide between two others which are very very similar.

This has made me *hate* multiple choice. Because the way they do it makes it difficult. Unless, of course, you happen to get lucky and have a lazy professor who just "recycles" most of the questions from other years' exams (we do have access to corrected exams from previous years, something which I think does not happen in the US).; June 17, 2007 at 8:39 PM
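The arithmetic behind chispita's rule of thumb can be checked with a short sketch. It assumes the marks quoted in the comment (+1 for a right answer, -1 or -0.5 for a wrong one) and a uniform random pick among whatever options have not been eliminated; the function name is illustrative.

```python
def guess_ev(options_left: int, penalty: float, reward: float = 1.0) -> float:
    """Expected marks from picking uniformly among the options not yet eliminated."""
    p_right = 1.0 / options_left
    return p_right * reward - (1.0 - p_right) * penalty

for penalty in (1.0, 0.5):
    for options_left in (4, 3, 2):
        print(f"penalty={penalty}, options left={options_left}: "
              f"EV={guess_ev(options_left, penalty):+.3f}")
```

Under these assumptions a blind guess among four options loses marks on average with either penalty, while a coin flip between the last two options breaks even at -1 and gains on average at -0.5.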
THE chimp said...: If A knows twice as much as B, and A gets a perfect score, it doesn't mean B will get 50. Nothing prevents B from exceeding the 50 limit. In fact, if the test is really easy (as you presumed), B can get up to 100 as well, while A can't get any higher due to the ceiling effect. Now do you still think easy tests are always better than hard tests?; June 17, 2007 at 8:46 PM
Stopher said...: How about if these tests measured competency in the given field rather than some arbitrary score vs. other people in a non real world situation. I don't care what percentile my doctor or lawyer scored on the scan-tron. I want them to be good at sewing me up or making a legal argument, not filling in circles with a number 2 pencil.; June 17, 2007 at 9:13 PM
Anonymous said...: Your reasoning is about as convoluted as a heterosexual moving his family to the Castro district of San Francisco.

Mitch Haase; June 17, 2007 at 10:38 PM
Dave said...: on June 17, 2007 2:17 AM Daniel said...
"The calculation at the end seems wrong. The knowledgeable exam taker would know the answer to two questions, and correctly guess at 49. He would get half of 49 subtracted, which is 24.5. He'd get a total score of 2+49-24.5=26.5."

Maybe most everyone here should take a step back and think again from first principles. In a multiple choice test a correct guess is indistinguishable from an answer based on knowledge - that is one of the fundamental flaws - so both will score equally. Forget arithmetic (looks like many here have, so maybe that's a redundant request). Another fundamental flaw is that there is only ever one "right answer" and that answer is constrained by the knowledge and attention brought to bear by the exam setters. A candidate who knows more (or better) will necessarily lose marks. The classic "intelligence test" question "what is the next number in this sequence?" shows this up - the true answer is "any number you choose, depending on the complexity of the sequence generation algorithm", but that is never given as an alternative, so someone who really understands mathematics will score low.

The real reason for multiple choice sweeping the board is that the papers can be machine marked, which saves effort and allows droves of people to be "processed" through "qualifications". The most important result is the rise of mediocrity, as individual brilliance is impossible to demonstrate using multiple choice tests. Remember that one meaning of "qualify" is "to limit or restrict".; June 17, 2007 at 11:17 PM
Coyote said...: So what we're discussing here is what chance a person has to succeed at a test when they don't actually know the material, because they are guessing.

Indeed, that should be an issue, because you do not want unqualified people to pass tests by guessing instead of knowing the material.

But if you do know the material, you are more likely to succeed on the test.; June 18, 2007 at 12:24 AM
James said...: Dave includes an important point in mentioning the next number in a sequence problem.

The usual problem (not just in multiple choice) for me is the poor phrasing of the question. If you are a person who reads precisely what is written, you can often find you don't actually know what they are asking and have to interpret it.

Others who have trained to pass the exam rather than understand the subject can usually match the question to the trained answer rather than draw on the knowledge they will need to apply in the real world.

It's a problem I suffered a couple of weeks ago doing a law exam. So yes, sometimes knowing less is an advantage.; June 18, 2007 at 4:06 AM
Anonymous said...: About 35 years ago I walked in off the street after a 3 day aviation seminar covering the written exam for Airline Transport Pilot, and was signed off as qualified to test. I took the FAA examination for Airline Transport Pilot and passed with a score of 92%.

At the time I was studying to become a professional pilot and was highly motivated and had not yet passed the Private Pilot written or flight exams. I think I had a total of around 8-9 hours of dual flight time under my belt.

Anyway, the FAA refused to give me my test scores because they said I wasn't a pilot, and only qualified pilots were allowed to take the ATP exam. Well, I demanded the test scores and finally did receive them although it took 3 months.

It was not my intention to buck the system; (Lord Knows I do it every time I can) but only to test myself. The FAA went ballistic and said I was not qualified to operate a large transport aircraft ...NO SHIT HOLMES!

Since then, after setting this precedent and causing the FAA Apoplexy, the FAA quickly modified and rewrote the testing requirements and regulations covering Part 121 Airline flight operations as well as the experience requirements, and now you can't just walk in and take the ATP test with ZIP experience.

It's like every profession, Law, Medicine, garbage collection, etc. Passing tests does not make you qualified! After a few years of commercial air taxi flying in small jets, I finally figured it out and decided NOT to become an Airborne Bus Driver (airline pilot) flying a "milk run" for some soon to be deceased airline, and instead went into Part 135 charter, and corporate operations. I never bothered to take the ATP flight exam and settled on a Commercial Pilot certificate with Instrument and Multiengine ratings, and a few type ratings in large turbine aircraft like the B-727-200 and DC-9 30-61 series.

Tests only confirm you MAY be able to do this stuff but in no way are you qualified to do anything serious with a piece of paper with fresh ink on it. Now I am retired from the aviation circus and wouldn't do it today on a dare. I flew for 25 years with no violations or accidents on my record and accumulated over 12,000 hours of Pilot in command time. It was more of an adventure never knowing where you were going that day or what perils you would face regarding weather and other issues.

It's a given that there are MANY incompetent Physicians and lawyers out there screwing up big time, and even more lawyers to get you out of trouble; but when you screw up in an airplane the results can be headline material. One has to remember that the pilot is usually the first at the scene of the accident!

I guess that's why it's called practicing medicine or law. One does not in general practice aviation; one demands perfection of oneself, as your screw-ups are all too evident!

I never received a high school diploma only a GED in the Army, but I was saddled with a 163 IQ at age 18 which more often than not caused me trouble in life.

Believe me, it's difficult being smarter than your employer and having a colossal ego to match!; June 18, 2007 at 6:00 AM
csours said...: The comments on this post reflect poorly on 'experts' social skills.; June 18, 2007 at 7:43 AM
Kurt Schroeder said...: The best tests I ever had were in a literature class I had in college. Each test was about 5 questions. The questions on each bit of required reading were straightforward, and if you had read and understood the material, you would pass easily. If you had not read, you were doomed to fail. I got a lot of 100%s and a lot of 0%s but not much in between. Thank you Roger Lips!; June 18, 2007 at 7:49 AM
JGPC said...: Not to be cynical, but you said:
"Governor Patakis daughter"

Please, for the sake of not being fallacious, let's get his title correct: Former Gov. George Pataki.; June 18, 2007 at 7:58 AM
Unknown said...: Check out:
http://www.brics.dk/~mis/multiple.pdf

A couple of researchers at DAIMI in Aarhus (Computer Science, Aarhus, Denmark) made a type of multiple-choice test where the expected score of guessing was 0.
The link describes the method.

/Søren; June 18, 2007 at 8:38 AM
Konstantin Augemberg said...: Most licensing tests are developed and tuned by psychometricians, who take into consideration the "guessing parameter" and other characteristics. As a mathematician, you must have heard about Item Response Theory (IRT) and modified Rasch models, so you should know what I mean. A measure of the latent trait (in this case, knowledge of the subject, or ability) is obtained after taking into consideration the difficulty of the items (test questions) and the probability of guessing.

Konstantin Augemberg; June 18, 2007 at 9:33 AM
Stephen Davies said...: Ok so should the question really be based around "is MC testing really the way to test or certify a professional full stop?".

I acknowledge that knowledge based testing (e.g. MC) is a component of testing, but too many certifications today are based solely on MC alone. Even well executed MC tests, developed by professional psychometricians (I am particularly thinking of the great tests that the American Society for Quality does) don't give 10% of how a person actually performs in the workplace - isn't this what industry really needs?

So how do you manage to gauge a person's competency (the application of knowledge)? MC tests alone certainly can't do it, but this seems to be the norm.

Am trying to get some feedback on this topic at:
http://personnelcertification.blogspot.com/2007/06/multiple-choice-testing-fraud.html; June 18, 2007 at 12:15 PM
Skaarj Kaag said...: Law schools control the supply of lawyers at the back end (after law school; the bar exam is hard). Medical schools control the supply of physicians and surgeons at the front end (before medical school; getting in is hard).

To get right to the point: would you prefer that we test surgeons by having them open up live patients unassisted?; June 18, 2007 at 12:26 PM
Alan said...: Not all multiple choice exams are this simplistic. I recently took the GMAT, which is an 'adaptive' multiple choice computer-based exam. It asks harder questions if you're doing well, and easier if you're getting them wrong. The difficulty of the questions is statistically measured based on previous tests.; June 18, 2007 at 2:30 PM
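The idea Alan describes (harder questions after right answers, easier ones after wrong answers) can be illustrated with a toy up/down rule. This is a deliberately simplified sketch: the one-step staircase, the 1 to 10 difficulty scale, and the function name are all assumptions for illustration, and it is not the GMAT's actual algorithm, which relies on statistically calibrated item difficulties.

```python
def run_staircase(answers, start=5, lowest=1, highest=10):
    """Return the difficulty level presented for each question.

    `answers` is a sequence of booleans: True if the candidate got
    that question right. Difficulty moves one step up after a correct
    answer and one step down after a wrong one, clamped to the range.
    """
    level = start
    presented = []
    for correct in answers:
        presented.append(level)
        step = 1 if correct else -1
        level = max(lowest, min(highest, level + step))
    return presented

# Hypothetical candidate: three right, then two wrong, then one right.
print(run_staircase([True, True, True, False, False, True]))  # [5, 6, 7, 8, 7, 6]
```

Even in this toy version, the sequence of difficulties settles around the level where the candidate starts missing questions, which is what makes the final score depend on more than a raw count of right answers.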
jeffeb3 said...: at any rate, you'd both fail your test if you could only guess and get 52 answers right. Saying 35% of the people passed is not as useful as saying how many points you need to pass. If you need 70% to pass, and it's T/F then you'd theoretically need to know 40 answers right on. Your more studied friend would get a 90.

The system isn't flawed, it's thinking that it's a linear scale that's flawed.; June 18, 2007 at 4:53 PM
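jeffeb3's figures follow from a simple expected-value formula: a candidate who knows `known` answers outright and guesses blindly on the rest expects `known + (total - known) / options` right. The sketch below assumes a 100-question True/False test graded by number right, as in the original post; the function names are illustrative.

```python
def expected_score(known: int, total: int = 100, options: int = 2) -> float:
    """Expected number right when `known` answers are certain and the rest are blind guesses."""
    return known + (total - known) / options

def known_needed(target: float, total: int = 100, options: int = 2) -> float:
    """Invert expected_score: how many answers must be known to expect `target` right."""
    return (target * options - total) / (options - 1)

print(expected_score(40))   # 70.0
print(expected_score(80))   # 90.0
print(known_needed(70))     # 40.0
```

The mapping from knowledge to score is linear but compressed: doubling what you know from 40 to 80 answers moves the expected score only from 70 to 90.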
Chris Stiehl said...: The problem is not the difficulty of the test, it is the way it is being graded. The best way to do this, developed in the late 1940s at the University of Michigan, is to ask the test taker to mark only the answers he knows are WRONG, getting +1 points each time he is correct, whether True/False or multiple choice. If a CORRECT answer is rejected, he gets penalized, -3 points in a multiple choice test. With this system, there is a NEGATIVE expected value for guessing (you're trying to get +1 with a correct rejection and risking -3 for an incorrect rejection), you can get partial credit for partial knowledge (if you know one answer is wrong, but you're not sure of anything else, you can get +1 for that and leave the rest blank). Your best strategy when graded this way is to indicate what you know to be true and only that. Isn't that the purpose of the test...to measure what you know? You get punished for incorrect guesses. If you guess on every question, you will get a score near zero (a few lucky guesses that get plus points, a lot of bad guesses that get negative points), indicating you know nothing - again, the test measures correctly.; June 20, 2007 at 2:37 PM
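The scheme Chris Stiehl describes can be checked numerically. The sketch below uses the marks from the comment (+1 for rejecting a wrong option, -3 for rejecting the correct one) and assumes four options per question, which the comment does not state; the function names are illustrative.

```python
def blind_rejection_ev(options: int, reward: float = 1.0, penalty: float = 3.0) -> float:
    """Expected marks from rejecting one option chosen uniformly at random."""
    p_hits_correct = 1.0 / options  # chance the rejected option was the right answer
    return (1.0 - p_hits_correct) * reward - p_hits_correct * penalty

def informed_score(wrong_options_known: int, reward: float = 1.0) -> float:
    """Marks for rejecting only options the candidate is sure are wrong."""
    return wrong_options_known * reward

print(blind_rejection_ev(options=4))   # 0.0
print(informed_score(1))               # 1.0
```

With four options and a -3 penalty, a blind rejection averages out to zero, which matches the "score near zero" for someone who guesses throughout. The expectation becomes strictly negative only when the penalty exceeds the number of wrong options, so the exact figure depends on how many options each question has.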
Unknown said...: Phooey. The entire concept of multiple choice tests is wrong. What are we trying to accomplish by testing in this way? Mastery of short answers is, in most of life's endeavors, not a skill that determines actual worth or worthiness. Multiple choice tests are given because it's easier on the test givers for scoring. It's easier to quickly (and wrongly) pigeon-hole people in this manner.

Oral and essay exams combined with some sort of experience-based practicum are a much better method for sorting the wheat from the chaff.; June 20, 2007 at 7:02 PM
Stephen Davies said...: Quoting James (NV)

Law schools control the supply of lawyers at the back end (after law school; the bar exam is hard). Medical schools control the supply of physicians and surgeons at the front end (before medical school; getting in is hard).

To get right to the point: would you prefer that we test surgeons by having them open up live patients unassisted?

... no, but I would like a surgeon who has some demonstrated skill with a scalpel. They might know Gray's Anatomy inside and out (tested by MC!) but if they can't cut in a straight line then I am worried!!

My point was that knowledge is only a (small) part of the picture, ack that it is critical but certifications NEED to be about competency not knowledge.

I like your thoughts on when people get filtered (before or after). Both have their pros and cons; what are your thoughts on the best approach? Why do the different professions do it different ways?; June 21, 2007 at 8:33 AM

Unexpected Truths

71 comments:
