Construct Validity

A construct, in science, is (informally) a pattern or a means of categorizing something you’re talking about, especially when the thing you’re talking about is abstract.

Constructs are really important in both qualitative and quantitative research, because they allow us to differentiate between “one of these” and “not one of these”, which is one of the first steps in measurement and analysis. If you want to describe something or count it such that other people find you credible, you’ll need to describe the difference between “one” and “not-one” in a way that’s valid. (“Valid” here means that you’ve provided descriptions, explanations, or measurements for your categorization scheme while managing or ruling out alternatives, such that other people are prepared to accept your construct, and your definition can withstand challenges successfully.)

If you’re familiar with object-oriented programming, you might think of a construct as being like a class, in that objects have an “is a” relationship to a class. In an object-oriented program, things tend to be pretty tidy; an object is either a member of a certain class or it isn’t. For example, in Ruby, an object will respond to a query of the kind_of?() method with a binary true or false. In the world, not under the control of nice, neat models developed by programmers armed with digital computers, things are more messy.
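To make the tidy version concrete, here is a minimal Ruby sketch. The class names (Vehicle, Car, Bicycle, Pedestrian) are hypothetical and invented purely for illustration; only kind_of? is real Ruby.

```ruby
# Hypothetical classes, invented for illustration only.
class Vehicle; end
class Car < Vehicle; end
class Bicycle < Vehicle; end   # an assumption: somebody had to decide this
class Pedestrian; end

# kind_of? answers the "is a" question with a plain true or false.
puts Car.new.kind_of?(Vehicle)         # true
puts Bicycle.new.kind_of?(Vehicle)     # true, but only because of the line above
puts Pedestrian.new.kind_of?(Vehicle)  # false
```

Notice that the program never hesitates over the bicycle. That is not because the question is easy; it is because whoever wrote the class hierarchy already answered it.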

Suppose that someone asks you to identify vehicles and pedestrians passing by a little booth that he’s set up. It seems pretty obvious that you’d count cars and trucks without asking him for clarification. However, what about bicycles? Tricycles? A motor scooter? An electric motor scooter? If a unicyclist goes by, do we count him? A skateboarder? A pickup truck towing a wagon with two ATVs in it? A recreational vehicle towing a car? An ATV? A tractor, pulling a wagon? A diesel truck pulling a trailer? How do you count a tow-truck, towing another vehicle, with the other vehicle’s driver riding in the tow truck? As one vehicle or two? A bus? A car transporter (a truck with nine vehicles on it)? Who cares, you ask?

Well, the booth is at the entrance to a ferry boat, and the fee is $60 per vehicle, $5 per passenger, and $10 for pedestrians. Lots of people (especially those self-righteous cyclists; relax, I’m one of them too) will gripe if they’re charged sixty bucks. Yet where I live, a bicycle is considered a vehicle under the Highway Traffic Act, which would suit the ferry owner who wants to maximize the haul of cash. He’d especially like to see $600 from the car transporter. So in regular life, categorization schemes count, and the method for determining what fits into what category counts too.
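Here is a rough sketch of how much the sorting rule matters to the cash box. The fees come from the paragraph above; the traffic sample, the two fee rules, and their names are hypothetical, and passenger fees are left out to keep the arithmetic short.

```ruby
VEHICLE_FEE = 60     # dollars per vehicle, from the fee schedule above
PEDESTRIAN_FEE = 10  # dollars per pedestrian, from the fee schedule above

# Hypothetical traffic sample. :carried is the number of other vehicles
# loaded onto this one.
TRAFFIC = [
  { kind: :car,             carried: 0 },
  { kind: :bicycle,         carried: 0 },
  { kind: :car_transporter, carried: 9 },
  { kind: :pedestrian,      carried: 0 }
].freeze

# The owner's reading (an assumption): a bicycle is a vehicle, and every
# vehicle riding on a truck is charged as a vehicle too.
def owner_fee(item)
  return PEDESTRIAN_FEE if item[:kind] == :pedestrian
  VEHICLE_FEE * (1 + item[:carried])
end

# The cyclist's reading (also an assumption): a bicycle pays like a
# pedestrian, and a loaded truck is a single vehicle.
def cyclist_fee(item)
  return PEDESTRIAN_FEE if [:pedestrian, :bicycle].include?(item[:kind])
  VEHICLE_FEE
end

puts TRAFFIC.sum { |item| owner_fee(item) }    # 60 + 60 + 600 + 10 = 730
puts TRAFFIC.sum { |item| cyclist_fee(item) }  # 60 + 10 + 60 + 10 = 140
```

Same four arrivals at the booth, same fee schedule, and the day's take differs by a factor of more than five depending on what counts as "one vehicle".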

How many vehicles?

If the problem is tricky for physical things (widgets), it’s super-tricky for abstractions in science that pertains to humans. You’ve decided to study the effect of a new medicine, and you want to try it out on healthy people to check for possible side effects. What is a healthy person? Health is an abstraction; a construct. If someone is in terrific shape but happens to have a cold today, does that person count as healthy? Over the last few summers, I’ve met a kid who’s a friend of a friend. He’s fit, strong, capable, active… and he does kidney dialysis every couple of days or so. Healthy? A transplant patient who is in great shape, but who needs a daily dose of anti-rejection drugs: healthy?

If your country gives extra points to potential immigrants who are bilingual (as mine does), what level of fluency constitutes competence in a language to the degree that you can decide, “bilingual or not”? Note that I’m not referring to a test of whether someone is bilingual or not; I’m talking about the criteria that we’re going to test for; our sorting rules. Economists talk about “the economy” growing; what constitutes “the economy”? People speak of “events”; when airplanes hit the World Trade Center, was that one event or two? Who cares? Property owners and insurance companies cared very deeply indeed.

Construct validity is important in the “hard” physical sciences. “Temperature” is a construct. “To discuss the validity of a thermometer reading, a physical theory is necessary. The theory must posit not only that mercury expands linearly with temperature, but that water in fact boils at 100°. With such a theory, a thermometer that reads 82° when the water breaks into a boil can be reckoned inaccurate. Yet if the theory asserts that water boils at different temperatures under different ambient pressures, the same measurement may be valid under different circumstances, say at one half an atmosphere.” (Kirk and Miller, Reliability and Validity in Qualitative Research) Atmospheric pressure varies from day to day, from hour to hour. So what is the temperature outside your window right now? The “correct” answer is surprisingly hard to decide.
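Kirk and Miller's 82° is not an arbitrary number. As a rough check, here is a sketch that estimates the boiling point of water from ambient pressure using the Antoine equation. The constants are commonly cited values for water between roughly 1 and 100 °C, with pressure in mmHg; treat them, and the method name, as assumptions rather than anything taken from the book.

```ruby
# Antoine equation: log10(P) = A - B / (C + T)
# P in mmHg, T in degrees Celsius. Constants are commonly cited values
# for water (an assumption, not from Kirk and Miller).
ANTOINE_A = 8.07131
ANTOINE_B = 1730.63
ANTOINE_C = 233.426
MMHG_PER_ATM = 760.0

# Solve the Antoine equation for T at a given ambient pressure.
def boiling_point_celsius(pressure_atm)
  pressure_mmhg = pressure_atm * MMHG_PER_ATM
  ANTOINE_B / (ANTOINE_A - Math.log10(pressure_mmhg)) - ANTOINE_C
end

puts boiling_point_celsius(1.0).round(1)  # close to 100
puts boiling_point_celsius(0.5).round(1)  # close to 82
```

Under the richer theory, the thermometer that reads 82° at the boil is doing its job perfectly well at half an atmosphere. Whether the reading is "valid" depends on which theory you bring to it.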

In the “soft” social sciences and qualitative research, the measurement problem is even harder. Kirk and Miller go on, “In the case of qualitative observations, the issue of validity is not a matter of methodological hairsplitting about the fifth decimal point, but a question of whether the researcher sees what he or she thinks he or she sees.” (Kirk and Miller, Reliability and Validity in Qualitative Research)

When we come to the field of software development, there are certain constructs that people bandy about as though they were widgets, instead of idea-stuff: requirements; defects; test cases; tests; fixes; discoveries. What is a “programmer”? What is a “tester”? Is a programmer who spends a couple of days writing a test framework a programmer or a tester? Questions like these raise problems for anyone who wants a quantitative answer to the question, “How many testers per developer?” Kaner, Hendrickson, and Smith-Brock go into extensive detail on the subject. I’ve written about what counts before, too.

There’s a terrible difficulty in our craft: those who seem most eager to measure things seem not to pay very much attention to the problem of construct validity, as Cem Kaner and Walter P. Bond point out in their landmark paper, “Software Engineering Metrics: What Do They Measure and How Do We Know?” (I’m usually loath to say “All testers should do X”, but I think anyone serious about measurement in software development should read this paper. It’s not hard. Do it now. I’ll wait.)

If you’re doing research into software development, how do you define, describe, and justify your notion of “defects” such that you count all the things that are defects, and leave out all the things that aren’t defects, and such that your readers agree? If you’re getting reports and aggregating data from the field, how do you make sure that other people are counting the same way as you are? Does “defect” have the same meaning in a game development shop as it does for the makers of avionics software? If you’re attempting to prove something in a quantitative, rigorous and scientific way, how do you answer objections when you say something is a defect and someone else says it isn’t? How do you respond when someone wants to say that “there’s more to defects than coding errors”?

Those questions will become very important in the days to come. Stay tuned.

For extra reading: See Shadish, Cook, and Campbell, Experimental and Quasi-Experimental Designs for Generalized Causal Inference. This book is unusually expensive, but well worth it if you’re serious about measurement and validity.

5 replies to “Construct Validity”

  1. Hi Michael!
    Once we got a user complaint about a tool we had bought for company use. She wrote that she had experienced a serious defect: when she created comments like ‘xxx’, they were presented as underlined xxx after the comment had been saved. Our investigation showed that it was a text formatting feature the supplier had put into the tool.
    So I’ve learned that a defect for one can be a feature for another.
    But it makes life really tough when you need to put a defect definition into agreements.

  2. Shortly after reading the Kaner and Bond article you reference, I started playing with the idea that we pay more attention to construct validity when we think of an attribute as being “something you measure” rather than “something you test”. In the context I was in at the time, drawing a distinction between taking a measurement and giving a pass/fail result seemed to get people to pay more attention to the conditions under which the measurement was taken. This often generated useful discussion about construct validity, even if no one actually called it that (except me!).

  3. Hi Michael!
    The article was written in 2004 but is still valid and really good.
    At the end of it, the authors mention that several test managers have started to evaluate different test tasks (bug reporting, test planning) instead of metrics. But they could not say if it worked better. Do you know how it went for them? Can some of them share the evaluation tables?
    The reason I am asking is that I want to try it with one team and need inspiration. I want to try to evaluate bug reporting, planning and results reporting.
    Thanks for your help in advance!

    Michael replies: If you want to know about something that the authors wrote about or proposed, why not go directly to the authors?

    Meanwhile, Keith Klain was the most prominent and highly placed manager who took approaches like this. He ran the Global Test Centre at Barclays. To the best of my knowledge, the people he managed are carrying on with this approach. You might like to ask him for details.

  4. Thank you! Of course I can ask them by myself! The article did not provide the names so that is why I asked you. 🙂



