
The Most Useful Metrics

A correspondent on LinkedIn asked recently, “What are the useful metrics for software quality?” Here’s my answer (lightly edited for the blog).

Update, September 20: Whoops! Thanks to Anonymous, I notice that I’ve mistranscribed the original question, which was “What are the useful metrics for code quality?”

Measurement is “the empirical, objective assignment of numbers, according to a rule derived from a model or theory, to attributes of objects or events with the intent of describing them.” (Kaner and Bond, 2004)

So what are you trying to describe about the code?

Quality is “value to some person(s)” (Weinberg), to which James Bach and I add “who matter”.

It’s not clear to me that the quality of a software application depends mostly on the stability and quality of its code base. Lots of other factors could come into play. If the product is the only thing that does what it does, then the quality and stability of the code base might not matter much. If the product isn’t available to its users, then the quality and stability of the code base doesn’t matter at all to them–although it may matter a lot to the development organization. A product could have dozens of bugs per thousand lines of code, but if those bugs don’t matter to anyone who matters, then the metric doesn’t matter either.

In my opinion, it’s not worthwhile to talk about metrics until you’ve determined attributes of the code that you’d like to evaluate. Most of those attributes are bound to be subjective, and putting a number on them is a case of reification error. But we could still evaluate the product by assessment, rather than measurement–describing it, or telling a story about it, rather than subjecting it to some numerical model. Here are some of the things that I would consider important in evaluating a code base:

1) Is it testable? (Does it include logging, scriptable interfaces, real-time monitoring capabilities? There’s a rough sketch of such hooks after this list.)

2) Is it supportable? (Does it contain helpful error messages? Can it guide the user, the technical support department, the testers, and the developers on what to do when it gets into trouble?)

3) Is it maintainable? (Is it clearly written? Nicely modular? Easily reviewable? Accessible to those who need access to it? Is it subject to version control? Is it accompanied by unit tests? Are there assertions built into debug builds of the code?)

4) Is it portable? (Can it be adapted easily to new platforms? Operating systems? Does it depend on OS-specific third-party libraries?)

5) Is it localizable? (Can it be adapted easily to some geographic or temporal region that is different from its current one?)
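To make the first and third of those questions a little more concrete, here’s a minimal, hypothetical sketch (the class, the method, and the messages are my own invention, not drawn from any particular product) of the kinds of hooks that tend to make code easier to test, support, and maintain: a scriptable entry point, a log message that records what the code decided and why, and an assertion that fires in debug builds.

```java
import java.util.logging.Logger;

// Hypothetical example: a small, deterministic unit with a few testability hooks.
public class DiscountCalculator {
    private static final Logger log = Logger.getLogger(DiscountCalculator.class.getName());

    // A scriptable entry point: plain inputs in, plain output out, so a test
    // harness (or a support person at a console) can drive it directly.
    public double discountFor(int itemCount, double orderTotal) {
        // An assertion that fires in debug builds (run with -ea) when a caller
        // violates the contract.
        assert itemCount >= 0 : "itemCount must not be negative: " + itemCount;

        double discount = (itemCount >= 10) ? orderTotal * 0.05 : 0.0;

        // Logging that records the decision and its inputs, not just the result,
        // gives testers and support people something to work with when the
        // product gets into trouble.
        log.info(String.format("discountFor(items=%d, total=%.2f) -> discount=%.2f",
                itemCount, orderTotal, discount));
        return discount;
    }
}
```

Whether hooks like these are sufficient is, again, a judgment made by people who matter; the point is that they’re the sort of thing you can look for when you assess a code base, even though none of them reduces naturally to a single number.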

Those are just a few of the questions I would ask. Note that few to none of them are expressible as a number–but that all of them could be highly relevant to the quality of the code–that is, its value to some person.

Update (2009/10/19): I’ve written a couple of articles on this subject for Better Software Magazine:

Three Kinds of Measurement (And Two Ways to Use Them)
Better Software, Vol. 11, No. 5, July 2009

How do we know what’s going on? We measure. Are software development and testing sciences, subject to the same kind of quantitative measurement that we use in physics? If not, what kinds of measurements should we use? How could we think more usefully about measurement to get maximum value with a minimum of fuss? One thing is for sure: we waste time and effort when we try to obtain six-decimal-place answers to whole-number questions.

Issues About Metrics About Bugs
Better Software, Vol. 11, No. 4, May 2009

Managers often use metrics to help make decisions about the state of the product or the quality of the work done by the test group. Yet measurements derived from bug counts can be highly misleading because a “bug” isn’t a tangible, countable thing; it’s a label for some aspect of some relationship between some person and some product, and it’s influenced by when and how we count… and by who is doing the counting.

These columns were also reprinted in LogiGear’s Insider’s Guide to Strategic Software Testing Newsletter.

Unquantifiable doesn’t mean unmeasurable. We measure constantly without resorting to numbers. Goldilocks did it.

11 replies to “The Most Useful Metrics”

  1. but… but… but…

    That’s not how we report metrics. Here’s how those questions often get answered.

    1) Is it testable?
    We executed 4564 test cases last month and 5153 test cases this month.

    2) Is it supportable?
    We received 234 technical support calls and closed 90% of tickets within 24 hours.

    3) Is it maintainable?
    We tested 28 builds this month.

    4) Is it portable?
    We tested it on 10 different computers.

    5) Is it localizable?
    We conducted UAT with 12 real world users.

    It’s amazing how far we’ve come since Darrell Huff wrote about semi-attached figures over 50 years ago. (See the book How to Lie with Statistics.)

    😉

    Ben Simo
    QuestioningSoftware.com

  2. Earlier in this blog, I think I posted something about “The ‘Unless…’ Heuristic.” I now propose “The ‘…and?’ Heuristic”, also known as “The Dangling Data Heuristic”. This is what to do when you hear the first line of a story or the first half of a sentence without the rest of it: you ask “…and?” in hopes of hearing the information that’s missing. For example:

    Me: Is it testable?

    Pointy-haired interlocutor: We executed 4564 test cases last month and 5153 test cases this month.

    Me: …and?

    (later)

    Me: Is it supportable?

    PHI: We received 234 technical support calls and closed 90% of tickets within 24 hours.

    Me: …and?

  3. Those tell you what you’re testing, how you’re testing, what you’re looking at – they don’t tell you how effective your testing process was, but DDP might…

  4. Hi, Simon…

    You’d be more clear if you spelled out DDP.

    If you mean Defect Detection Percentage (or something like it), where you’re measuring the number of defects detected in testing versus the number of defects detected in the field, or after testing; and if you’re suggesting that it might help us to ask questions about the quality of our work, then I could agree.
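
    For anyone who hasn’t run into it: DDP is usually expressed as a simple ratio (this is a sketch of the common formulation, not a quotation from any particular source):

        DDP = defects found by the test team before release / (defects found by the test team + defects found afterward, e.g. in the field)

    So in one of the scenarios below, where the testers found one of four recorded defects, DDP = 1 / (1 + 3) = 25%.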

    If anyone were to suggest that DDP will tell us whether we’ve done a good job, I’d have to disagree with him. Why? Because a single metric like this one can’t tell us anything useful about the quality of our work unless we know some other things too.

    Consider the variables:

    – there might have been a huge number of bugs, but the bugs found by the testers were all calamitous problems, while the bugs that they didn’t detect were all quibbles.

    – there might have been a very small number of bugs. Four defects are recorded, of which the testers found one. That’s a detection percentage of 25%, which seems low, but the product was distributed to a million people.

    – the metric might have been based on the number of bugs found in the overall development process, such that every bug that was ever found by any test (including the tests performed by developers) got counted. This could result in a very high DDP rate.

    – the test team had access to a tiny set of test platforms, and the vast majority of the “escaped” bugs were based on platform problems.

    – for a complex product, the “escaped” defects could be counted over years, while the entire testing project lasted a single week; even though the testers were highly skilled and did a fantastic job in that week, the long-term numbers look bad.

    That’s just scratching the surface.

    What some metrics can do, sometimes, is to point us toward useful questions that we could ask that might assist us in assessing the work that we’ve done. This is why we reject control metrics–metrics that presume conclusiveness and are used to decide some course of action–and embrace inquiry metrics–metrics that are open-ended and prompt questions. The first question to ask about the metric is “what is our list of things that this could be telling us?”

  5. Entirely agree, Michael. I’d be sceptical of any metric that claimed to tell us whether we’ve done a good job without any thorough digging into where those numbers came from.

    I was at the SIGIST conference yesterday in London, and a guy gave a talk on DDP. He’d implemented it on a large project, and six months later his DDP was still at 100%. So does that really tell him that he’s done an awesome job as a tester/test lead? Not really.

    It doesn’t tell him how many people have installed the product and it doesn’t tell him how many support calls his tech support team have taken which are so minor they haven’t resulted in escalations.

    The most useful metric for me might be how much more testing I’d have had to run to find those escalations which have been found in the field, and how the cost of that would balance against the cost of the customer finding them. And that cost isn’t just financial.

  6. The first question to ask about the metric is “what is our list of things that this could be telling us?”

    And the second question I ask about the metric is “how might this data mislead us?”

    Defect Detection Percentage (DDP) has numerous flaws. The biggest is likely the bias of the sample. DDP will not and cannot contain all things that all people who matter consider to be a defect. DDP will not and cannot tell me the significance of the defects that were not detected internally.

    Numbers without a story are likely to hurt more than help.

    During testing, people are usually encouraged to report every possible defect. This does not happen in production. Not only are customers often not encouraged to report defects, but the process is not made easy. I regularly encounter defects in the software products I use. I rarely report the problems because it is usually easier for me to find my own workaround than to report the problem. Even if I report a problem, I suspect it may not get reported to those counting defects.

    Different people report defects differently in different situations. Should a problem count as one defect or many? Does each instance of a defect count as one or do we lump all instances of a defect together?

    All defects are not of equal impact. Sometimes it only takes one defect found in production to make a product worthless to people who matter. If we’re just counting defects, that one showstopper is valued the same as an inconsequential typo. Even typos may not be of equal consequence.

    DDP might be valuable if it leads us to ask important questions about the story behind it. We will get into trouble if we accept numbers without asking for the story.

    Pointy-haired interlocutor: Last month, our DDP increased from 96% to 99%.

    Me: … and?

  7. (Anonymous == Too Shy To Sign)

    The original poster was talking about code quality–specifically about evaluating the code itself.

    Now, if you wanted to talk about product quality, you’re absolutely right–but I’ll see your “does what it’s supposed to do” and raise you “doesn’t fail to do what it’s supposed to do” and “doesn’t do what it’s not supposed to do”. The way we break that down includes:

    Capability: (essentially what you said; does the product do things that it should be able to do?);

    Reliability: (does the product do those things more than merely occasionally, intermittently, accidentally, approximately?)

    Usability: (does the product follow the user’s workflow? does it provide affordance–that is, does it tell the user about what it can do? is it accessible to people with disabilities? does it offer a variety of useful interfaces where it’s appropriate to do so?)

    Security: (does it frustrate disfavoured users? does it offer its services only to people who are permitted to access it? does it do a reasonable job of making sure that the user is who the user claims to be?)

    Scalability: (can its use be extended to the limits that matter to people who matter? can it handle growth, with the corresponding increasing loads, more numerous and more diverse platforms, increasing stress? if it’s a big system, could it be scaled down if necessary?)

    Performance: (does the application respond sufficiently quickly and appropriately to user input? can it continue to do so under high volume, heavy loads, limited resources?)

    Installability: (can it be installed easily? scalably? can it be configured such that the user can get up and running quickly and easily? is there an uninstallation program that removes the program reliably while optionally leaving the user’s data intact?)

    Compatibility: (does it support the operating systems, communication protocols, file formats, data structures (etc.) that it’s expected to support? does it play nicely with other programs on the system? does it support legacy data?)

    Note that, with the exception of performance, narratives are generally likely to be more useful than numbers. Most of these points are subjective, subject to decisions by people who matter. For all the skill that testers should develop in measurement and statistics (and critical thinking about them), testers should also develop skill in using and developing heuristics and narratives (and critical thinking about them) that help to tell the testing story as credibly and as quickly as possible. Skill with numbers is a subset of this larger skill set.

    —Michael B. (brave enough to sign)

  8. Say, Anon… I owe you thanks for drawing my attention to the error at the top of my original post. That’s been addressed, way up there.

    —MB

  9. I’ve been uncomfortable with the idea that testers are primarily storytellers, that our work product is stories, that we are folklorists, ethnologists, reporters. It always seemed to me to mean that we had nothing “serious” to contribute. This blog post and the responses to comments helped spell out better than I’ve seen before why this is our role, and why it is important that we honor and value the nature of our contributions to our teams. If we don’t, who will? I appreciate the help, Michael. – Geordie Keitt

    Hi Michael,

    Let me thank you for taking this up here; the discussion is useful.

    My objective here is to identify the useful metrics for code quality. So it boils down to the application code.

    Most of the discussion here revolved around the whole application.

    My context here is that these metrics should help the developer check on the quality of his code, and move a step beyond gut feeling.

    Things like the following would be helpful:

    1. Is it testable?
    2. Does it follow the language conventions & guidelines?
    3. Is it supportable in the long run?
    4. Is it compatible with other environments (like OS, app server, databases, etc.)?
    5. Is it scalable to changes in technology (might need to upgrade from JDK 1.4 to JDK 1.6, etc.)?

    Happy Testing…

    Regards
    Venkat.

