The Code Coverage Deception

January 29, 2021

tdd testing code-coverage

Introduction

Some software development teams define a set of non-functional requirements for their projects. Often, this will include a minimum code coverage requirement for the automated test suite. Usually this is specified as “at least 80% coverage” or something more sophisticated like “should only increase, never decrease”. While this may seem like an innocent and well-intentioned requirement, it is mostly useless and might even be counterproductive.

What is Code Coverage?

Wikipedia defines Code Coverage, sometimes also called Test Coverage, like this:

In computer science, test coverage is a measure (in percent) of the degree to which the source code of a program is executed when a particular test suite is run. – Wikipedia

I am a firm proponent of Unit Testing and I have been practicing Test Driven Development for almost 20 years. When I first heard, more than a decade ago, about tools that could tell you how much of your code is covered by tests, I was delighted.

In the simplest case, Code Coverage is measured by instrumenting the code, running the test suite and counting which "parts" of the code were executed during the test run. What is actually counted, and how, depends on the Code Coverage tool used. Most tools these days simply compare the lines or statements that were executed with the total number of lines or statements.
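To make the idea of instrumentation more concrete, here is a minimal hand-written sketch. Real tools insert their probes automatically and usually at bytecode level; the class name, the absolute method and the probe numbering are my own assumptions for illustration only.

```java
import java.util.BitSet;

// Hand-instrumented sketch: one probe in front of every statement.
class StatementCoverageSketch {
    static final int STATEMENTS = 4;
    static final BitSet executed = new BitSet(STATEMENTS);

    // Hypothetical method under test.
    static int absolute(int x) {
        executed.set(0); int result = x;
        executed.set(1); if (x < 0) {
            executed.set(2); result = -x;
        }
        executed.set(3); return result;
    }

    // Executed statements compared with the total number of statements.
    static int coveragePercent() {
        return 100 * executed.cardinality() / STATEMENTS;
    }

    public static void main(String[] args) {
        absolute(5);
        System.out.println(coveragePercent() + "%"); // 75%

        executed.clear();
        absolute(-5);
        System.out.println(coveragePercent() + "%"); // 100%
    }
}
```

A single call with a negative argument is enough to reach 100% here, although the behavior for non-negative input was never exercised on its own.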

Code Coverage as a metric for the completeness of a test suite is flawed. It is entirely possible for a piece of code to have a Code Coverage of 100% with an incomplete test suite. This defeats the purpose of requiring a certain minimum coverage. The requirement, whether it is a guideline or a hard rule, might also yield outcomes that were not intended.

The Completeness Fallacy

A Simple Example

Consider the following two Java classes. The first class is doing a calculation of some kind and the second class contains a unit test for the first. Even if you are not familiar with Java, you can probably understand what the code is doing.

public class CodeCoverageExample {
    public int calculate(int value, int otherValue) {
        int sum = value + otherValue;

        if (otherValue > 13) {
            sum *= 2;
        }

        return sum;
    }
}

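// JUnit 5 (Jupiter) imports
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

public class CodeCoverageExampleTest {
    @Test
    void test() {
        // sut = system under test
        var sut = new CodeCoverageExample();

        // otherValue is greater than 13, so the sum is doubled: (100 + 20) * 2
        int result = sut.calculate(100, 20);

        assertEquals(240, result);
    }
}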

Most Code Coverage tools will tell you that the Code Coverage of the test is 100%, the perfect score. The test case executes every single executable line. Unfortunately, this in no way suggests that the code is perfectly tested. We are being lied to. The code coverage, as it is usually calculated, just looks at which statements are executed in a test run. In our case, all statements are executed. The number of executed statements is a mostly useless metric because it makes you believe that if all statements were executed, all logical code paths were executed as well. This is the big deception of code coverage.

The code in our CodeCoverageExample class above, in fact, contains two code paths. One code path is executed when the test is run and it covers all the lines of code. The test calls the method with otherValue being 20, which triggers the conditional block, but we do not have a test for the case where the conditional block is not executed (i.e. when otherValue is less than or equal to 13). This second code path differs from the first because certain lines are not executed. This is something that Code Coverage usually fails to communicate.
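A sketch of what checking both code paths could look like follows. It repeats the calculate logic so that it is self-contained, and it uses a small check helper instead of JUnit so that it runs without dependencies; the class name and the helper are assumptions of mine, not part of the original example.

```java
// Plain-Java sketch; check() stands in for JUnit's assertEquals.
class BothPathsSketch {
    // Same logic as CodeCoverageExample.calculate above.
    static int calculate(int value, int otherValue) {
        int sum = value + otherValue;

        if (otherValue > 13) {
            sum *= 2;
        }

        return sum;
    }

    static void check(int expected, int actual) {
        if (expected != actual) {
            throw new AssertionError("expected " + expected + " but was " + actual);
        }
    }

    public static void main(String[] args) {
        check(240, calculate(100, 20)); // conditional block executed
        check(113, calculate(100, 13)); // boundary: 13 is not greater than 13
        check(228, calculate(100, 14)); // smallest otherValue that doubles the sum
        System.out.println("both code paths checked");
    }
}
```

The second and third checks add nothing to the statement coverage, which was already 100%, but they are the ones that pin down where the condition flips.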

Conditionals with Multiple Terms

A similar, but less obvious case is any conditional with more than one term. Just covering a positive and a negative case for the conditional does not cover all cases, as you now have 2ⁿ possible combinations (with n being the number of terms). Let’s change the condition of the code example above.

if (otherValue < 13 || otherValue > 42)

In the test, we now call the method with the values 100 and 12, asserting 224. We now know that the code behaves correctly if otherValue is less than 13. The conditional branch should also be executed if otherValue is greater than 42. We tested for one boundary case, but not for the other one. This example now has three code paths: one for otherValue < 13, one for otherValue > 42, and another one for otherValue >= 13 && otherValue <= 42. That second term might be wrong and we would never know until stuff breaks.

By the way, I mentioned 2ⁿ earlier, but I only talked about three cases. The fourth case would be otherValue < 13 && otherValue > 42 which is impossible.
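The three reachable cases could be checked roughly like this. Again, this is a self-contained sketch with a hypothetical class name and a hand-rolled check helper rather than JUnit.

```java
// Plain-Java sketch covering the three reachable cases of the two-term condition.
class MultiTermSketch {
    // calculate from above with the changed condition.
    static int calculate(int value, int otherValue) {
        int sum = value + otherValue;

        if (otherValue < 13 || otherValue > 42) {
            sum *= 2;
        }

        return sum;
    }

    static void check(int expected, int actual) {
        if (expected != actual) {
            throw new AssertionError("expected " + expected + " but was " + actual);
        }
    }

    public static void main(String[] args) {
        check(224, calculate(100, 12)); // first term true
        check(286, calculate(100, 43)); // second term true
        check(113, calculate(100, 13)); // both terms false, lower boundary
        check(142, calculate(100, 42)); // both terms false, upper boundary
        System.out.println("all three cases checked");
    }
}
```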

Iterations

Loops might have hidden coverage problems as well. If the number of iterations is variable and all your test cases work with input data that executes the loop at least once, you also get full coverage. Just like with the first simple example above, the case where the iteration block is not executed is not covered.

import java.util.List;

public class IterationExample {
    public String join(List<String> lines) {
        // Stays null if the loop body is never executed
        String result = null;

        for (var line: lines) {
            if (result == null) {
                result = line;
            } else {
                result += ", " + line;
            }
        }
        return result;
    }
}

Is it on purpose that join() is returning null if lines is empty? Do you have a test for it?
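The following sketch makes the uncovered case visible. It repeats the join logic in a hypothetical class so that it can be run on its own; whether null is the right answer for an empty list is exactly the question a test would force you to decide.

```java
import java.util.List;

class EmptyJoinSketch {
    // Same logic as IterationExample.join above.
    static String join(List<String> lines) {
        String result = null;

        for (var line : lines) {
            if (result == null) {
                result = line;
            } else {
                result += ", " + line;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(join(List.of("a", "b"))); // a, b
        System.out.println(join(List.of()));         // null
    }
}
```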

Good Coverage is not Equal to Good Code

Code Coverage is not the Goal of Testing

If you just write tests for the sake of increasing the Code Coverage numbers, you might as well stop testing. The goal of testing is code you can rely on, not some meaningless numbers. A very high Code Coverage does not tell me anything about the quality of the code. This is similar to the value of the shares of a company. The value only means that somebody is willing to pay this amount for a share. It does not mean that the company itself is healthy or unhealthy.

False Sense of Security

Frequently checking your coverage numbers and being proud of them might give you a false sense of security. Some teams even put the current coverage numbers on their team dashboards, with a red-yellow-green color code. Green might mean you hit a certain threshold with your Code Coverage. It is still possible that the most important parts of your code are in pitiful shape.

Ignoring Trivial Methods

A common feature request for code coverage tools is ignoring trivial methods like setters and getters. People don’t want to write tests for getters and setters, just to achieve a higher test coverage.

Such a feature would just increase the code coverage artificially and hide another problem. Let’s assume you have an application that would have 100% code coverage if it were not for these trivial methods. This means that all of your code gets exercised through your test suite except for the setters and getters. This raises the question of why the trivial methods are there in the first place. If all of your code is run and the methods are not called, then your 100% coverage is superficial. All the code is run, but not all use cases are tested. This is the precise reason code coverage is deceiving.

I know that sometimes such methods are required by certain application frameworks and would never be exercised in a test scenario. But with such applications, you never reach an acceptable code coverage, anyway. Try to separate the business code into its own package or library or whatever means of modularization your programming language and platform offer. Test only that and disregard the coverage of everything else.

There is one example that is always bothering me. Sometimes I implement toString() in Java classes just to get better readability for test failures. “Peter Parker” is easier to understand than “Person@9376e182”. These methods are only ever exercised when a test fails. They will never be covered as long as the test suite is green. So what? Am I going to write tests just for these methods? Certainly not. This would be a foolish waste of time.

Gaming the Requirement

Any requirement that is based on a code metric will make the metric useless, as it can and will be gamed. More formally:

Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes. – Goodhart’s Law

In other words, if you require code to have a certain minimum Code Coverage, developers will provide for that, but potentially not in a way that would be desired.

It is rather easy to write tests that execute a huge chunk of code without actually testing it. In the simplest case, you could just execute the code, omitting any assertions on the behavior. Some testing tools can flag tests without assertions, but it does not take a genius to come up with trivial assertions that do not assert anything useful.
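Here is a small sketch of such a test. The deliberately broken brokenCalculate method and both "tests" are hypothetical and modeled as plain methods returning true when they pass, so that the example runs without a test framework.

```java
class GamedTestSketch {
    // Deliberately broken variant of calculate: it triples instead of doubling.
    static int brokenCalculate(int value, int otherValue) {
        int sum = value + otherValue;

        if (otherValue > 13) {
            sum *= 3; // bug
        }

        return sum;
    }

    // Executes every statement but asserts nothing.
    // It can only fail if the code throws an exception.
    static boolean gamedTest() {
        brokenCalculate(100, 20);
        return true;
    }

    // Checks the actual result.
    static boolean honestTest() {
        return brokenCalculate(100, 20) == 240;
    }

    public static void main(String[] args) {
        System.out.println("gamed test passes: " + gamedTest());   // true
        System.out.println("honest test passes: " + honestTest()); // false
    }
}
```

Both tests produce the same statement coverage, but only one of them notices the bug.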

Gaming the requirement might not even be intentional or malicious. Depending on how the code is called, there might be accidental coverage. This happens if certain parts of the code are executed without being the focus of the test. If the developer stops testing as soon as the coverage goal is reached, you might end up with untested code.

These are all reasons why a Code Coverage metric should never be a requirement.

Alternatives

More Sophisticated Coverage Criteria

Most Code Coverage tools only offer statement or line coverage, the latter being even more deceitful in the case of multiple statements on one line. The article about Code Coverage in the English Wikipedia offers more involved alternatives that tackle most of the problems, but they are not very common yet. A combination of different coverage criteria would probably yield the best results. I want to highlight two of them.

Branch Coverage: Measures which branches of the code have been executed. This would tackle the problems with the initial example above and the iteration example.
Condition Coverage: Measures which sub-expressions of conditionals were evaluated to both true and false. This would help with the initial example as well as with the "Conditionals with Multiple Terms" example.
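As a rough illustration of Branch Coverage, the following hand-instrumented sketch records both outcomes of the condition from the initial example. The class name and the way the outcomes are recorded are assumptions for illustration; real tools do this automatically and differently.

```java
class BranchCoverageSketch {
    // outcomes[0]: condition was true at least once
    // outcomes[1]: condition was false at least once
    static final boolean[] outcomes = new boolean[2];

    // calculate from the initial example, recording which way the branch went.
    static int calculate(int value, int otherValue) {
        int sum = value + otherValue;

        boolean taken = otherValue > 13;
        outcomes[taken ? 0 : 1] = true;
        if (taken) {
            sum *= 2;
        }

        return sum;
    }

    static int branchCoveragePercent() {
        int seen = 0;
        for (boolean outcome : outcomes) {
            if (outcome) {
                seen++;
            }
        }
        return 100 * seen / outcomes.length;
    }

    public static void main(String[] args) {
        calculate(100, 20);
        System.out.println(branchCoveragePercent() + "%"); // 50%

        calculate(100, 5);
        System.out.println(branchCoveragePercent() + "%"); // 100%
    }
}
```

The single test from the initial example yields 100% statement coverage but only 50% in this sketch, which is a much more honest number.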

More and more Code Coverage tools are now adopting these more sophisticated coverage criteria.

Mutation Testing

Mutation testing is sometimes employed to detect whether certain areas of code are not tested properly. This is achieved by automatically changing the tested code a little (the result is called a “mutant”) and running the tests, then reverting the change, changing something else and running the tests again. Rinse and repeat. If a mutant is introduced and no test breaks, the behavior of that code is not verified by any of the tests, even if the code was executed.
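The principle can be sketched by hand with a single mutant of the initial example. Mutation testing tools generate and run such mutants automatically; the names below and the modeling of a test suite as a Predicate are assumptions made only to keep the sketch self-contained.

```java
import java.util.function.IntBinaryOperator;
import java.util.function.Predicate;

class MutationSketch {
    // Original logic of calculate from the initial example.
    static final IntBinaryOperator ORIGINAL = (value, otherValue) -> {
        int sum = value + otherValue;
        if (otherValue > 13) {
            sum *= 2;
        }
        return sum;
    };

    // Hand-written mutant: ">" was changed to ">=".
    static final IntBinaryOperator MUTANT = (value, otherValue) -> {
        int sum = value + otherValue;
        if (otherValue >= 13) {
            sum *= 2;
        }
        return sum;
    };

    // The single test from the initial example.
    static final Predicate<IntBinaryOperator> WEAK_SUITE =
            calc -> calc.applyAsInt(100, 20) == 240;

    // Adds a test for the boundary value 13.
    static final Predicate<IntBinaryOperator> STRONGER_SUITE =
            calc -> calc.applyAsInt(100, 20) == 240
                    && calc.applyAsInt(100, 13) == 113;

    // A suite kills the mutant if it passes on the original and fails on the mutant.
    static boolean kills(Predicate<IntBinaryOperator> suite) {
        return suite.test(ORIGINAL) && !suite.test(MUTANT);
    }

    public static void main(String[] args) {
        System.out.println("weak suite kills mutant: " + kills(WEAK_SUITE));         // false
        System.out.println("stronger suite kills mutant: " + kills(STRONGER_SUITE)); // true
    }
}
```

The weak suite reaches 100% statement coverage, yet the mutant survives it.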

Test Driven Development

Proper Test Driven Development (TDD) ensures that the code you deem worthy of being tested is fully covered. It forces you to refrain from writing any production code without having a test that demands it. This is by design. With TDD you always end up with fully tested code, if you play by the rules.

Still Useful

Does that mean that Code Coverage is entirely useless? Of course not. I still use it to identify areas in legacy code that are not well tested. What I am looking for is areas with low coverage, as it shows that bigger parts of that area are not executed by the tests. This means I use it as a negative indicator (i.e. not enough tests), never as a positive one (i.e. enough tests).

Conclusion

Never use Code Coverage for more than an ad hoc indicator for identifying areas of code that could need some testing love. Never use it as an actual metric for anything. Code Coverage measures lines of code executed, not the quality of those lines nor the quality of the test suite.

What about the code coverage requirement? Either ignore it or discuss with your fellow developers why it is useless and have it removed. The requirement was imposed upon you by higher management? Find out the reasoning behind the requirement and discuss alternatives.

If there are teams in your organization that skimp on proper test suites, a Code Coverage requirement can help to show them where they are lacking. Use it as a negative indicator, but don't fall into the trap of assuming that the code is properly tested if the coverage meets the required percentage.