38

I am adding unit tests to some existing code which initially was not tested at all.

I have set up my CI system to fail builds where the code coverage percentage is reduced compared to the previous build - mainly to set a path of continuing improvement.

I then encountered an unexpected and annoying (although mathematically correct) situation that can be reduced to the following:

  • Before refactor:
    Method 1: 100 lines of code, 10 covered --> 10% coverage
    Method 2: 20 lines of code, 15 covered --> 75% coverage

    Total: 25 / 120 --> ~20% coverage

  • After refactor:
    Method 1: 100 lines of code, 10 covered --> 10% coverage (untouched)
    Method 2: 5 lines of code, 5 covered --> 100% coverage

    Total: 15 / 105 --> ~14% coverage

So even though IMO the situation was improved, my CI system obviously does not agree.

Admittedly, this is a very esoteric problem, and it will probably disappear once the bulk of the code is covered better, but I would appreciate insights and ideas (or tools/configurations) that might allow me to keep enforcing an "improvement" path for coverage with my CI system.

The environment is Java, Jenkins, JaCoCo.
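To make the arithmetic concrete, here is a small standalone sketch (plain Java, nothing JaCoCo-specific) that reproduces the percentages above - total coverage is simply covered lines divided by total lines, so shrinking a well-covered method shifts weight toward the poorly covered one:

```java
// Reproduces the before/after totals from the example above.
public class CoverageMath {
    static double pct(int covered, int total) { return 100.0 * covered / total; }

    public static void main(String[] args) {
        System.out.printf("before: %.1f%%%n", pct(10 + 15, 100 + 20)); // ~20.8%
        System.out.printf("after:  %.1f%%%n", pct(10 + 5, 100 + 5));   // ~14.3%
    }
}
```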

Captain Man
Jonathan Gross
  • One way to view the effect: total coverage ratio is the weighted sum of coverage of the components, weighted by their relative size (.833 * 10% + .167 * 75%, etc.). So your refactoring also changed weights, giving more weight to less covered components. – Pablo H Nov 08 '19 at 10:56
  • My suggestion? Disable the coverage nanny temporarily for this commit and move on. These tools are aids to measure code quality, but they shouldn't be the only tool you use to judge code quality. You **know** you actually have improved the code; if the measurement tool didn't catch that, the tool is wrong, so just tell it not to freak out, and don't freak out yourself worrying too much about it. – Lie Ryan Nov 08 '19 at 11:28
  • I built a tool which is a little better at managing this situation: as long as the lines you have changed are still covered, the tool passes, so in this case, since the 5 new lines are covered, diffFilter would pass the build. https://github.com/exussum12/coverageChecker – exussum Nov 08 '19 at 13:52
  • I think you've encountered [this](https://en.wikipedia.org/wiki/Simpson%27s_paradox). It's a shame the tool doesn't check whether you've improved or preserved results for each part. – J.G. Nov 08 '19 at 15:36
  • @LieRyan Please post your suggestion properly as an answer and leave comments for questions seeking clarification. – Captain Man Nov 08 '19 at 15:37
  • The metric isn't lying to you. You're relying on a metric that counts the coverage of lines, not e.g. average coverage of classes or files. The problem you're highlighting is effectively how you want it to be measured. If you want it to be measured differently, use a different metric. – Flater Nov 08 '19 at 15:54
  • If you are enforcing a policy that emphasizes code coverage, then you need to emphasize coverage in your development efforts, i.e. prioritize coverage of the 90 un-covered lines in your method #1. In other words, use the 80/20 rule - focus first on increasing coverage of method #1 (or perhaps reduce its line count) before tackling the method #2 refactor. – Zenilogix Nov 09 '19 at 03:50
  • "I turned the light on but I want it to be dark now, what should I do?" – OrangeDog Nov 09 '19 at 16:23
  • @LieRyan - That was exactly what I did. I asked this question as I expect to experience this issue for a while, until the average coverage is above 50%. – Jonathan Gross Nov 10 '19 at 10:05
  • @exussum - I will investigate your tool, seems very useful – Jonathan Gross Nov 10 '19 at 10:25
  • @J.G. - Thanks, having a label for the situation is helpful – Jonathan Gross Nov 10 '19 at 11:26
  • @Zenilogix - From a numbers perspective you are right, but I try to first test not the largest pieces of code, but the more critical and referenced pieces of code. I could not find an easily implementable metric that measures what I want, so I added this imperfect percentage metric, and it "failed" me. – Jonathan Gross Nov 10 '19 at 12:17
  • @Flater - my question was exactly this - is there a tool (or a setting for my existing tools) that will use a metric with a better fit to my needs? – Jonathan Gross Nov 10 '19 at 12:18
  • @PabloH - Exactly, hoped to find a relatively easy way to use a metric that better fits my needs – Jonathan Gross Nov 10 '19 at 12:21
  • It's the same problem as measuring police effectiveness with statistics. Imagine you're a chief of police. You can arrest the drug lord who supplies drugs to the whole city, and you get 1 arrest (at the cost of great danger to the policemen taking part in the action). Or you can send them out to round up kids smoking pot, and get 100 arrests basically for free. See the problem? It's the same fallacy as the megapixels war: "when you can't measure the thing that matters, measure something else and make it matter". But neither coverage %, number of arrests nor megapixels really matter. – Agent_L Nov 10 '19 at 15:06
  • @Agent_L - Interesting analogy, but coverage is (at this point) useful to me, although not perfect. 75% covered code is *almost definitely* better tested than 15% covered code, although not necessarily better tested than 90% covered code. We do use other methods to make sure the tests are effective (i.e. the quality of the arrests), such as code reviews. Note that I don't strive for 100% coverage, but a gradual increase, and was asking how to improve my metric. – Jonathan Gross Nov 10 '19 at 18:38
  • @JonathanGross I'm not advocating abandoning the metrics. My point is to use the metrics as long as it serves the purpose (of good code here). But don't let your focus slip away from the elusive purpose and fix on the tangible metrics. That's what we, humans, are prone to. My examples are only to show it's a common pattern, something one should always be aware of. It's impossible (or maybe just infeasible) to improve the metrics to be 100% accurate. The time spent on improving the metrics is usually better spent on improving the code. – Agent_L Nov 11 '19 at 14:33
  • Fun fact: this is known as [Simpson's Paradox](https://en.m.wikipedia.org/wiki/Simpson%27s_paradox) – Jared Goguen Nov 13 '19 at 23:59

11 Answers

47

The problem I see here is that you have made the code coverage a trigger for build failure. I do believe that code coverage should be something that is routinely reviewed, but as you have experienced, you can have temporary reductions in your pursuit of higher code coverage.

In general, build failures should be predictable. The following make good build failure events:

  • Code will not compile
  • Tests will not run
  • Tests fail
  • Post-build packaging fails (e.g. can't make containers)
  • Packages cannot be published to repositories

All of these are pass/fail, quantitative measures: they work (binary value 1) or they don't (binary value 0).

Code quality measures should be monitored because they are qualitative. NOTE: percentages are a qualitative measure even though they have a number associated with them. The following are qualitative measures:

  • Cyclomatic complexity (a number associated with a qualitative concept, i.e. the number has a qualitative meaning)
  • Maintainability index (a number associated with a qualitative concept, i.e. the number has a qualitative meaning)
  • Percentage covered lines/branches/methods (qualitative summary of quantitative results)
  • Static analysis results (no numbers involved)

As with any trend, when you look over past releases you can find momentary reductions while the overall trend improves or stays the same (100% remains 100%). If the long term trend is toward less tested or maintainable code then you have a team problem you need to address. If the long term trend is toward higher quality code based on your measurements, then the team is doing their jobs.

Berin Loritsch
  • I'm not sure I agree with your classification of "qualitative" vs. "quantitative" – I thought quantitative means having a number (which is the case in your second group). – Paŭlo Ebermann Nov 09 '19 at 01:33
  • @PaŭloEbermann I agree; it seems quantitative and qualitative have been used exactly backwards in this answers. A pass/fail test is qualitative, the degree of test coverage is quantitative. – Williham Totland Nov 09 '19 at 16:33
  • qualitative: relating to, measuring, or measured by the quality of something rather than its quantity. **in this case** just because a number is assigned for relative comparison sake, doesn't mean that number has any relevance taken in isolation. Pass/Fail is 0 or 1, binary, a concept that we should all be familiar with, and very much a quantity. – Berin Loritsch Nov 09 '19 at 17:32
  • In general, percentages are used in qualitative analysis, particularly when comparing things that are of unequal size where the proportion of things important. – Berin Loritsch Nov 09 '19 at 17:35
  • @BerinLoritsch - I agree, and your insights are valuable; the coverage metric is not a perfect metric to fail a build on, but I am pragmatic here - I noticed that sometimes, especially when under pressure, people tend to cut corners, so I wanted a gatekeeper. The gatekeeper is pretty weak - just don't make things worse, coverage-wise. Sometimes the "we are all grown-ups here" notion does not work if not enforced. – Jonathan Gross Nov 10 '19 at 10:09
  • I strongly disagree. Sure, you do not want quantitative problems to fail all builds, but having higher standards for pulling into master is perfectly fine. You simply don't want lower-quality code in your main branch even if it's "just for a while" - do your improvements in a separate branch and merge when finished. That said, obviously you have to figure out how to define better and worse, which as this shows isn't necessarily easy. Code coverage in particular is probably not a great choice for this, since there are simply things which are hard to test or simply impossible. – Voo Nov 10 '19 at 13:12
  • @Voo, who is calling for lower standards? Read the whole answer. Should teams use whatever they can to improve code quality? Absolutely!!! Should it automatically fail an otherwise valid build? No. That's why you peer review pull requests, to ensure the code meets quality standards. I think you are attributing something to my answer that is not there. – Berin Loritsch Nov 10 '19 at 19:25
  • @Berin My disagreement is about the position that builds should never cover qualitative measures. The master branch of the code should stand up to higher standards than simply "does it compile and are all tests green". You want to guarantee that code quality has a certain level, and peer review is notoriously bad at guaranteeing these things during times of stress. – Voo Nov 10 '19 at 20:22
37

Have you considered not using code coverage metrics?

I'm not going to argue that code coverage isn't something that you should look at. It absolutely is. It's good to keep track of what was covered before a build and after a build. It's also good to make sure that you're providing coverage over new and modified lines of code in a change (and, depending on your environment, there may be tools that can facilitate this).

But as you're seeing right now, you can refactor and end up removing lines that were covered, such that the percentage of lines covered goes down. I'd agree with your reasoning - the logical coverage has not changed. You've also probably simplified other aspects of your code (I'd be curious how various complexity measures have changed before and after your refactoring, for example). But the math is correct - your percentage of lines covered has, in fact, gone down.

Additionally, code coverage says absolutely nothing about the effectiveness or quality of the tests. It also says nothing about the risk behind each part of the code. It's relatively easy to write tests that cover trivial lines or check the simplest logic while not giving much confidence in the quality of the system.

Thomas Owens
  • "*Additionally, code coverage says absolutely nothing about the effectiveness or quality of the tests.*" very important point. I've heard from people who worked in companies that demand 100% code coverage or as close as possible (say, 95% or 98% minimum or thereabouts). They end up writing tests for dumb getters and setters and don't even bother to *assert* the results are correct - the tests are in the form of `call getter` -> `done`. Technically, that's a test. Practically, it tells you nothing. Other tests can also work in this fashion and not show correctness of code at all. – VLAZ Nov 08 '19 at 07:56
  • Also note there are different types of coverage. Line coverage, condition coverage etc. Line coverage is the worst to have. – Sulthan Nov 08 '19 at 08:50
  • While CC is not an amazing metric, it is a useful metric. If I see a project with 10% CC and one with 85% CC, it gives me some useful information. Of course, it doesn't tell you that your tests are perfect. But that is not a reason to not have any standards. And having high CC is a good start for test automation, as having covered code means it can be run by an automated test. – Euphoric Nov 08 '19 at 10:19
  • For the most part, I usually have two builds, one with coverage and one without coverage. The build that gets deployed uses the build without coverage. We trigger the coverage build at certain intervals and examine the results for opportunities for more coverage. We don't cover everything, but we want the tests that are covering to be useful and helpful for any new engineers coming on the code base. So maybe this is a strategy that could be employed. – Jon Raynor Nov 08 '19 at 15:02
  • The advice I hear is that low coverage means you should think about adding useful tests, not about getting the coverage up. – toolforger Nov 08 '19 at 16:01
  • This is worthy advice. I like CC when retrofitting tests to legacy code, or discovering areas that _might_ need tests. I don't like CC as a continuing metric, because people will inevitably chase that number by testing low risk code, or writing tests that are too complex/fragile. Anything to get that number up. – Wayne Conrad Nov 08 '19 at 21:42
  • And by "people," I mean me. I don't monitor CC on an ongoing basis because, even knowing the pitfalls, I am too tempted to write bad or useless tests in order to drive that number up. – Wayne Conrad Nov 08 '19 at 21:45
  • @VLAZ That's where mutation testing comes in. If the code can be changed without causing any tests to fail, you need to change your code or change your tests. – CJ Dennis Nov 09 '19 at 03:58
  • @JonRaynor - Having two builds is an interesting idea – Jonathan Gross Nov 10 '19 at 10:13
  • @Euphoric - My thoughts exactly - CC is not perfect, but useful. – Jonathan Gross Nov 10 '19 at 10:14
  • @toolforger While 100% or even 90% code coverage is a fool's errand, for the vast majority of code bases, if you're writing sensible tests you should have 70-80% coverage easily. You can certainly fool the system, but then the same is true for virtually every code metric you could think of. That's where code reviews and people understanding why this is important come into play. – Voo Nov 10 '19 at 13:17
18

You can mitigate the effect to some degree by allowing the relative code coverage to decrease when the total number of uncovered lines also decreases, or when the total number of lines decreases, since these are pretty clear signs of a refactoring which sets a new baseline for your coverage metrics.

In your example, the total number of uncovered lines decreases from 95 to 90, and the total number of lines from 120 to 105. That should give you enough confidence that the relative coverage is quite unimportant in this situation. However, if you add new code, the metric still reflects the expectation of not allowing less relative coverage for the new code than the relative coverage you had before.

Side note: be aware that none of these metrics tells you whether the tests are covering the most sensible parts of the code base.
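A rough sketch of such a relaxed gate (plain arithmetic only, not a JaCoCo or Jenkins API; the previous build's numbers would have to come from wherever you archive them):

```java
// Allow the coverage ratio to drop only when the change also removed uncovered
// lines (or shrank the code base) in absolute terms - the "new baseline" case.
class CoverageGate {
    static boolean passes(int prevCovered, int prevTotal, int covered, int total) {
        double prevRatio = (double) prevCovered / prevTotal;
        double ratio = (double) covered / total;
        if (ratio >= prevRatio) return true;                    // normal rule: no decrease
        int prevUncovered = prevTotal - prevCovered;
        int uncovered = total - covered;
        return uncovered < prevUncovered || total < prevTotal;  // relaxation
    }

    public static void main(String[] args) {
        // The question's numbers: 25/120 before, 15/105 after.
        System.out.println(passes(25, 120, 15, 105)); // true: 90 uncovered < 95 uncovered
    }
}
```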

Doc Brown
  • Eventually this is the course of action I took, allowed a 2% reduction. I just don't want this to allow a slow decline over time. – Jonathan Gross Nov 10 '19 at 10:15
  • I chose this answer mainly for being pragmatic and effective. I implemented the same solution even before asking this question. Hopefully when the bulk of the code is tested, I would not need to use this axe. The discussion around my question was most interesting and I got many good insights and learned a thing or two as well, but at the end of the day the issue was the difficulty for a machine to appreciate the quality of a change that is (arguably) obvious to a person. Thanks everyone! – Jonathan Gross Nov 10 '19 at 11:42
15

This is called Simpson's paradox, and it is a well-known statistical issue with your approach.
You could even construct cases where you refactor and afterwards every single method has a higher coverage, but the overall coverage still goes down.

You will need to find other ways to catch 'regressions' of the type you want to catch - overall percentages won't do it (although I liked the approach when reading it).
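To illustrate, here is a small constructed case (the numbers are made up purely for this sketch) where both methods individually gain coverage yet the aggregate percentage falls, because the change shifts weight toward the poorly covered method:

```java
// Hypothetical numbers chosen only to demonstrate Simpson's paradox on coverage.
public class SimpsonsParadoxDemo {
    static double pct(int covered, int total) { return 100.0 * covered / total; }

    public static void main(String[] args) {
        // Before: A = 9/10 lines covered (90%), B = 10/100 lines covered (10%)
        System.out.printf("before total: %.1f%%%n", pct(9 + 10, 10 + 100));  // ~17.3%
        // After:  A = 10/10 lines covered (100%), B = 24/200 lines covered (12%)
        System.out.printf("after total:  %.1f%%%n", pct(10 + 24, 10 + 200)); // ~16.2%
        // Both methods improved individually, yet the aggregate percentage fell.
    }
}
```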

Aganju
5

While all the answers say not to use code coverage as a quality metric, none really answered the question: what to do with reduced coverage?
The answer is quite simple: use absolute numbers, not percentages. More important than the coverage percentage are the number of untested functions and the number of untested branches. It would also be great to get a list of the untested functions and branches.
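As a minimal sketch of reading those absolute numbers (assuming the standard jacoco.xml report layout with report-level `counter` elements, and a hypothetical report path - adjust to your build):

```java
// Prints the absolute number of untested methods and branches from a JaCoCo XML
// report; a CI gate could compare these against the previous build instead of
// comparing percentages.
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class MissedCounters {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        // The report declares a DOCTYPE; skip fetching the external DTD.
        dbf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        Document doc = dbf.newDocumentBuilder().parse(new File("target/site/jacoco/jacoco.xml"));

        // Report-level counters are direct children of the <report> element.
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n.getNodeType() == Node.ELEMENT_NODE && "counter".equals(n.getNodeName())) {
                Element e = (Element) n;
                String type = e.getAttribute("type");
                if ("METHOD".equals(type) || "BRANCH".equals(type)) {
                    System.out.println("Untested " + type + ": " + e.getAttribute("missed"));
                }
            }
        }
    }
}
```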

BЈовић
  • Yes, currently OP is in the paradoxical situation that removing code lines lowers his metric. Using absolute instead of relative code coverage can at least fix this aspect of the problem. – kutschkem Nov 08 '19 at 10:50
  • Absolute number of lines here is still a reduction - from 25 to 15. – afaulconbridge Nov 08 '19 at 12:59
  • @afaulconbridge Yes, and that number is irrelevant. As I said in the answer, what is important is increase in number of untested functions and untested branches. – BЈовић Nov 08 '19 at 13:03
2

One option that might help with this is to switch from line coverage to branch coverage. I say 'might' help because you can run into a similar situation with uncovered branches if you reduce branches in a function.

Overall, though, branch coverage is a more meaningful metric than line coverage. And I'll hazard a guess that when you refactor a long method into a short one, you usually remove more lines of code than branches. Of course, if it's really nasty code, then you might end up pruning a lot of branches. Even in that case, I think you are still better off counting branches.
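For illustration (a hypothetical method, not from the question's code base), here is the kind of gap that line coverage hides but branch coverage reports:

```java
// One executable line, but two branch outcomes.
public class Discounts {
    static double price(double base, boolean member) {
        return member ? base * 0.9 : base;
    }
}

// A test suite that only ever calls price(100.0, true) reports 100% line
// coverage for this method, while branch coverage reports the non-member
// outcome as missed.
```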

JimmyJames
  • The problem I see with using branch coverage as the metric to focus on is that branches are harder to test and cover completely. It might lead devs to put unnecessary effort into covering corner execution paths that are totally meaningless for the business. Branches generate corner cases more often than we wish, and it's hard to say whether they are dead ends or possible (but rare) cases. It has all the ingredients to turn devs into paranoid-tester mode. – Laiv Nov 08 '19 at 06:46
  • @Laiv My team have been using it for years and this has never been an issue but YMMV. I find this logic suspect, however. Essentially you are saying you want to be ignorant of untested branches because it might be too hard to test them. If you don't know what is untested, how do you know that? Perhaps they are infrequent cases but the impact is critical. Branches that are not tested could be indicative of flaws in the program flow i.e. they shouldn't exist. Eliminating the branch is a possible solution. – JimmyJames Nov 08 '19 at 16:21
  • It is the corner-case branches that generally hide the bugs, however. Branches taken thousands of times a day very quickly have code that works; it is the weird edge case that hides the totally bogus code. Far more important to have tests for the error-handling edge cases than for the stuff that will be noticed by the dev the first time they hit run. Coverage is a useful METRIC, but as soon as you treat it as more than a vaguely interesting number it (like all metrics) tends to become the tail that wags the dog. This is not something you need hard-failing your builds. – Dan Mills Nov 08 '19 at 16:41
  • This is good advice, but in my case, most of the CC metrics were reduced, lines as well as branches. – Jonathan Gross Nov 10 '19 at 10:19
  • @DanMills - I agree, we do not chase the coverage number itself. Having said that, assuming that every test added is effective, for new or existing code (verified via other methods such as code reviews), the coverage will increase (with exceptions such as the curiosity described in the question), so I find the automatic build fail useful, especially if triggered for merge requests, as an early indication to the developer they are missing something. – Jonathan Gross Nov 10 '19 at 18:52
  • @JonathanGross I agree that some level of requirement is useful. That way the time in code review can be spent more on whether the tests are good and meaningful and less on whether there is enough coverage. In fact, putting in rules like this can help prevent the real problem here: the large method with low coverage. – JimmyJames Nov 11 '19 at 15:42
2

What you are doing is not wrong. While code coverage is not an amazing metric, it is a useful one. It does not tell you that your code is perfectly tested. But not having some level of standard that you expect from your tests is also a bad idea.

One way to solve this problem is to simply change the limit if you believe the change is justified. The whole point of this kind of hard limit is to notify you quickly when something bad happens. In many cases, the reduction will be unintentional and needs to be fixed by introducing more and better tests. But if you identify the failure as a false positive, and you can prove to yourself and your team that the reduction is expected, then it is perfectly fine to reduce the limit.

Another way is to turn that reduction into a "soft" failure. For example, Jenkins can mark a build "unstable". In an unstable build, the final artifacts are still usable, but there might be some degradation of functionality. That gives you time to decide what to do - whether the tests need to be expanded or the limits lowered. This could also be implemented using two different builds: one with lighter checks and limits that produces the build artifacts (when it fails, not even the artifacts are usable), and a second with much stricter limits that fails if any of your quality metrics gets broken.

Euphoric
  • I might try to have a soft failure, although I believe it would cause confusion with the team, and require some fix anyway. – Jonathan Gross Nov 10 '19 at 10:20
1

In addition to the answer from Berin Loritsch, I would like to highlight a technique called Mutation Testing (see https://pitest.org/ for a Java tool).

Mutation testing reports which lines of code could be mutated to behave differently without causing any unit test to fail.

In your case, if the refactoring was done well, you should see the mutation testing metrics improve (when expressed as a percentage of the lines covered by unit tests).

This link describes a couple of (java-related) mutation testing tools: https://pitest.org/java_mutation_testing_systems/
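As a hedged illustration (a hypothetical class and JUnit 5 test, not PIT's own output): a test can execute every line and still assert nothing, and a mutation tool will flag the surviving mutants:

```java
// Production code under test.
class PriceCalculator {
    public double withTax(double net) {
        return net * 1.2; // a mutation tool may try net / 1.2, net * 1.0, etc.
    }
}

// A "coverage-only" test: it executes the line above (100% line coverage) but
// asserts nothing, so every mutant of withTax() survives.
class PriceCalculatorTest {
    @org.junit.jupiter.api.Test
    void coversButAssertsNothing() {
        new PriceCalculator().withTax(100.0);
    }

    // A test that actually kills those mutants:
    @org.junit.jupiter.api.Test
    void addsTwentyPercentTax() {
        org.junit.jupiter.api.Assertions.assertEquals(
                120.0, new PriceCalculator().withTax(100.0), 0.0001);
    }
}
```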

1

Before refactor: 95 untested lines.

After refactor: 90 untested lines.

You should always consider how many untested lines you have!

The code coverage percentage is a good measure of this only if your code always grows, which is unfortunate.

Another way to put it: whenever you manage to reduce the number of lines in your code without breaking any tests, you have every right to ignore the code coverage percentage!

Pedro A
1

Here's a solution that doesn't involve changing the definition of the code coverage metric or requesting / giving yourself an exemption to the policy. That said, the policy as stated is arguably flawed, and you might be better off counting the total number of uncovered lines or enforcing a per-file lower bound (e.g. 80%) instead of requiring monotonicity.

Improve the code coverage of Method 1 prior to or at the same time as refactoring Method 2.

Here's the current situation stated in the question.

                    Old World                   New World

           covered  total   %           covered  total   %
 Method 1       10    100  10                10    100  10
 Method 2       15     20  75                 5      5 100
Aggregate       25    120  20.8              15    105  14.3

In the New World, the total number of covered lines must be at least 22 for the metric to fail to decrease. (The ceiling of 105 * 25 / 120 is 22).

So, all you need to do is add tests that cover 7 more lines of the file where Method 1 and Method 2 live.
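A quick check of that arithmetic (plain Java, nothing beyond the numbers in the table above):

```java
// Minimum covered lines needed in the "New World" (105 total) to keep the
// old ratio of 25/120, and how many more lines must therefore be covered.
public class MinimumCovered {
    public static void main(String[] args) {
        int needed = (int) Math.ceil(105 * 25 / 120.0);
        System.out.println(needed);       // 22
        System.out.println(needed - 15);  // 7 more covered lines
    }
}
```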

This solution does have the drawback of including an unrelated change in the same commit in order to satisfy the code coverage monotonicity rule. It still might be worth doing if this one change to Method 2 is not important enough to justify changing the policy. It also might be worth doing if you want to get this change in first before you change the policy.

Greg Nisbet
  • I think the total number of uncovered lines would not be a good metric, as it wouldn't be possible to add any new uncovered code. Lower bounds (global, or per-file) will become useful as we cross some coverage threshold, and that's the reason for the current requirement for monotonicity. Specifically for a per-file lower bound, I couldn't find an easily integrated tool that allows this, and actually the issue could happen within a single file. – Jonathan Gross Nov 10 '19 at 19:08
  • @JonathanGross You might be able to write a script that parses the output of your coverage tool and enforces per-file lower bounds itself. The main thrust of my suggestion though is that you can live with the policy as currently stated by adding more tests to existing code at the same time as you refactor. – Greg Nisbet Nov 10 '19 at 19:29
0

The problem is: you have one huge function with very little code coverage, so in order to get the percentage up, you need to add covered lines of code. If you add 20 lines with 10 covered, that improves the percentage more than 5 lines with 5 covered - even though the second is most likely better (unless speed was important, and your first function was "if (easy case) { handle easy case quickly } else { handle complex case slowly }" and the second was "handle all cases slowly").

So we conclude: You have a measurable thing where improving the measurement doesn't improve the quality of your code. Which means that you should only use this as a target when common sense is applied at the same time. From a book about "motivation": The most important thing is to motivate people in such a way that following the motivation produces better results.

Imagine there's a function with a line "int x = 100;". To improve code coverage I change it to "int x = 0; x++; x++; (repeat 100 times)". Instead of one covered line I have 101 covered lines. Total code coverage goes up, and I get a bonus.

gnasher729
  • The example I gave is a reduction of the real situation, it would have been the same if I had 10 functions containing 10 lines, with only 1 line covered for each. – Jonathan Gross Nov 10 '19 at 10:28