Research on software defects

Question

There is a chapter in "Making Software: What really works and why we believe it" By Andy Oram and Greg Wilson about software defects, and what metrics can be used to predict them.

To summarize (from what I can remember), they explained that they used a C codebase that was open source and had a published defect tracking history. They used a variety of well known metrics to see what was best at predicting the presence of defects. The first metric that they started with was lines of code (minus comments), which showed a correlation to defects (.i.e. as LOC increase so do defects). They did the same for a variety of other metrics (don't remember what off the top of my head) and ultimately concluded that more complex metrics were not significantly better at predicting defects than simple LOC count.

It would be easy to infer from this that choosing a less verbose language (a dynamic language?) will result in less lines of code and thus less defects. But the research in "Making Software" did not discuss the effect of language choice on defects, or on the class of defects. For example, perhaps a java program can be rewritten in clojure (or scala, or groovy, or...) resulting in more than 10x LOC savings. And you might infer 10x less defects because of that.

But is it possible that the concise language, while less verbose, is more prone to programmer errors (relative to the more verbose language)? Or, that defects written in the less verbose language are 10x harder to find and fix? The research in "Making Software" was a fine start, but it left me wanting more. Is there anything published on this topic?

I'm skeptical of the notion that more terse languages = fewer defects. — JohnFx, Jan 21 '12 at 04:31
@JohnFX Definitely. Language features like [contracts](http://en.wikipedia.org/wiki/Design_by_contract) are extremely verbose, but they are effective at reducing code defects. — Rei Miyasaka, Jan 21 '12 at 05:27
@JohnFx: You can be skeptical. But facts, sadly, are facts. More lines == more defects. The question of "why?" is very important and cannot easily be answered given a simple correlation. — S.Lott, Jan 21 '12 at 17:33
@ReiMiyasaka, an observation is a fact. Clearly you aren't using the same definitions for those words I am. — Winston Ewert, Jan 21 '12 at 22:49
@WinstonEwert No, an observation can just *happen* to have been consistent for a long time and then be contradicted by a counterexample. It happens all the time, and there are often cases (such as this one) where there is reason to believe that it *will* happen. A fact is established to be *necessarily true*, and there is no reason to believe that it will ever be falsified. Small stars have never been observed to die, but that doesn't mean that they don't die. — Rei Miyasaka, Jan 21 '12 at 22:55
"A fact is established to be necessarily true"? Surprising. I'll discuss it with the lawyers tomorrow. They always treated facts as things observed by witnesses. — S.Lott, Jan 21 '12 at 22:57
@WinstonEwert It's a fact that programs with less LOC were measured to have less defects. It's *not* a fact that programs with less LOC have less defects. The former kind of "fact" is a historical fact; the latter is a logical fact. — Rei Miyasaka, Jan 21 '12 at 22:58
See: http://programmers.stackexchange.com/questions/10032/dynamically-vs-statically-typed-languages-studies, and http://programmers.stackexchange.com/questions/98669/are-there-any-empirical-studies-on-the-effect-of-different-languages-on-software. Actually, this question may be considered a duplicate of that last question. — Winston Ewert, Jan 21 '12 at 23:22
@ReiMiyasaka, I agree with what you are saying, I just don't like the way you say it. — Winston Ewert, Jan 21 '12 at 23:29
@ReiMiyasaka, I'm not offended by how you say it. I just think that your statement wasn't as clear as it could be. — Winston Ewert, Jan 21 '12 at 23:46
What I read from Kevins question, the observation was only made for C-Code, not comparing different languages. So it is questionable, whether results from measuring LOC in C can be transfered to using a different, less verbose language. — user unknown, Jan 22 '12 at 13:58
One other point, is that just because a programming language allows you to accomplish more in fewer lines, doesn't make it "terse" ! I find python to be much more expressive than C for example. Also this is probably somewhat related to another observation about software engineering such that the most bug-free code is the code you don't have to write. — Antonio2011a, Jan 22 '12 at 20:41

score 7 · Answer 1 · answered Jan 21 '12 at 20:33

The only interesting paper that I found was Comparing Observed Bug and Productivity Rates for Java and C++ (ACM membership required), published in 1999. The author used a modified Personal Software Process to gather productivity and defect information. The author found that a C++ program contained 2-3 times more defects than a Java program, C++ generated 15-30% more defects per line, and Java was 30-200% more productive in lines of code produced over time. Since only a single programmer was used, and this individual was experienced in C++, but only learning Java, it's not definitive and an experienced Java programmer would be more proficient. However, this study was notable because it provides a methodology for measuring productivity of a programming language.

I would like to point out a few possible issues though...

Lines of code is a shaky measurement, at best. It's easy to count using various tools, and these tools often have configurable rules so that you can get the optimal counts for your organization's coding style. However, a line of Python is not the same as a line of C - you can do a lot more in a line of Python than a line of C.

The metric they discuss in Making Software, defects/SLOC, is called defect density. It's a fairly common metric, but because it's based on SLOC, it suffers from the same problems as any SLOC-based metrics. You need to have a standardized method of counting and use this consistently. You also can't easily compare defect density across programming languages with any value.

Also, simply comparing programming languages isn't necessarily a fair comparison. You also need to compare available tools and resources. For example, using static analysis and code reviews are known to reduce defects. Different languages have varying support for static analysis, unit testing, and you might not have access to people proficient enough in the language to conduct a code review.

Ultimately, I think that talking about verbosity versus defects is the wrong thing to do. It's better to look at techniques and processes to reduce defects to an acceptable level for your product. That includes your choice of programming language, with regards to productivity given your team and supporting tools.

Only a single programmer? That's far from a statistically relevant study. — DeadMG, Jan 21 '12 at 21:01
@DeadMG The intent was more to demonstrate a methodology for determining productivity and defect tracking in a way that can be used to compare languages. It's a prototype of a methodology. Unfortunately, I couldn't find anyone who expanded upon this methodology in a statistically relevant study. — Thomas Owens, Jan 22 '12 at 04:00
> If we wish to count lines of code, we should not regard them as lines produced but as lines spent. - Edsger Dijkstra — Explosion Pills, Jan 22 '12 at 20:23

score 4 · Answer 2 · edited Jan 22 '12 at 21:34

4

There's interesting stuff out there. These are my favorites, as they include empirical data, and they don't always match Accepted Dogma about bugs:

Andy Ozment's Milk or Wine - It contradicts the Management Dogma that Every Piece of Software has an infinite number of bugs (i.e. fractal bugginess).
Famed Cambridge security researcher Ross Anderson wrote Open and Closed Systems are Equivalent
Eric Rescorla, Is Finding Security Holes a Good Idea?
Geoffrey Phipps, Comparing Observed Bug and Productivity Rates for Java and C++
David Atkins, Thomas Ball, Todd Graves, Audris Mockus, Using Version Control Data to Evaluate the Impact of Software Tools
Todd Graves, Alan Karr, J.S. Marron, Harvey Siy, Predicting Fault Incidence Using Software Change History

edited Jan 22 '12 at 21:34

Dynamic

5,746
9
45
73

answered Jan 21 '12 at 19:21

Bruce Ediger

3,535
16
16

1

+1 For linking to interesting papers. I found the one about Finding Security Holes interesting: it flies in the face of everything that is commonly believed today. – Andres F. Jan 22 '12 at 01:14
If I recall, Andy Ozment's paper is a reaction to Rescola's viewpoint, so if you liked Rescorla, make an effort to read "Milk or Wine" as well. – Bruce Ediger Jan 22 '12 at 21:51

score 3 · Answer 3 · answered Jan 21 '12 at 23:45

Some languages have shorter programs because the provide pieces of implementation for the programmer. If the coder doesn't have to implement something, they can't very well introduce defects into it. If I were to implement a hash-map in C, I'd probably make mistakes and defects. In python, I wouldn't because I already have a hashmap.

Other languages are shorter because they allow the programmer to write less text to get the same program. Compare C++:

for(int x = 0;x < 10;x++)
{
    std::cout << x << std::endl;
}

and Python:

for x in range(10):
    print x

You can see even on such a simple example that python takes way less characters then C++. But its not about the amount of characters you have to type, but rather the number of things you could have gotten wrong. The python simply has less possible mistakes to make then the C++ version.

All else being equal, I think more conciseness leads to less bugs because they are simply less things to get wrong. But all else may not be equal. The verboseness could be useful for other reasons and thus pay for itself.

+1 Well said! It's not about the characters you type, but the amount of stuff you can get wrong. — Andres F., Jan 22 '12 at 15:04

score 3 · Answer 4 · answered Jan 22 '12 at 00:37

It would be easy to infer from this that choosing a less verbose language (a dynamic language?) will result in less lines of code and thus less defects. But the research in "Making Software" did not discuss the effect of language choice on defects, or on the class of defects. For example, perhaps a java program can be rewritten in clojure (or scala, or groovy, or...) resulting in more than 10x LOC savings. And you might infer 10x less defects because of that.

no no no :-) There is no way to infer from the fact that more lines of C code correlate with more defects that a language that is more expressive will have less defects for the same problem. The first is a no brainer, for a given developer, given langauge, more lines of code will have more bugs.

But is it possible that the concise language, while less verbose, is more prone to programmer errors (relative to the more verbose language)? Or, that defects written in the less verbose language are 10x harder to find and fix? The research in "Making Software" was a fine start, but it left me wanting more. Is there anything published on this topic?

It's definitely possible. Maybe it's less readable? Maybe it has functions names and operators that are hard to remember and easy to confuse? Take C and rename printf() to p(), sin() to s(), cos to c(). Shorter code, right?

Studying this sort of stuff is extremely tricky. Think about all the variables you need to control for. Experience, size of project, application area, lanuage, enviornment, management. Maybe this works for numerical analysis but doesn't work for web development? Maybe you can show it for students but not for senior developers? Perhaps it's true for a developer sitting in an open space but not in an office? Maybe it's true for Americans but not for Japanese? Maybe it's true for open source but not for commercial developers?

There are numerous studies on the productivity of programming languages, e.g.:

Haskell vs. Ada vs. C++ vs. awk vs.... an experiment in software prototyping productivity

Back in the day when people were pushing for Ada there was a few studies showing how better it is in terms of defect rates and productivity. Ada is an example of a language designed towards the goal of eliminating defects.

This question reminds me of this blog post Hume, Causation & Science

My suggestion is to find out what works for you given your strenghts and application area. Don't look for studies to tell you what to do, a study can give you some good ideas but you need to understand the limitations of these studies.

Here's a personal observation: You can train yourself to lower your defect rate. I found that doing TopCoder competitions where you get immediate feedback for defects will train you pretty quickly to eliminate some common defects and think about corner cases (which is where a lot of defects are found). This sort of stuff will make an interesting study...

In terms of predicting defect rate I read a paper a while back on correlation between static analysis results and defects: The Effectiveness of Automated Static Analysis Tools for Fault Detection and Refactoring Prediction (tl;dr not exactly correlated but bad code tends to have more issues though the defects may not be identified).

+1 For the very interesting paper about Haskell. The researchers do mention Haskell's conciseness as one of its benefits, though they admit it's subjective. — Andres F., Jan 22 '12 at 15:24

score 2 · Answer 5 · edited May 23 '17 at 11:33

2

I haven't read Making Software, so I can't comment on the conclusions they have drawn in their book. Plotting Errors vs Lines Of Code can be misleading at best, because lines of code doesn't necessarily imply the degree of complexity of the logic contained within. Lines of Code is often more useful when determining time vs effort/cost, as a management tool at very roughly gauging a team's capacity for work, and even then it can be flawed.

When using metrics to measure code quality and the likelihood of error, I've found the following to be very useful:

The use of a good profiler can be of great assistance in providing useful measurements to help not only optimize code, but to get a feel for the likelihood that a significant number of bugs are likely to be inherent in the system. FWIW, I find the Redgate ANT profile to be one of the better .NET profilers.

Now back to your question. Software Metrics have been discussed since at least the early 70's to the best of my knowledge, although If I had to guess I'd say it's likely the topic was of interest even earlier than that.

...is it possible that the concise language, while less verbose, is more prone to programmer errors (relative to the more verbose language)? Or, that defects written in the less verbose language are 10x harder to find and fix?

I would say that not only is it entirely possible, it is also very likely, both for the brevity of the language, and also due to its inherent complexity. For example, Assembler is about as brief as a language can get syntactically, and requires very careful attention to detail in order to minimize bugs, yet in being more verbose you are given many more opportunities to confirm the state by setting breakpoints at virtually every state change. That said, it's usually harder to locate problems by eye alone in assembler as compared to a higher level language with well factored code.

If on the other hand by reduced verbosity you mean a high-level language which handles a number of common operations through a single command, then you need to assume that the command as a shortcut is itself without error, and that the purpose to which you've applied the command is also without error. Making the command syntax less verbose can also have the effect of introducing an element of obfuscation to the code depending on the manner in which the syntax is used.

Is there anything published on this topic?

The Wikipedia article on Software Metrics has some links to some papers, and so too do each of the metrics that I listed earlier. As to how well peer reviewed all of that information is I am not entirely certain, but yes, there appears to be a lot of published information out there on this very subject, which is quite likely the meat and potatoes of many an advanced CompSci thesis. ;)

edited May 23 '17 at 11:33

Community

1

answered Jan 21 '12 at 07:35

S.Robins

11,385
2
36
52

I think the attitude that "more is always worse" is a knee-jerk cultural overreaction to companies like IBM in the 80s judging programmer productivity by LOC written, to which Bill Gates famously responded that it's like measuring aircraft progress by weight. – Rei Miyasaka Jan 21 '12 at 21:53
1

I don't see that is really an answer to the question. It spends half its time dismissing LOC, without mentioning the critical point in the OP that nothing outperformed LOC significantly for predicting defect count. The actual question was only addressed with a single sentence. – Winston Ewert Jan 21 '12 at 23:16
@WinstonEwert With respect, I answered the OP's question which was asked as quoted in my response. I offered alternatives to LoC should the OP be interested to look into alternative metrics. I also did not *dismiss* LoC per se, but identified that it logically is not a direct correlation to the presumed outcome. For instance, is the design loosely or tightly coupled, or does it have a high degree or recursion? LoC isn't necessarily a direct cause for having a high number of bugs, but rather indicates a possibly indirect statistical significance if chance is involved... which it shouldn't be. – S.Robins Jan 22 '12 at 09:46
But you spent, as I said, only a single sentence answering the actual question. You don't back up the claims you make in that sentence with either argument or evidence. You've provide more argument for your position in your comment then you did in your actual answer. – Winston Ewert Jan 22 '12 at 15:51
As for chance being involved, its gotta be. We accidentally hit the wrong keys or make an incorrect logical inference and result with broken code. What is that if not chance? – Winston Ewert Jan 22 '12 at 15:53
@WinstonEwert For your benefit I will add more commentary to answer the OP's Question. As for my position, to which are you referring? If you are talking about LoC, that was not what the OPs question was about. As I said, I fully quoted the OP's actual questions in my answer. As for the issue of chance, failure to check the quality of your work isn't chance, it's neglect, and therefore very unprofessional. Using other more reliable metrics is a tool to help you double-check the quality of your work. – S.Robins Jan 22 '12 at 20:43
@S.Robins, agreed that LoC isn't what the OPs question is about. That's why I objected that your answer spent most of its text on it. I think you answer is much better now that you've expanded your answer to his actual question. I've removed my downvote. – Winston Ewert Jan 22 '12 at 21:31
I agree with your point that concise languages can lead to obfuscation which would make it much harder to code correctly. However, I'm confused by your description of assembler as brief. I would have thought of assembly as being the opposite of brief because it takes many lines to do what takes only a few in higher level languages. – Winston Ewert Jan 22 '12 at 21:37
@WinstonEwert LOL. That's what I get for thinking ahead to what I WILL write, instead of concentrating on what I'm writing at the moment. I've effectively left an incomplete sentence. I will correct. – S.Robins Jan 22 '12 at 22:05
Just as a note, I believe that some of the metrics you mention are the ones that the "Making Software" research called out as being no better than LoC (particularly Cyclomatic Complexity rings familiar, but I'll check and post back). – Kevin Jan 24 '12 at 16:18
Cyclometric Complexity doesn't really provide a predictive measure. What it does is highlights potential problem areas in the code where you might find problems which are easily hidden and yet difficult to decode. Such areas might need to be complex by design, or might benefit from refactoring to something simpler and potentially easier to understand. This is probably why studies show wildly varying results, because predictive analyses of code metrics don't zero in on specifically complex areas of code. Using metrics to find difficult to test/debug code is usually much more useful IMHO. – S.Robins Jan 25 '12 at 00:42

S.Lott · Answer 6 · 2012-01-23T20:12:03.333

It would be easy to infer from this that choosing a less verbose language (a dynamic language?) will result in less lines of code and thus less defects.

Easy and as close to correct as you can get with that dataset. Lines of code are where the errors must be. Less code is less places to harbor errors.

But the research in "Making Software" did not discuss the effect of language choice on defects, or on the class of defects.

Actually. It did. It gave you the facts. Fewer lines of code means fewer errors.

For example, perhaps a java program can be rewritten in clojure (or scala, or groovy, or...) resulting in more than 10x LOC savings. And you might infer 10x less defects because of that.

That would not be correct. They said it correlated. They did not say that the relationship is linear.

There are errors outside the code. Specification errors. Testing errors. There are non-code defects as well as code defects.

But is it possible that the concise language, while less verbose, is more prone to programmer errors (relative to the more verbose language)?

No. Concise languages have fewer errors. It really is that simple.

The problem is that super-concise languages can be difficult to write. So the cost goes up even though the number of errors goes down. Maintenance can be very difficult. Languages like APL or I are often write-once languages. The code is so dense that it's hard to make a change; it's often easier to begin again from the beginning and write all new programs.

The other problem is that non-code errors start to dominate after a while. A good, concise language -- particularly a DSL that's really a good fit to the problem domain -- reveals that there are numerous other sources of errors.

To avoid a lot of nuanced (and pointless) discussion.

The study quoted is obviously incomplete.
If you have a more nuanced explanation, please conduct your own study and publish your own results so that we can discuss them.
Please focus on question, if possible. If you want to introduce new evidence, either open a new question or comment on the question itself to introduce new evidence.

Re: "Easy and correct...". No, because correlation does not imply causality. No real nuance here, it's just wrong to say there is enough evidence to reach that conclusion. Auto accidents are correlated with time spent in the car, but driving the fastest possible doesn't reduce accidents. Is this a fair analogy? - we don't know because the research in "Making Software" doesn't allow us to tell. But the analogy *is* a counterexample to your general line of reasoning. — psr, Jan 23 '12 at 18:23
@psr: Actually, that is nuanced. The question is simply about correlation, and lots of evidence backs up this correlation as having some meaning. And. Many folks have ways to work around the correlation. Indeed, all the objections to the simple reasoning involve specific techniques to break the correlation (i.e., better QA, better project management, etc.) There are innumerable degrees of freedom. Introducing more yet more variables is what I'm called "nuanced". — S.Lott, Jan 23 '12 at 19:03

Rei Miyasaka · Answer 7 · 2012-01-22T19:41:31.287

Edit: I've rearranged my answer because I want to emphasize a different point.

Verbosity can sometimes help to make defects easier to detect and fix. Consider the quadratic formula (positive root only for simplicity's sake):

float quad(float a, float b, float c)
{
   return (-b + sqrt(b*b - 4*a*c)) / (2*a);
}

And this equivalent implementation:

float quad(float a, float b, float c)
{
    float discriminant = b*b - 4*a*c;
    float numerator = -b + sqrt(discriminant);
    float denominator = 2*a;
    float root = numerator/denominator;
    return root;
}

The latter, while somewhat extreme, is a common style that makes stepping in the debugger easier, and also reduces the risk of mis-bracketing -- a common mistake. Compound expressions, while obviously a common feature in high-level languages, aren't ubiquitous, and they do often make bugs harder to find.

It's perfectly conceivable that a math library written by someone who prefers the former style might be 3KLOC, and a math library written by someone who prefers the latter might be 10KLOC -- and that the latter has less defects.

The research in "Making Software" was a fine start, but it left me wanting more. Is there anything published on this topic?

I don't think sheer LOC is a good estimate of the number of bugs in many (most) languages, even if it happens to be an effective measure in C. Constraints will often increase LOC while reducing defects. There are many kinds of constraints, including many which are language features not available in C:

Proper-English naming conventions
Unit tests/TDD
Runtime assertions
Type-safety
Pre/post-conditions
Invariants
Units of measure (arguably a type of type-safety)

But according to this study, TDD can nearly double LOC but reduce bugs. (I don't believe that this is a result of TDD so much as it is a result of unit testing, but that's another story.)

No, TDD is not a language feature, and no, I don't think unit test code should count towards LOC directly, but it is very much another form of constraint, and it shows that constraints, even if they increase code, work to reduce bugs.

Also, if for instance all languages supported units of measure except one language, all else being equal, that language would be more concise, and programs in that language would have more bugs, including ones that can crash spaceships.

I agree that unit tests do not or should not count in LOC measures. — Kevin, Jan 22 '12 at 08:18
I agree with your assessment of the importance of contracts/constraints checking, but aren't they a kind of programming language "meta-feature", and not what we're talking about here? — Andres F., Jan 22 '12 at 15:18
@AndresF. I think so, it's technically a different class of code, but I'm certain that things like type definitions (which wouldn't exist in dynamic languages, and which in the DbC community are considered to be a type of contract) weren't exempt from the LOC count in the study. I also recognize that that's different from the matter of conceptual density in terse languages leading to cognitive difficulty, but the question seems to ask about both. — Rei Miyasaka, Jan 22 '12 at 19:10
you make a good point about coding style affecting lines of code. In the "Making software" study, they looked at an open source project that was considered high quality and was already complete. Then they did the study from bug tracking and version control commits. But I can imagine if the participants in a study knew what was being tracked they could create bias towards conciseness. — Kevin, Jan 25 '12 at 17:28

Research on software defects

7 Answers7