58

Since we are becoming more and more reliant on computing, including for very critical day-to-day tasks, I was just wondering how those vital components are tested.

More technically, how are the compilers and assemblers tested? (I suppose this relates to the halting problem!!)

Peter Mortensen
  • 1,050
  • 2
  • 12
  • 14
Sudip Bhandari
  • 1,137
  • 1
  • 9
  • 14
  • 36
    You might want to start your research with the "Ken Thompson Hack". See [Reflections on Trusting Trust](http://www.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf) – Bryan Oakley Dec 29 '15 at 15:59
  • 7
    Here is an example of a compiler for which there is a correctness proof: http://compcert.inria.fr/doc/index.html – Giorgio Dec 29 '15 at 16:26
  • 8
    Most compilers/linkers/assemblers are tested most deeply by being used a lot, in a lot of different circumstances. For finding errors, nothing beats having millions of users use your compiler. – Bart van Ingen Schenau Dec 29 '15 at 17:57
  • 3
    and add the Operating System to the list, as well. – Erik Eidt Dec 29 '15 at 17:57
  • Here's an example showing flawed hardware and compiler design http://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-count-variable-with-64-bit-introduces-crazy-performance?lq=1 In short, a little-used instruction has a false dependency that slows down reads. This is a hardware problem, and the compiler in question doesn't even know about it. – user2023861 Dec 29 '15 at 20:06
  • 1
    Many comments and answers only mention testing but formal verification is also used. Some languages and compilers are designed with formal verification in mind (see e.g. https://en.wikipedia.org/wiki/SPARK_%28programming_language%29). The difference is that testing verifies that certain properties of the software hold for some specific input data, whereas formal methods prove that certain properties hold for any possible input data. See also https://en.wikipedia.org/wiki/Formal_methods – Giorgio Dec 30 '15 at 00:43
  • 4
    Related: [How come compilers are so reliable?](http://programmers.stackexchange.com/questions/51966/how-come-compilers-are-so-reliable) and [Can compilers and interpreters have bugs, and what can we (as users) do to deal with them?](http://programmers.stackexchange.com/questions/204075/) –  Dec 30 '15 at 03:15
  • And how can we trust the CPUs? – Thorbjørn Ravn Andersen Dec 30 '15 at 08:32
  • 1
    Also check out formal verification as one of the methods for building more trust: https://en.wikipedia.org/wiki/Formal_verification – Hurda Dec 30 '15 at 10:41
  • 1
    @ThorbjørnRavnAndersen Sometimes those have bugs too... https://en.wikipedia.org/wiki/Pentium_FDIV_bug – J... Dec 30 '15 at 12:01
  • 1
    Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackexchange.com/rooms/33671/discussion-on-question-by-sudip-bhandari-how-can-we-be-certain-that-the-lower-co). – yannis Dec 31 '15 at 07:42
  • @ThorbjørnRavnAndersen there is hardware developed using formal methods too. Though in the end, when you build the CPU, even if the design is formally correct you may still introduce a physical problem that causes hardware errors in certain circumstances. In that case all bets are off. If you really need to avoid that situation, the only way is simple design and manufacture, to reduce the chance of bugs, and redundancy, so that one hardware error isn't enough to produce a problem. This is done in aircraft, for example. – Bakuriu Dec 31 '15 at 14:32
  • 1
    Except in exceptional circumstances where relatively small pieces of work have been intensively examined and subjected to some sort of "proof" of correctness, you can't assume that any work is "flawless". And, in fact, for any work of substantial size one can reasonably assume that it *does* contain flaws, at a rate that likely can be predicted by one who studies the statistics. I have seen estimates in the range of approximately one bug per hundred lines of code, for "hum-drum" code, and I doubt that many projects have fewer than one bug per 10,000 lines. – Daniel R Hicks Dec 31 '15 at 22:27
  • This question, of course, recurses down to the hardware level. The basic answer is a combination of good design tools, good engineering practices, modularity and levels of abstraction (prove a smaller piece is correct, and then you just have to prove the next layer up uses it correctly, and so on). The fact that consumer software development is inexcusably sloppy doesn't mean it can't be done right, it just means consumers have demonstrated that they will buy slightly-less-reliable-and-much-cheaper. – keshlam Jan 01 '16 at 00:30
  • I'm surprised nobody mentioned [CompCert](http://compcert.inria.fr/) yet. It's an attempt at making a _certified compiler_, one accompanied with a huge Coq proof of correctness for substantial portions of the translation from C to assembler. – Iwillnotexist Idonotexist Jan 01 '16 at 03:51
  • @IwillnotexistIdonotexist actually the second comment on the question mentions it... – Bakuriu Jan 01 '16 at 16:40
  • @Bakuriu I _completely_ missed that one! – Iwillnotexist Idonotexist Jan 01 '16 at 18:18
  • 2
    To OP and others who are interested in the hardware side of this question, this recent talk will be interesting: https://youtu.be/eDmv0sDB1Ak – Doddy Jan 02 '16 at 13:40
  • Well. How can we trust a formal verification? – Vlad Jan 04 '16 at 20:59

8 Answers

105

You can't be certain; you just assume they are correct until you discover they are not. There have been plenty of bugs in compilers and hardware over the years.

The way these are tested (a compiler, for example) is that they are very narrowly and rigidly defined, carefully written, and then tested with an enormous test suite to verify correctness. Add to that the wide user base of a compiler, and more bugs will be detected and reported. A dentist appointment scheduling app, by contrast, has many fewer users, and fewer still who are capable of detecting defects.
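
To make the idea concrete, here is a minimal differential-testing sketch in Python: compile one program at two optimization levels and check that the binaries agree. This is only an illustration of the principle; it assumes `gcc` is installed, and the tiny test program is a made-up stand-in for the millions of cases a real suite runs.

```python
# Minimal differential-testing sketch: build the same program at two
# optimization levels and check that the binaries agree on their output.
# Real compiler test suites run millions of cases like this one.
import os
import subprocess
import tempfile

C_SOURCE = r"""
#include <stdio.h>
int main(void) {
    int acc = 0;
    for (int i = 1; i <= 10; i++) acc += i * i;
    printf("%d\n", acc);   /* sum of squares 1..10, expected: 385 */
    return 0;
}
"""

def compile_and_run(opt_flag):
    with tempfile.TemporaryDirectory() as d:
        src, exe = os.path.join(d, "t.c"), os.path.join(d, "t")
        with open(src, "w") as f:
            f.write(C_SOURCE)
        subprocess.run(["gcc", opt_flag, src, "-o", exe], check=True)
        return subprocess.run([exe], capture_output=True, text=True).stdout

# The optimizer must not change observable behavior.
assert compile_and_run("-O0") == compile_and_run("-O2") == "385\n"
```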

SQLite consists of about 73k lines of code, while its test suite consists of about 91,378k lines of code (roughly 91 million), more than 1250 times the size of SQLite itself. I expect compilers and other core tools to have similar ratios. Processors today are designed essentially with software, using hardware description languages like Verilog or VHDL, and those designs are exercised by software tests as well, along with specialized IO pins for running self-tests at the point of manufacture.

Ultimately it's a probability game, and repeated, broad-coverage testing lets you push the probability of defects down to an acceptably low level, just as with any other software project.

whatsisname
  • 27,463
  • 14
  • 73
  • 93
  • 7
    I have often wondered the same question as the OP, but in regard to DBMS's. You gave a great example that answered it within the context of SQLite. Thank you! – Brandon Dec 30 '15 at 02:22
  • 8
    +1 but somehow I doubt that "compilers and other core tools have similar ratios". – user541686 Dec 30 '15 at 11:27
  • 5
    Note that (1) SQLite has actually two test suites, with non-trivial redundancy between the two and (2) there are still bugs found in SQLite despite this. – Matthieu M. Dec 30 '15 at 12:29
  • You should consider adding a link to the sqlite testing page. – Bryan Oakley Dec 30 '15 at 20:15
  • 7
    I've been under the impression that SQLite is one of the most "extensively tested" pieces of software (in terms of lines of testing code/lines of operational code) available for general use, more so even than a lot of compilers. If nothing else, a full-featured compiler is an enormous piece of software, and I can't imagine it having a thousand times that amount of test code. (GCC is reportedly up to 14.5 million lines. It seems unlikely that either the compiler collection proper is only 14k LOC, or that they have a 14-billion-line test codebase sitting around on the side! :-P) – David Z Dec 31 '15 at 10:18
  • 2
    @DavidZ: It is, it's certainly the only project I know to use extensive OOM testing for example (they use a fault injector for the test and replay them again and again failing at the 1st, then 2nd allocation... until the whole test is executed). – Matthieu M. Dec 31 '15 at 12:19
  • -1 for completely ignoring [formal methods](https://en.wikipedia.org/wiki/Formal_methods)... – recursion.ninja Jan 01 '16 at 19:40
  • @recursion.ninja: formal methods only ensure that your implementation and specification match; they will do nothing to save you when your implementation is correctly doing the Wrong Thing™. – whatsisname Jan 01 '16 at 19:44
  • @whatsisname That sounds like it would be an excellent addition to your answer. You might also want to add that formal methods can be circular in nature, as when an automated theorem prover proves a compiler to be correct after being compiled by that compiler... – recursion.ninja Jan 01 '16 at 19:54
  • 2
    @recursion.ninja So wait, you downvoted an answer because it didn't say something about formal methods, despite formal methods ironically being completely unproven in practice and highly impractical for realistically-sized programs? Seems a bit silly. – Miles Rout Jan 26 '16 at 21:05
46

In layman's terms:

  1. You cannot.
  2. Compilers and interpreters are unit-tested as any other (professional) software.
  3. A successful test doesn't mean a program is bug-free; it only means no bugs were detected.
  4. A wide user base using the compiler over a long time is a pretty good indicator that it has very few bugs, because users usually test cases the designers didn't think of.
  5. Being open source is also a good indicator. "Given enough eyeballs, all bugs are shallow... Given a large enough beta-tester and co-developer base, almost every problem will be characterized quickly and the fix will be obvious to someone." A closed-source compiler could have bugs that arise at very specific times or that generate less-than-optimal machine code, and the company behind it could simply not disclose their existence and give them a very low priority in the product's road map.

Bottom-line:

I'd say go for OOP (Old, Open and Popular). I just made up that acronym.

Tulains Córdova
  • 39,201
  • 12
  • 97
  • 154
  • 19
    **+1** For inventing another TLA (three letter acronym) - the world doesn't have enough of those, yet. – s1lv3r Dec 29 '15 at 20:56
  • 2
    @s1lv3r: Don't you mean NEY (not enough yet)? – Williham Totland Dec 29 '15 at 21:57
  • 34
    Also, OOP had no meaning in computer programming yet. So KTT (kudos to thee)! – Pierre Arlaud Dec 29 '15 at 21:58
  • 4
    Congratulations, you have created the programming equivalent of gaming's FPS (First Person Shooter / Frames Per Second / Feet Per Second (games with bullet travel)) – Cyoce Dec 30 '15 at 02:23
  • @Pierre are you ignoring Object Oriented Programming intentionally? – Dan Oberlam Dec 30 '15 at 03:10
  • 15
    Pierre's comment is a joke @Dannnno. – yannis Dec 30 '15 at 04:04
  • 2
    @Yannis this is why I hate the internet, I can never pick up when someone is kidding – Dan Oberlam Dec 30 '15 at 04:05
  • 19
    Alternatively, it could be **P**opular, **O**ld, and **O**pen. ;) Actually, that's how I'd rank them in order of importance. – jpmc26 Dec 30 '15 at 05:01
  • 23
    @jpmc26 I'd go with Practical, Old, Open and Popular. As for acronym... – StupidOne Dec 30 '15 at 07:22
  • 5
    (5) was demonstrated not to amount to much by Heartbleed. Just because something is open does not mean that people look into it! – Matthieu M. Dec 30 '15 at 12:31
  • @MatthieuM. That's covered by (1). ;) – Tulains Córdova Dec 30 '15 at 12:33
  • 1
    @user61852: Well (1) covers everything, but I wanted to specifically address the "Given enough eyeballs, all bugs are shallow." trope. Before Heartbleed it was often cited as a hallmark of Open Source, and somehow people assumed it was obvious *others* had looked for bugs; Heartbleed made it clear that you had to check whether others were actually looking... – Matthieu M. Dec 30 '15 at 14:03
  • The principle of more eyes would seemingly extend to hardware. A CPU gets used by more people and more applications than a particular application or compiler, hence greater chance of finding its flaws. One might think. – WGroleau Dec 31 '15 at 05:41
  • 1
    #5 - Being open source is not a good indicator - I refer to the HeartBleed bug – PandaWood Dec 31 '15 at 05:44
  • 6
    @MatthieuM.: well, Heartbleed _was_ eventually discovered. Had it been ClosedSSL, who knows whether it would have ever been properly fixed at all, or just suppressed (perhaps even left as a convenient backdoor)? Granted, it _is_ embarrassing that the bug could go unnoticed for so long (indeed, that the commit which introduced it was ever accepted in the first place), but that has a lot to do with the general bloatedness of OpenSSL. Arguably, disasters like this would also be _much_ less likely if more concise and type-safe languages were used, rather than C. – leftaroundabout Dec 31 '15 at 16:52
  • @leftaroundabout You mean like Rust? :D – Blacklight Shining Jan 01 '16 at 19:16
  • 1
    @BlacklightShining: Rust, Haskell, O'Caml, perhaps Idris... but even F# or Scala would make control easier, or, heck, Java. – leftaroundabout Jan 01 '16 at 19:20
  • Unit testing doesn't mean no bugs detected - it means no bugs detected that were deemed important enough to stop shipping a product. – gnasher729 Jan 04 '16 at 10:10
  • @MatthieuM. That's an absurd thing to say. It was discovered because it's open source *and* there's no guarantee it would have been found if it were closed source *and* nobody said "open source prevents all bugs". Who knows how many more bugs it would have had if it were closed. – Miles Rout Jan 26 '16 at 21:07
  • @MilesRout: My point is that Open Source is certainly a necessary condition, but that it is insufficient in itself. – Matthieu M. Jan 27 '16 at 13:21
24

It's turtles all the way down.

Nothing is certain. You have no choice but to settle on confidence ratings.

You can think of it as a stack: Math > Physics > Hardware > Firmware > Operating System > Assembler/Compiler/etc

At each level you have tests that you can perform to improve your confidence ratings. Some of these tests have the quality of formal proofs, some of them are based on observation, most are a combination of both.

The tricky part is unravelling the recursion in some of these tests, because we now use programs to do proofs and observational analysis in places where it has become too difficult to do them by hand.

Ultimately, though, the answer is that you try everything you can think of: static analysis, fuzzing, simulation, running with purposefully selected extreme or random inputs, running/mapping every control path, formal proofs, etc. Basically, your aim in testing should always be to do everything possible to prove that your product (e.g. theory/chip/program) doesn't work as intended. If you make a genuine effort and still fail, you're allowed to improve your confidence rating in your product's correctness.
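
As an illustration, the fuzzing idea can be sketched in a few lines of Python. This is deliberately crude; real fuzzers such as AFL or Csmith add coverage feedback and structured input generation, and `./parser-under-test` below is a hypothetical stand-in for whatever tool is being tested.

```python
# Crude fuzzing sketch: feed random byte strings to a target program and
# flag any input that makes it crash.
import random
import subprocess

def fuzz_once(target_cmd, max_len=256):
    """Feed one random byte string to the target and report crashes."""
    data = bytes(random.randrange(256) for _ in range(random.randrange(1, max_len)))
    proc = subprocess.run(target_cmd, input=data, capture_output=True)
    # On POSIX a negative return code means the process died from a signal
    # (e.g. SIGSEGV) -- exactly the kind of defect fuzzing hunts for.
    if proc.returncode < 0:
        print("crash! input was:", data)
    return proc.returncode

# Hypothetical target binary; substitute any program that reads stdin.
for _ in range(1000):
    fuzz_once(["./parser-under-test"])
```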

Testing is at best a semi-decision procedure: given that there is a bug, you will eventually find it, but you can never be sure that you've found them all. Even with formally verified software you're still relying on physics, on the tools used to do the formal proofs, and on the property you proved being necessary and sufficient for your program to do what is (often subjectively) "intended". That's not to mention all the other components you're using that don't have formal proofs.

voutasaurus
  • 349
  • 1
  • 5
17

This is a "dangerous" question for new developers, in that they'll start blaming their tools instead of their code (been there, done that, seen too many do it). Although there are bugs in compilers, runtime environments, OSes, etc., developers should be realistic and remember that, until there is evidence and unit tests demonstrating otherwise, the bug is in their own code.

In 25+ years of programming in mostly C, C++, and Java I have found:

  • two bugs due to a compiler bug (gcc and SunOS C)
  • about once every year or two, a bug due to a JVM problem (usually related to memory consumption/garbage collection)
  • about once every month or two a bug in a library, which frequently is fixed by using the latest version or reverting to the prior version of the library

All of the other bugs are directly related to a bug in my own code or, more frequently, to a lack of understanding of how a library works. Sometimes what seems to be a bug is due to an incompatibility, for instance when the Java class structure changed in a way that broke some AOP libraries.

Ed Griebel
  • 329
  • 1
  • 8
  • I'm curious -- which years for which languages? Back in the EGCS days before C++ was properly standardized, compiler bugs weren't so hard to find... – Charles Duffy Dec 30 '15 at 20:56
  • 3
    The more obscure the compiler, CPU, or language, the easier it is to find a bug in the compiler (before someone else does), so finding 2 in GCC C is nice :) – Surt Dec 31 '15 at 02:21
  • 1
    As it happens, I just wasted about a month presuming the problem I was having was in my gdb scripts, or my understanding of what I was examining. Eventually I got suspicious, simplified my test case, and found a design flaw in a library (libkvm), which made a kernel debugger unable to access certain addresses from a core dump. I.e. YMMV - and I'm at my happiest when I find a new bug in code upstream of me, especially something I'm using rather than developing. – Arlie Stephens Jan 08 '16 at 01:22
  • Of course it wasn't a _compiler_ bug, or even one of the more commonly used libraries. And truth to tell, I don't find bugs in those with any frequency at all. – Arlie Stephens Jan 08 '16 at 01:24
  • @ArlieStephens There's a lesson there: simplifying your test case is something you should do early on when you're failing at finding a problem. Regardless of whether the problem is yours or the other code's, that will help you narrow it down. Often, if the problem is in the other code, this will result in, "evidence and unit tests demonstrating" so. – jpmc26 Jan 19 '16 at 05:17
  • @jpmc26 as it happens, I was just trying to use the kernel debugger on my system to view page table entries. The last thing I expected was to find that it was unable to read certain kernel virtual addresses, thanks to a deficiency in the underlying library. I figured I was misunderstanding the kernel code and/or hardware formats - it got easy once I began to suspect a broken tool. But as folks here have pointed out, broken tools are less common than breakage in your own code, particularly when you are doing something new (to you). – Arlie Stephens Jan 19 '16 at 17:05
8

I think an interesting point here is that the vast majority of commercial software licences (and indeed open-source licences) specifically state that you cannot trust the software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.

From the Microsoft Word licence agreement:

Except for the Limited Warranty and to the maximum extent permitted by applicable law, Microsoft and its suppliers provide the Software and support services (if any) AS IS AND WITH ALL FAULTS, and hereby disclaim all other warranties and conditions, whether express, implied or statutory, including, but not limited to, any (if any) implied warranties, duties or conditions of merchantability, of fitness for a particular purpose, of reliability or availability, of accuracy or completeness of responses, of results, of workmanlike effort, of lack of viruses, and of lack of negligence, all with regard to the Software, and the provision of or failure to provide support or other services, information, software, and related content through the Software or otherwise arising out of the use of the Software.

In essence, this sentence in the licence of almost every piece of software you use specifically tells you that you cannot trust the software, let alone the compiler used.

Software is like a scientific theory: it is deemed to work as specified until it doesn't.

Toby Allen
  • 397
  • 1
  • 11
  • +1 for pointing out that the very licenses state that no software is perfect. – Tulains Córdova Dec 30 '15 at 11:38
  • 3
    I was gratified to note a deviation from this practice in IBM's ViaVoice for Mac. Instead of the customary "if it doesn't work, too bad" they actually said something like "The software is warranted to perform as specified." – WGroleau Dec 31 '15 at 05:36
  • 1
    A plain-language translation of this particular bit of warranty phrasing is, "This might be a piece of software, or it might be a piece of sh*t. It might work. It might not. Even if it does work it might not do what you want. Oh-by-the-way, we might have stolen bits of it from somebody else. Too bad. We've got your money and have used it to hire lots of lawyers. GAME! ON! Nyah-nyah-nyah-nyah-nyaaah-naah!". :-) – Bob Jarvis - Слава Україні Dec 31 '15 at 18:36
  • 2
    @BobJarvis: My favourite warranty statement, used on some open source software (like nmap IIRC), is "If it breaks, you get to keep both pieces". – Peter Cordes Jan 01 '16 at 14:05
  • This statement is ubiquitous in open source software, and lots of no-cost closed source software. It does not appear in most commercial paid software licenses. – jwg Jan 03 '16 at 13:55
  • @jwg Actually you will find a very similar statement in the Microsoft Word licence text; see my updated answer. – Toby Allen Jan 03 '16 at 17:26
2

As a compiler writer for a math language*, I can say from experience that in theory you cannot. Some of the bugs just give wrong results, like (from my shame list) evaluating 6/3*2 from the right as 6/(3*2) and outputting 1 instead of 4, without crashing or producing nonsensical compile errors.

But IMHO many compilers do not have as many bugs as other software because:

  • Writing unit tests is easy. Each statement is a unit, and you can write tests as simple as `test_unit("2+(-2)*(-2+1)*3+1", 9);` (see the sketch after this list).
  • A program is a combination of statements, and for a program to output the correct result each individual statement must (mostly) give the correct result. So it is very unlikely for bugs to remain hidden while programs keep giving correct results.
  • As the size and number of written programs increase the likelihood of catching bugs dramatically increases.
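
A sketch of that style of test in Python: `compile_and_eval` is a hypothetical hook into the compiler under test, stubbed here with Python's own evaluator (the sample expressions happen to be valid Python) so the sketch runs end to end.

```python
# Statement-level unit test in the spirit of the ones described above.
# `compile_and_eval` is a hypothetical hook into the math-language compiler;
# here it is stubbed with Python's own evaluator as a reference oracle.
def compile_and_eval(expression: str) -> float:
    return eval(expression)  # stand-in for the real compile-and-execute path

def test_unit(expression: str, expected: float) -> None:
    got = compile_and_eval(expression)
    assert got == expected, f"{expression}: got {got}, expected {expected}"

test_unit("2+(-2)*(-2+1)*3+1", 9)
test_unit("6/3*2", 4)  # would have caught the right-associativity bug above
```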

For assemblers, machine instructions, etc., the above also holds; verification and validation in chip design and production, on the other hand, follow much stricter processes, since it is a huge business: Electronic design automation.

Before going to production, each CPU design should be tested rigorously, because each bug costs nearly a couple of million dollars: there are huge non-recurring costs in chip production. So companies spend a lot of money and write a lot of simulation code for their designs before going to production, although this does not give a 100% guarantee - the Pentium FDIV bug, for example.

In short, it is very unlikely for there to be serious bugs in compilers, machine code, etc.

* My humble math language

psmears
  • 188
  • 4
Gorkem
  • 137
  • 4
  • Intel tests the hell out of their CPUs by running sequences of random instructions and comparing with a software model, among other things: http://tweakers.net/reviews/740/4/chip-magicians-at-work-patching-at-45nm-logical-validation.html. This is why you often see really obscure errata published, for some really unlikely combination of instructions in an unusual mode. – Peter Cordes Jan 01 '16 at 14:12
0

Flawless? They're not. I recently installed some "updates", and it was months (and several reprogrammed sections of code) later before my ASP.NET site was working properly again, due to unexplained changes in how various basic things worked or failed.

However, they are tested and then used by many very smart, detail-oriented people, who tend to notice, report, and fix most things. Stack Exchange is a great example of (and an improvement on) how all the people using those tools help test and analyze how these amazingly complex and low-level tools work, at least as far as practical use goes.

But flawless, no. Although you can also see people on Stack Exchange gaining impressive insight into performance details and standards compliance and quirks, there are always flaws and imperfections, especially when different people have different opinions of what a flaw is.

Peter Mortensen
  • 1,050
  • 2
  • 12
  • 14
Dronz
  • 117
  • 5
-1

To show that the underlying systems are flawless, you must either:

a) Prove that they are flawless

  1. Mathematical proof
  2. Only realistically possible for trivial programs

b) Make an exhaustive test

  1. Only possible for trivial programs and some simple programs
  2. As soon as a timing element enters the test, an exhaustive test becomes impossible, as time can be divided indefinitely.
  3. Beyond trivial programs, the number of possible execution paths explodes exponentially.

In software testing, exhaustive testing is only used in unit testing of some simple functions.

Example: you want to test an 8-character UTF-8 input to some field. You choose to cut the input at 8 times the maximum UTF-8 character length of 6 bytes, which gives 8*6 = 48 bytes, so that you actually have a finite number of possibilities.

You might now think you only need to test the 1,112,064 valid code points for each of the 8 characters, i.e. 1,112,064^8 (roughly 10^48) tests (which is already unlikely to be possible), but you actually have to test every value of each of the 48 bytes, or 256^48, which is around 10^116 - in the same league as the Shannon number for the complexity of chess (~10^120) and far beyond the roughly 10^80 atoms in the observable universe.
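
Those magnitudes are easy to verify with a few lines of Python:

```python
# Back-of-the-envelope check of the search-space sizes quoted above.
from math import log10

code_point_space = 1_112_064 ** 8   # 8 characters, valid code points only
byte_level_space = 256 ** 48        # every value of each of the 48 bytes

print(f"code-point space ~ 10^{log10(code_point_space):.0f}")  # ~ 10^48
print(f"byte-level space ~ 10^{log10(byte_level_space):.0f}")  # ~ 10^116
```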

Instead you can use the following, in increasing order of effort (each level should also cover all the previous ones):

a) test a good and a bad sample.

b) code coverage, i.e. try to execute every line of code, which is relatively simple for most code. Now you can wonder why there is a last 1% of the code you can't reach ... bugs, dead code, hardware exceptions, etc.

c) path coverage: all outcomes of all branches in all combinations are tested. Now you know why the test department hates you when your functions contain more than 10 conditions. Also you may wonder why the last 1% can't be tested ... some branches are dependent on the previous branches.

d) data tests: test a number of samples with border values, common problematic values, and magic numbers: zero, -1, 1, min ±1, max ±1, 42, random values (a sketch follows below). If this doesn't give you path coverage, you know you haven't caught all the relevant values in your analysis.
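
Here is a minimal Python sketch of such a data test; `handle` is a stand-in for whatever function is actually under test.

```python
# Sketch of a data test (point d): probe the function at border and
# "magic" values plus a few random ones, instead of exhaustively.
import random

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def handle(value: int) -> None:
    # Stand-in for the real function under test.
    assert INT32_MIN <= value <= INT32_MAX

BORDER_VALUES = [
    0, -1, 1, 42,                # common problematic / magic numbers
    INT32_MIN, INT32_MIN + 1,    # min and min+1
    INT32_MAX - 1, INT32_MAX,    # max-1 and max
]

for v in BORDER_VALUES + [random.randint(INT32_MIN, INT32_MAX) for _ in range(5)]:
    handle(v)  # should not crash or misbehave for any of these
```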

If you already do all this, you should be ready for the ISTQB Foundation exam.

Surt
  • 123
  • 3