64

The C/C++ specifications leave a large number of behaviors open for compilers to implement in their own way. A number of questions keep getting asked here about this, and we have some excellent posts about it.

My question is not about what undefined behavior is, or whether it is really bad. I do know the perils and most of the relevant undefined behavior quotes from the standard, so please refrain from posting answers about how bad it is. This question is about the philosophy behind leaving so many behaviors open to compiler implementation.

I read an excellent blog post stating that performance is the main reason. I was wondering whether performance is the only criterion for allowing it, or whether there are other factors that influence the decision to leave things open to compiler implementation.

If you have any examples to cite of how a particular undefined behavior provides sufficient room for the compiler to optimize, please list them. If you know of any factors other than performance, please back your answer with sufficient detail.

If you do not understand the question or do not have sufficient evidence/sources to back your answer, please do not post broadly speculative answers.

Alok Save
  • 1,138
  • 7
  • 12
  • 7
    who ever heard of a deterministic computer anyway? – sova Aug 09 '11 at 19:16
  • 1
    as litb's excellent answer http://programmers.stackexchange.com/a/99741/192238 indicates, the title and body of this question seem a little bit mismatched: "behaviors open for compilers to implement in their own way" are usually referred to as _implementation-defined_. sure, actual UB is allowed to be defined by the implementation author, but more often than not, they don't bother (and optimise it all away, etc.) – underscore_d Feb 26 '16 at 01:50
  • Something similar to this https://softwareengineering.stackexchange.com/questions/398703/why-does-c-have-undefined-behaviour-and-other-languages-like-c-or-java-don – Sisir Sep 22 '19 at 17:51

13 Answers

52

First, I'll note that although I only mention "C" here, the same really applies about equally to C++ as well.

The comment mentioning Godel was partly (but only partly) on point.

When you get down to it, undefined behavior in the C standards is largely just pointing out the boundary between what the standard attempts to define, and what it doesn't.

Godel's theorems (there are two) basically say that it's impossible to define a mathematical system that can be proven (by its own rules) to be both complete and consistent. You can make your rules so it can be complete (the case he dealt with was the "normal" rules for natural numbers), or else you can make it possible to prove its consistency, but you can't have both.

In the case of something like C, that doesn't apply directly -- for the most part, "provability" of the completeness or consistency of the system isn't a high priority for most language designers. At the same time, yes, they probably were influenced (to at least some degree) by knowing that it's provably impossible to define a "perfect" system -- one that's provably complete and consistent. Knowing that such a thing is impossible may have made it a bit easier to step back, breathe a little, and decide on the bounds of what they would try to define.

At the risk of (yet again) being accused of arrogance, I'd characterize the C standard as being governed (in part) by two basic ideas:

  1. The language should support as wide a variety of hardware as possible (ideally, all "sane" hardware down to some reasonable lower limit).
  2. The language should support writing as wide a variety of software as possible for the given environment.

The first means that if somebody defines a new CPU, it should be possible to provide a good, solid, usable implementation of C for that, as long as the design falls at least reasonably close to a few simple guidelines -- basically, if it follows something on the general order of the Von Neumann model, and provides at least some reasonable minimum amount of memory, that should be enough to allow a C implementation. For a "hosted" implementation (one that runs on an OS) you need to support some notion that corresponds reasonably closely to files, and have a character set with a certain minimum set of characters (91 are required).

The second means it should be possible to write code that manipulates the hardware directly, so you can write things like boot loaders, operating systems, embedded software that runs without any OS, etc. There are ultimately some limits in this respect, so nearly any practical operating system, boot loader, etc., is likely to contain at least a little bit of code written in assembly language. Likewise, even a small embedded system is likely to include at least some sort of pre-written library routines to give access to devices on the host system. Although a precise boundary is difficult to define, the intent is that the dependency on such code should be kept to a minimum.

The undefined behavior in the language is largely driven by the intent for the language to support these capabilities. For example, the language allows you to convert an arbitrary integer to a pointer, and access whatever happens to be at that address. The standard makes no attempt at saying what will happen when you do (e.g., even reading from some addresses can have externally visible effects). At the same time, it makes no attempt at preventing you from doing such things, because you need to for some kinds of software you're supposed to be able to write in C.
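As a hedged sketch of what that permits (the address 0x40021000 is invented for illustration; real values come from a device's data sheet):

#include <stdint.h>

/* Hypothetical memory-mapped device register -- the address is made up. */
#define STATUS_REG (*(volatile uint32_t *)(uintptr_t)0x40021000)

uint32_t read_status(void)
{
  /* The standard says nothing about what this read does. On the right
     hardware it samples a device register, and the read itself may have
     externally visible effects (e.g., clearing an interrupt flag); on
     other hardware it may simply fault. */
  return STATUS_REG;
}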

There is some undefined behavior driven by other design elements as well. For example, one other intent of C is to support separate compilation. This means (for example) that it's intended that you can "link" pieces together using a linker that follows roughly what most of us see as the usual model of a linker. In particular, it should be possible to combine separately compiled modules into a complete program without knowledge of the semantics of the language.

There is another type of undefined behavior (that's much more common in C++ than C), which is present simply because of the limits on compiler technology -- things that we basically know are errors, and would probably like the compiler to diagnose as errors, but given the current limits on compiler technology, it's doubtful that they could be diagnosed under all circumstances. Many of these are driven by the other requirements, such as for separate compilation, so it's largely a matter of balancing conflicting requirements, in which case the committee has generally opted to support greater capabilities, even if that means failing to diagnose some possible problems, rather than limiting the capabilities to ensure that all possible problems are diagnosed.

These differences in intent drive most of the differences between C and something like Java or Microsoft's CLI-based systems. The latter are fairly explicitly limited to working with a much more limited set of hardware, or requiring software to emulate the more specific hardware they target. They also specifically intend to prevent any direct manipulation of hardware, instead requiring that you use something like JNI or P/Invoke (and code written in something like C) to even make such an attempt.

Going back to Godel's theorems for a moment, we can draw something of a parallel: Java and CLI have opted for the "internally consistent" alternative, while C has opted for the "complete" alternative. Of course, this is a very rough analogy -- I doubt anybody's attempting a formal proof of either internal consistency or completeness in either case. Nonetheless, the general notion does fit fairly closely with the choices they've taken.

Jerry Coffin
  • 44,385
  • 5
  • 89
  • 162
  • 26
I think Godel's Theorems are a red herring. They deal with proving a system from its own axioms, which is not the case here: C does not need to be specified in C. It is quite possible to have a completely specified language (consider a Turing machine). – poolie Aug 10 '11 at 02:48
  • 9
Sorry, but I fear you've completely misunderstood Godel's Theorems. They deal with the impossibility of proving all true statements in a consistent system of logic; in terms of computing, the incompleteness theorem is analogous to saying that there are problems that cannot be solved by any program - problems are analogous to true statements, programs to proofs and the model of computation to the logic system. It has no connection at all to undefined behaviour. See here for an explanation of the analogy: http://www.scottaaronson.com/blog/?p=710. – Alex ten Brink Aug 10 '11 at 09:57
  • 5
    I should note that a Von Neumann machine is not required for a C implementation. It's perfectly possible (and not even very difficult) to develop a C implementation for a Harvard architecture (and I wouldn't be surprised to see a lot of such implementations on embedded systems) – bdonlan Aug 30 '11 at 03:53
  • 3
Unfortunately, modern C compiler philosophy takes UB to a whole new level. Even in cases where a program was prepared to deal with almost all plausible "natural" consequences from a particular form of Undefined Behavior, and those it couldn't deal with would at least be recognizable (e.g. trapped integer overflow), the new philosophy favors bypassing any code which couldn't execute unless UB was going to occur, turning code which would have behaved correctly on almost any implementation into code which is "more efficient" but just plain wrong. – supercat Apr 09 '15 at 19:47
23

The C rationale explains

The terms unspecified behavior, undefined behavior, and implementation-defined behavior are used to categorize the result of writing programs whose properties the Standard does not, or cannot, completely describe. The goal of adopting this categorization is to allow a certain variety among implementations which permits quality of implementation to be an active force in the marketplace as well as to allow certain popular extensions, without removing the cachet of conformance to the Standard. Appendix F to the Standard catalogs those behaviors which fall into one of these three categories.

Unspecified behavior gives the implementor some latitude in translating programs. This latitude does not extend as far as failing to translate the program.

Undefined behavior gives the implementor license not to catch certain program errors that are difficult to diagnose. It also identifies areas of possible conforming language extension: the implementor may augment the language by providing a definition of the officially undefined behavior.

Implementation-defined behavior gives an implementor the freedom to choose the appropriate approach, but requires that this choice be explained to the user. Behaviors designated as implementation-defined are generally those in which a user could make meaningful coding decisions based on the implementation definition. Implementors should bear in mind this criterion when deciding how extensive an implementation definition ought to be. As with unspecified behavior, simply failing to translate the source containing the implementation-defined behavior is not an adequate response.

Also important is the benefit for programs, not only the benefit for implementations. A program that depends on undefined behavior can still be conforming, if it is accepted by a conforming implementation. The existence of undefined behavior allows a program to use non-portable features explicitly marked as such ("undefined behavior") without becoming non-conforming. The rationale notes:

C code can be non-portable. Although it strove to give programmers the opportunity to write truly portable programs, the Committee did not want to force programmers into writing portably, to preclude the use of C as a ``high-level assembler'': the ability to write machine-specific code is one of the strengths of C. It is this principle which largely motivates drawing the distinction between strictly conforming program and conforming program (§1.7).

And at 1.7 it notes

The three-fold definition of compliance is used to broaden the population of conforming programs and distinguish between conforming programs using a single implementation and portable conforming programs.

A strictly conforming program is another term for a maximally portable program. The goal is to give the programmer a fighting chance to make powerful C programs that are also highly portable, without demeaning perfectly useful C programs that happen not to be portable. Thus the adverb strictly.

Thus, this dirty little program that works perfectly fine on GCC is still conforming!
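(The linked program isn't reproduced here; the following stand-in, written for this discussion, is in the same spirit: it leans on implementation-defined behavior, so it is conforming but not strictly conforming, and GCC documents the result it gives.)

#include <stdio.h>

int main(void)
{
  int i = -1;
  /* Right-shifting a negative value is implementation-defined; GCC
     documents it as an arithmetic shift, so this reliably prints -1
     there -- yet the program is not strictly conforming. */
  printf("%d\n", i >> 1);
  return 0;
}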

Johannes Schaub - litb
  • 1,011
  • 1
  • 9
  • 16
16

The speed thing is especially a problem when compared to C. If C++ did some things that might be sensible, like initializing large arrays of primitive types, it would lose a ton of benchmarks to C code. So C++ initializes its own data types, but leaves the C types the way they were.

Other undefined behavior just reflects reality. One example is bit-shifting with a count larger than the width of the type. That actually differs between hardware generations of the same family. If you have a 16-bit app, the exact same binary will give different results on an 80286 and an 80386. So the language standard says that we don't know!
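A minimal sketch of the situation (the volatile only keeps the compiler from folding the shift at compile time):

#include <stdio.h>

int main(void)
{
  volatile unsigned n = 33;  /* count >= width of unsigned: undefined */
  unsigned x = 1;
  /* Hardware that masks the count to 5 bits prints 2; hardware that
     honors the full count shifts everything out and prints 0. */
  printf("%u\n", x << n);
  return 0;
}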

Some things are just kept the way they were, like the order of evaluation of subexpressions being unspecified. Originally this was believed to help compiler writers optimize better. Nowadays the compilers are good enough to figure it out anyway, but the cost of finding all places in existing compilers that take advantage of the freedom is just too high.
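For instance, in this sketch either output order is a correct translation, and different compilers (or options) may differ:

#include <stdio.h>

static int f(void) { puts("f"); return 1; }
static int g(void) { puts("g"); return 2; }

int main(void)
{
  /* Unspecified whether f or g is called first. */
  printf("%d\n", f() + g());
  return 0;
}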

Bo Persson
  • 381
  • 1
  • 14
  • +1 for the second paragraph, which shows something that would be awkward to have specified as implementation-defined behavior. – David Thornley Aug 09 '11 at 15:59
  • 3
The bit shift is just an example of accepting undefined compiler behaviour and using the hardware capabilities. It would be trivial to specify a C result for a bit shift when the count is larger than the type, but expensive to implement on some hardware. – mattnz Aug 09 '11 at 21:24
  • @mattnz: How many hardware platforms are there where it would cost anything to specify that oversized shifts may either choose in Unspecified fashion between shifting by the specified value or a mod-reduced value, or may behave in some Implementation-Defined alternative fashion should they choose to document one? – supercat Aug 06 '21 at 19:49
  • Read the excellent answer by @Jerry Coffin. – mattnz Aug 06 '21 at 22:44
  • why does a bit shift have to have the same semantics on every platform? why can't it just be very-well-defined-by-the-compiler-for-each-platform instead of undefined-so-let's-delete-user's-code dumb idea that we have now? – capr Apr 01 '23 at 21:55
7

As one example, pointer accesses almost have to be undefined, and not necessarily just for performance reasons. For example, on some systems, loading specific registers with a pointer will generate a hardware exception. On SPARC accessing an improperly aligned memory object will cause a bus error, but on x86 it would "just" be slow. It's tricky to actually specify behavior in those cases since the underlying hardware dictates what will happen, and C++ is portable to so many types of hardware.
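A hedged sketch of the hazard and the portable way around it (the caller is assumed to supply at least five valid bytes):

#include <stdint.h>
#include <string.h>

uint32_t load_at_offset_1(const unsigned char *buf)
{
  /* *(const uint32_t *)(buf + 1) would be a misaligned access:
     a bus error on SPARC, "just" slow on x86. */
  uint32_t v;
  memcpy(&v, buf + 1, sizeof v);  /* well-defined on every platform */
  return v;
}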

Of course it also gives the compiler freedom to use architecture-specific knowledge. As an implementation-defined behavior example, right shift of signed values may be logical or arithmetic depending on the underlying hardware, allowing whichever shift operation is available to be used rather than forcing software emulation of it.
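A minimal sketch:

#include <stdio.h>

int main(void)
{
  int v = -8;
  /* Implementation-defined: an arithmetic shift (sign bit copied in)
     prints -4; a logical shift would print a large positive value.
     Either way, the compiler emits the one instruction the CPU has. */
  printf("%d\n", v >> 1);
  return 0;
}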

I also believe it makes the compiler-writer's job rather easier, but I can't recall the example just now. I'll add it if I recall the situation.

Mark B
  • 751
  • 4
  • 3
  • 3
    The C language could have been specified such that it always had to use byte-by-byte reads on systems with alignment restrictions, and such that it had to provide exception traps with well-defined behavior for invalid address accesses. But of course this all would have been incredibly costly (in code size, complexity, and performance) and would have offered no benefits whatsoever to sane, correct code. – R.. GitHub STOP HELPING ICE Aug 09 '11 at 17:49
  • It does make the job easier yes. This has been noted in different places and if you think about it it makes sense. The fewer requirements one has in something the easier it will be to fulfil all the requirements. That's not to say that the standard could probably be more specific in some places but it definitely makes it easier for compiler writers. It probably also makes it easier for the standard writing body as well. – Pryftan Mar 18 '20 at 17:20
  • @R..GitHubSTOPHELPINGICE: Requiring byte-by-byte reads on systems with alignment restrictions would have made the language unsuitable for many tasks. Better would have been to specify that a compiler may in general choose in Unspecified fashion between performing a read of a word-sized object by using a platform's natural method for such reads, if it has one, with whatever consequences result, or by performing a sequence of smaller reads, or reusing a previous value that was written to or read from that storage, or simply doing nothing if the value of the read would be ignored. – supercat Aug 09 '21 at 16:58
6

Simple: speed and portability. If C++ guaranteed that you got an exception when you dereference an invalid pointer, then it wouldn't be portable to embedded hardware. If C++ guaranteed other things, like always-initialized primitives, then it would be slower, and at the time of C++'s origin, slower was a really, really bad thing.
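A hedged sketch of the initialization cost being avoided:

/* Entering this function is essentially free: no code is emitted to
   initialize buf. A rule mandating always-initialized primitives would
   force the equivalent of a 64 KiB memset on every call. */
void fast_scratch(void)
{
  char buf[65536];  /* contents indeterminate; zero runtime cost */
  (void)buf;        /* silence unused-variable warnings */
}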

DeadMG
  • 36,794
  • 8
  • 70
  • 139
  • 1
    Huh? What do exceptions have to do with embedded hardware? – Mason Wheeler Aug 09 '11 at 16:09
  • 2
Exceptions may lock up the system in ways that are very bad for Embedded Systems that need to respond quickly. There are situations where a false reading is much less damaging than a slowed system. –  Aug 09 '11 at 16:13
  • 1
    @Mason: Because the hardware has to catch the invalid access. It's easy for Windows to throw an access violation, and harder for embedded hardware with no operating system to do anything except die. – DeadMG Aug 09 '11 at 16:14
  • 3
    Also remember that not every CPU has an MMU to guard against invalid accesses in hardware to begin with. If you start requiring your language to check all pointer accesses, then you have to emulate an MMU on CPUs without one - and thus EVERY memory access becomes extremely expensive. – fluffy Aug 10 '11 at 00:19
  • @fluffy: On the flip side, if a compiler requires that programmers manually check pointers for validity rather than allowing them to benefit from memory-protection hardware that would provide zero-overhead null-pointer checks automatically, one will be back to having machine code include explicit pointer checks. – supercat Aug 06 '21 at 19:55
  • @supercat Yes, but well-written code doesn't require those checks, and static analysis tools help even more. Also in this context, what does "checking a pointer for validity" even mean? – fluffy Aug 07 '21 at 20:48
  • @fluffy: Following e.g. a `malloc()` call, testing for validity would mean checking that a pointer isn't null. Likewise when performing a typical hash-table lookup function that is supposed to return a pointer to an item that should exist in a table unless source data gets changed unexpectedly (something that shouldn't happen in properly-functioning code, but could happen in a program accesses data from address space shared with defective or untrustworthy code). If a platform spec guarantees that a null reference will be trapped, omitting checks for whether a reference is null... – supercat Aug 08 '21 at 00:45
  • ...that occur after an *actual* access could be a useful optimization, but if e.g. code reads from an address and omits the result, omitting the read should negate any presumption that the null check couldn't be reached if the pointer was null. The Standard's abstract machine is designed to exclude all features and guarantees that might be impractical to support on some implementations, but if hardware can guarantee that a null pointer access will trap, a quality implementation should guarantee that it will at worst either trap in controlled fashion *or have no side effects*, ... – supercat Aug 08 '21 at 00:48
  • ...with the choice possibly being made in Unspecified fashion. If an implementation guarantees that, and programmers can exploit it, that may allow many tasks to be done in ways meeting requirements more efficiently than would be possible without any sort of guarantee whatsoever. – supercat Aug 08 '21 at 00:50
  • @supercat Making guarantees interferes with compiler optimisations based on not having to make guarantees, so there won't really be better efficiency. A quality implementation offers the best efficiency. Of course, one could make an argument that saving programmer time means implementing more efficient designs which is worth more, but that's a different debate. – DeadMG Aug 08 '21 at 15:18
  • @DeadMG: Optimal efficiency will generally be obtained by offering the loosest guarantees *sufficient to meet application requirements*. Implementations specialized for tasks with unusually loose requirements may benefit from optimizations that would be inappropriate for most other tasks. If a machine-language program which would allow maliciously crafted data to execute code of its authors choosing could process valid data faster than would be possible with one that would not allow arbitrary code execution, an optimizer that would generate such machine code may be... – supercat Aug 08 '21 at 19:31
  • ...useful for tasks that would never involve processing invalid data. For tasks which would involve processing data from untrustworthy sources, however, the fastest code that could process invalid data in a manner which would be guaranteed to have no side effects would be more efficient than the fastest possible code that validates all data. Saying that the only way to prevent remote-code-execution exploits is to mandate that all input be processed in fully deterministic fashion would block generation of the most efficient machine code *that would meet application requirements*. – supercat Aug 08 '21 at 19:39
  • @supercat But you can always write correct code that meets application requirements in the current system, which is to say, just don't invoke UB, which frankly should be pretty easily done. Thus we receive optimal efficiency and we can write code that meets our application requirements; the best of both worlds. There is nothing stopping you from checking for nulls whenever you like, even literally all the time if that's what you want, or just not infecting your system with them, e.g. malloc terminating on OOM in many environments instead of returning null. – DeadMG Aug 08 '21 at 22:20
  • @DeadMG: How would you write the most efficient function `int mulComp(int a, int b, long long c)` that is required to computes `a*b < c` in cases where the mathematical product of `a` and `b` would be representable as `int`, and return either 0 or 1 without side effects otherwise? A compiler that guarantees integer overflow will have no side effects could substitute either `(long long)a*b < c` or `(int)(1u*a*b) < c` at its leisure, and depending upon circumstance either of those could be faster than the other by an arbitrary amount. – supercat Aug 08 '21 at 22:43
  • @DeadMG: If the function would be called in some cases where `a` and `b` were known to be equal while `c` was zero, and in others where `c` would be known to exceed `INT_MAX`, a compiler that guaranteed integer overflow would have no side effects, thus allowing the function to be written as simply `a*b < c` would be able to treat the cases where `a` and `b` are equal and `c` is zero as yielding a constant 0 with no dependency on the particular value that's in `a` and `b`, and the cases where `c` exceeds `INT_MAX` as a constant 1 without regard for what's in `a` and `b`. – supercat Aug 08 '21 at 22:50
  • @DeadMG: If a programmer had to avoid integer overflow at all costs, however, there would be no way for a compiler to generate the most efficient machine code that would meet programmer requirements, because the programmer would have to specify a particular means of handling cases where overflow would occur, rather than letting the compiler pick the most efficient means that would meet the "side-effect-free" requirement. – supercat Aug 08 '21 at 22:52
  • @supercat But that's the thing about the Standard, which is that the compiler *always* can replace that code with machine-specific code under as-if. If I check for overflow, and then return 0 or 1 in that case, then the compiler can substitute a multiplication at hardware level and check the overflow bit. There is nothing preventing the compiler from optimising in this case which creates the equivalent machine code. Moreover, cherry-picking examples does not help because the Standard must pick rules that work for most cases, and the compiler must optimise every function, not just this one. – DeadMG Aug 10 '21 at 08:08
  • There is of course also nothing preventing an implementation from offering intrinsics, builtins, switches, or library functions which offer hardware-specific functionality if such is available, which many do. It is not required of the Standard to offer such things. – DeadMG Aug 10 '21 at 08:10
  • @DeadMG: The Standard has never sought to specify everything an implementation must do to be suitable for any particular purpose; C is useful because most implementations, as a form of "conforming language extension", specify how they will behave in circumstances where the language imposes no requirements, and it allows Conforming C Programs (though not Strictly Conforming ones) to exploit such extensions. The Standard allows implementations to deviate from commonplace behaviors in cases where doing so would make them more useful for their intended purpose, but ... – supercat Aug 10 '21 at 14:53
  • ... that does not imply any judgment that such deviation would not render implementations unsuitable for most other purposes. – supercat Aug 10 '21 at 14:53
  • @DeadMG: Note also an important corollary of the "as-if" rule: the only way the Standard can allow a potentially-useful optimizing transform is to classify as Undefined Behavior any action that would make the effects of that transform visible. If integer-overflow were Implementation Defined (as opposed to what K&R calls "machine dependent"), the only way to allow integer operations to be reordered on platforms where it would raise a signal would be to classify as Undefined Behavior all situations where such a signal might be raised. – supercat Aug 10 '21 at 15:11
  • @DeadMG: Incidentally, have you read the published Rationale document for the C Standard? The authors clearly expected that commonplace implementations would process at least some cases involving integer overflow meaningfully without regard for whether or not the Standard would require them to do so. – supercat Aug 10 '21 at 16:11
4

C was invented on a machine with 9-bit bytes and no floating-point unit - suppose it had mandated that bytes be 9 bits, words 18 bits, and that floats be implemented using pre-IEEE754 arithmetic?

Martin Beckett
  • 15,776
  • 3
  • 42
  • 69
  • 5
I suspect you're thinking of Unix -- C was originally used on the PDP-11, which was actually pretty conventional by current standards. I think the basic idea stands nonetheless. – Jerry Coffin Aug 09 '11 at 18:44
  • @Jerry - yes, you're right - I'm getting old ! – Martin Beckett Aug 09 '11 at 19:01
  • 1
    Yup -- happens to the best of us, I'm afraid. – Jerry Coffin Aug 09 '11 at 19:08
  • Okay but this really isn't about undefined behaviour but instead more like unspecified or maybe implementation defined. Also if you use your reasoning then what about new architectures? What we have today is very different from even the early 1990s. And that was quite different from earlier too. So you're right but there are differences between undefined and unspecified. – Pryftan Mar 18 '20 at 17:57
4

One of the early classic cases was signed integer addition. On some of the processors in use, overflow would cause a fault, and on others it would just continue on with a value (likely the appropriate modular value). Specifying either case would mean that programs for machines with the unfavored arithmetic style would have to include extra code, including a conditional branch, for something as simple as integer addition.
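A hedged sketch of the extra code a mandate would force on the "wrong" machines (__builtin_trap is a GCC/Clang intrinsic, standing in here for a hardware fault):

#include <limits.h>

/* On hardware that traps on overflow, mandated wraparound would need
   emulation via unsigned arithmetic, which is defined to wrap: */
int add_wrapping(int a, int b)
{
  return (int)((unsigned)a + (unsigned)b);
}

/* On hardware that wraps, a mandated fault would need an explicit
   check and conditional branch around every addition: */
int add_trapping(int a, int b)
{
  if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
    __builtin_trap();
  return a + b;
}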

David Thornley
  • 20,238
  • 2
  • 55
  • 82
  • Integer addition is an interesting case; beyond the possibility of trap behavior which in some cases would be useful but could in other cases cause random code execution, there are situations where it would be reasonable for a compiler to make inferences based upon the fact that integer overflow is not specified to wrap. For example, a compiler where `int` is 16 bits and sign-extended shifts are expensive could compute `(uchar1*uchar2) >> 4` using a non-sign-extended shift. Unfortunately, some compilers extend inferences not just to results, but to the operands. – supercat Apr 14 '15 at 17:19
4

I don't think the first rationale for UB was to leave room for the compiler to optimize, but just to allow the obvious implementation for the targets, at a time when architectures had more variety than now (remember, even if C was designed on a PDP-11, which has a somewhat familiar architecture, the first port was to the Honeywell 635, which is far less familiar -- word-addressable, using 36-bit words, 6- or 9-bit bytes, 18-bit addresses... well, at least it used 2's complement). But if heavy optimization wasn't a goal, the obvious implementation doesn't include adding run-time checks for overflow, for shift counts over the register size, or for aliases in expressions modifying multiple values.

Another thing taken into account was ease of implementation. A C compiler at the time ran as multiple passes in multiple processes, because having one process handle everything would not have been possible (the program would have been too large). Asking for heavy coherence checks was out of the question -- especially when they involved several compilation units. (A separate program from the C compiler, lint, was used for that.)

AProgrammer
  • 10,404
  • 1
  • 30
  • 45
  • 2
    I wonder what drove the changing philosophy of UB from "Allow programmers to use behaviors exposed by their platform" to "Find excuses to let compilers to implement totally wacky behavior"? I also wonder how much such optimizations end up improving code size after code is modified to work under the new compiler? I wouldn't be surprised if in many cases the only effect of adding such "optimizations" to the compiler is to force programmers to write bigger and slower code so as to avoid having the compiler break it. – supercat Apr 14 '15 at 17:15
It's a drift in POV. People became less aware of the machine on which their program runs, they became more concerned with portability, so they avoided depending on undefined, unspecified and implementation-defined behavior. There was pressure on optimizers to get the best results on benchmarks, and that means making use of every leniency left by the spec of the languages. There is also the fact that Internet language lawyers -- on Usenet at the time, on SE nowadays -- tend to give a biased view of the underlying rationale and behavior of compiler writers. – AProgrammer Apr 15 '15 at 07:49
  • 2
    What I find curious is statements I've seen to the effect of "C assumes that programmers will never engage in undefined behavior"--a fact which has historically not been true. A correct statement would have been "C assumed that programmers would not trigger behavior undefined by the standard unless prepared to deal with *the natural platform consequences* of that behavior. Given that C was designed as a systems-programming language, a big part of its *purpose* was to allow programmers to do system-specific things not defined by the language standard; the idea that they'd never do so is absurd. – supercat Apr 15 '15 at 14:43
  • It's good for programmers to go through extra efforts to ensure portability *in cases where different platforms would inherently do different things*, but compiler writers waste everyone's time when they eliminate behaviors which programmers historically could have reasonably expected to be common to all future compilers. Given integers `i` and `n`, such that `n < INT_BITS` and `i*(1< – supercat Apr 15 '15 at 14:53
  • While I think it would be good for the standard to allow traps for many things it calls UB (e.g. integer overflow), and there are good reasons for it not to require that traps do anything predictable, I would think that from every standpoint imaginable the standard would be improved if it required that most forms of UB must either yield indeterminate value or document the fact that they reserve the right to do something else, without being absolutely required to document what that something else might be. Compilers which made everything "UB" would be legal, but likely disfavored... – supercat Apr 15 '15 at 14:58
  • 1
    ...over those which could offer at least some form of guarantees. To my mind, given `int blah(uint16_t x1) { if (x1 < 50000) foo(x1); return x1*65536;}` having a trap when `x1` is greater than 32767 may be preferable to having it silently overflow, and if a compiler could determine that `foo()` could not prevent the Undefined Behavior (e.g. by doing a `longjmp`) I would have no problem with `blah(32768)` trapping before calling `foo`, but only if trap behavior is totally undefined would it be legitimate to call `foo` with a value of 50000. – supercat Apr 15 '15 at 15:15
  • @supercat A most curious point yes. Laziness comes to mind. But then can it be laziness on the standard body when they add so many new keywords etc.? I tend to not have optimisations enabled and I don't think it makes such a big effect but that's probably also bias. As for the statement you refer to. The very idea behind it is ludicrous isn't it? If they specify a thing as UB then they must anticipate it happening so how can they expect otherwise? What might be more helpful is for more clarity but I don't see that ever happening. It would sure make things easier though. I like your idea too. – Pryftan Mar 18 '20 at 17:40
  • 1
    @Pryftan: Remember that the Standard was written after the language was already in wide use. Saying "the Standard imposes no requirements" was a catch-all for situations in which it would be useful to have at least compilers behave in different fashion. While it would have been better for the Standard to offer recommendations, I think that would have been seen as showing favoritism toward some architectures; C89 instead seems to go out of its way to avoid "recommending" anything that it doesn't mandate. – supercat Mar 18 '20 at 21:53
2

Historically, Undefined Behavior had two primary purposes:

  1. To avoid requiring compiler authors to generate code to handle conditions which were never supposed to occur.

  2. To allow for the possibility that in the absence of code to explicitly handle such conditions, implementations may have various kinds of "natural" behaviors which would, in some cases, be useful.

As a simple example, on some hardware platforms, attempting to add together two positive signed integers whose sum is too large to fit in a signed integer will yield a particular negative signed integer. On other implementations it will trigger a processor trap. For the C standard to mandate either behavior would mean that compilers for platforms whose natural behavior differed from the standard would have to generate extra code to yield the mandated behavior--code which may be more expensive than the code for the actual addition. Worse, it would mean that programmers who wanted the "natural" behavior would have to add even more extra code to achieve it (and that extra code would again be more expensive than the addition).

Unfortunately, some compiler authors have taken the philosophy that compilers should go out of their way to find conditions that would evoke Undefined Behavior and, presuming that such situations may never occur, draw extended inferences from that. Thus, on a system with 32-bit int, given code like:

uint32_t foo(uint16_t q, int *p)
{
  if (q > 46340)   /* 46341 * 46341 no longer fits in a 32-bit int */
    (*p)++;        /* flag that the result will have overflowed */
  return q*q;      /* q promotes to int, so this multiply can overflow */
}

the C standard would allow the compiler to say that if q is 46341 or larger, the expression q*q will yield a result too large to fit in an int, consequently causing Undefined Behavior; as a result the compiler would be entitled to assume that can't happen, and thus would not be required to increment *p if it does. If the calling code uses *p as an indicator that it should discard the results of the computation, the effect of the optimization may be to take code which would have yielded sensible results on systems that handle integer overflow in almost any imaginable way (trapping may be ugly, but would at least be sensible), and turn it into code which may behave nonsensically.
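A hedged sketch of what such a compiler may effectively turn the function into:

#include <stdint.h>

/* Since q > 46340 implies q*q overflows int, the compiler may assume the
   branch can never be taken and discard the increment entirely: */
uint32_t foo_as_optimized(uint16_t q, int *p)
{
  (void)p;                 /* the (*p)++ has vanished */
  return (uint32_t)q * q;  /* bare 32-bit multiply, no check at all */
}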

supercat
  • 8,335
  • 22
  • 28
  • And unfortunately using even negative or less ideal things will reinforce the belief - and the very system -and therefore only encourage it or at the very best not discourage it. Thus there's a loop that the standard and the compilers go round and round, where each time it gets bigger. – Pryftan Mar 18 '20 at 18:12
  • A bigger issue is that the authors of the Standard decided, when it was written, that that because implementations were using processing certain constructs that had no universal meaning in a variety of useful ways that were suitable for various tasks, there was no need for the Standard to provide ways of doing those tasks. For the Standard to officially recognize constructs which some implementations have supported for decades, but some optimizers' authors have doubled down on refusing to support except by disabling optimizations, would be to acknowledge that there was never a good reason... – supercat Mar 18 '20 at 23:52
  • ...for the optimizers' authors not to have supported such constructs decades ago. – supercat Mar 18 '20 at 23:53
2

I'd say it was less about philosophy than it was about reality -- C has always been a cross-platform language, and the standard has to reflect that, and the fact that at the time any standard is released, there are going to be a large number of implementations on a lot of different hardware. A standard that forbade necessary behavior would either be disregarded or produce a competing standards body.

jmoreno
  • 10,640
  • 1
  • 31
  • 48
  • 2
    Originally, many behaviors were left undefined to allow for the possibility that different systems would do different things, including trigger a hardware trap with a handler that may or may not be configurable (and might, if not configured, cause arbitrarily-unpredictable behavior). Requiring that a left-shift of a negative value *not* trap, for example, would break any code which was designed for a system where it did and relied upon such behavior. In short, they were left undefined *so as not to prevent implementers from provide behaviors they thought were useful*. – supercat Apr 14 '15 at 16:59
  • 2
    Unfortunately, however, that has been twisted around such that even code which knows that it's running on a processor that would do something useful in a particular case can't take advantage of such behavior, because compilers may use the fact that the C standard doesn't specify the behavior (although the platform would) to apply bizarro-world rewrites to the code. – supercat Apr 14 '15 at 17:05
1

Some behaviors cannot be defined by any reasonable means. I mean accessing a deleted pointer. The only way to detect it would be to ban the pointer value after deletion (memorizing its value somewhere and not allowing any allocation function to return it again). Not only would such memorization be overkill, but for a long-running program it would cause the implementation to run out of allowed pointer values.
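A minimal sketch of the access in question:

#include <stdlib.h>

int main(void)
{
  int *p = malloc(sizeof *p);
  if (!p)
    return 1;
  *p = 42;
  free(p);
  /* Undefined: p is now indeterminate. Detecting this reliably would
     require quarantining the old pointer value forever. */
  return *p;
}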

  • 2
    or you could allocate all pointers as `weak_ptr` and nullify all references to a pointer that gets `delete`d... oh wait, we're approaching garbage collection :/ – Matthieu M. Aug 09 '11 at 17:29
  • 1
    `boost::weak_ptr`'s implementation is a pretty good template to start with for this usage pattern. Rather than tracking and nullifying `weak_ptrs` externally, a `weak_ptr` just contributes to the `shared_ptr`'s weak count, and the weak count is basically a refcount to the pointer itself. Thus, you can nullify the `shared_ptr` without having to delete it immediately. It's not perfect (you can still have lots of expired `weak_ptr`s maintaining the underlying `shared_count` for no good reason) but at least it's fast and efficient. – fluffy Aug 22 '11 at 21:31
  • Not true on the only way of detecting it. I've made a pointer validity and linked list tracking system for a specific project of mine to do exactly what you suggest is impossible. Well okay you have to make use of it but it's still possible. And if other languages can do this then so too can C. So your argument doesn't really hold - or at least your example does not hold. – Pryftan Mar 18 '20 at 18:02
0

I'll give you an example where there's pretty much no sensible choice other than undefined behavior. In principle, any pointer could point to the memory containing any variable, with the small exception of local variables that the compiler can know have never had their address taken. However, to get acceptable performance on a modern CPU, a compiler must copy variable values into registers. Operating entirely out of memory is a non-starter.

This basically gives you two choices:

1) Flush everything out of registers before any access through a pointer, just in case the pointer points to that particular variable's memory. Then load everything needed back into registers, just in case the values were changed through the pointer.

2) Have a set of rules for when a pointer is allowed to alias a variable and when the compiler is permitted to assume that a pointer does not alias a variable.

C opts for option 2, because 1 would be terrible for performance. But then, what happens if a pointer aliases a variable in a way the C rules prohibit? Since the effect depends on whether the compiler did in fact store the variable in a register, there's no way for the C standard to definitively guarantee specific results.
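A hedged sketch of option 2 going wrong when the rules are broken:

#include <stdio.h>

static int x = 1;

/* Under the aliasing rules a compiler may assume a float* never points
   at an int, so it may keep x cached in a register across this store. */
static void sneaky(float *f)
{
  *f = 0.0f;
}

int main(void)
{
  sneaky((float *)&x);  /* violates the rules: the pointer's type lies */
  printf("%d\n", x);    /* may print 1 (the cached copy) or something else */
  return 0;
}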

David Schwartz
  • 4,676
  • 22
  • 26
  • 2
    There would be a semantic difference between saying "A compiler is allowed to behave as though X is true" and saying "Any program where X is not true will engage in Undefined Behavior", though unfortunately the standards to not make the distinction clear. In many situations, including your aliasing example, the former statement would allow many compiler optimizations that would be impossible otherwise; the latter allows some more "optimizations", but many of the latter optimizations are things programmers wouldn't want. – supercat Apr 09 '15 at 18:48
  • 2
    For example, if some code sets a `foo` to 42, and then calls a method that uses an illegitimately-modified pointer to set `foo` to 44, I can see benefit to saying that until the next "legitimate" write of `foo`, attempts to read it may legitimately yield 42 or 44, and an expression like `foo+foo` could even yield 86, but I see far less benefit to allowing the compiler to make extended and even retroactive inferences, changing Undefined Behavior whose plausible "natural" behaviors would all have been benign, into a license to generate nonsensical code. – supercat Apr 09 '15 at 18:58
-6

Efficiency is the usual excuse, but whatever the excuse, undefined behavior is a terrible idea for portability. In effect undefined behaviors become unverified, unstated assumptions.

ddyer
  • 4,060
  • 15
  • 18
  • 7
    The OP specified this: "My question is not about what undefined behavior is, or is it really bad. I do know the perils and most of the relevant undefined behavior quotes from the standard, so please refrain from posting answers about how bad it is." Looks like you didn't read the question. – Etienne de Martel Aug 09 '11 at 18:45
  • In addition to the other comment I would add that in some ways it can help with portability. If you don't see how then you're not thinking it completely through. The trouble is that there isn't as much clarity as there should be and maybe too many things are undefined. Even if you exclude portability - say go with your reasoning - one could still say it's easier for the compiler writers since they would have fewer requirements to fulfil to be compliant with the standard. That itself in a way could make some portability issues easier to deal with from a compiler's perspective. – Pryftan Mar 18 '20 at 18:09