58

This Stack Overflow post lists a fairly comprehensive set of situations that the C/C++ language specifications declare to be 'undefined behaviour'. However, I want to understand why other modern languages, like C# or Java, don't have the concept of 'undefined behaviour'. Does it mean that the compiler designers can control all possible scenarios in C# and Java, but not in C and C++?

Peter Mortensen
Sisir
  • 9
    See also [Code with undefined behavior in C#](https://stackoverflow.com/questions/1860615/code-with-undefined-behavior-in-c-sharp), [Undefined behaviour in Java](https://softwareengineering.stackexchange.com/questions/153843/undefined-behaviour-in-java) and [What are the common undefined behaviours that Java Programmers should know about](https://stackoverflow.com/questions/376338/what-are-the-common-undefined-behaviours-that-java-programmers-should-know-about). – Theraot Sep 21 '19 at 12:27
  • 2
    see [Is asking “why” on language specs still considered as “primary opinion-based” if it can have official answers?](https://meta.stackoverflow.com/a/323382/839601) – gnat Sep 21 '19 at 16:23
  • 3
    and yet [this SO](https://softwareengineering.stackexchange.com/questions/153843/undefined-behaviour-in-java?noredirect=1&lq=1) post refers to undefined behaviour even in the Java spec! – gbjbaanb Sep 22 '19 at 11:47
  • _"Why does C++ have 'Undefined Behaviour'"_ Unfortunately, this seems to be one of those questions that's difficult to answer objectively, beyond the statement "because, for reasons X, Y, and/or Z (all of which may be `nullptr`) no one bothered to define the behavior by writing and/or adopting a proposed specification". :c – code_dredd Sep 22 '19 at 21:36
  • I'd challenge the premise. At least C# has ["unsafe" code.](https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/language-specification/unsafe-code) Microsoft writes "In a sense, writing unsafe code is much like writing C code within a C# program" and gives example reasons why one would want to do so: in order to access hardware or the OS and for speed. This is what C was invented for (hell, they *wrote* the OS in C!), so there you have it. – Peter - Reinstate Monica Sep 23 '19 at 05:20
  • The designers of C and C++ realized a programmer has to check for *intended behavior* anyhow (there's not much a language can provide to help here) - so they very probably argued that asking them to check for *defined behavior* as well is only a little extra effort. – tofro Sep 23 '19 at 10:47
  • Common Lisp has many examples of undefined behavior, but it also has "safe" mode that detects most of them at run time. – Barmar Sep 23 '19 at 14:17

11 Answers

112

Basically because the designers of Java and similar languages didn't want undefined behavior in their languages. This was a trade-off: allowing undefined behavior has the potential to improve performance, but the language designers prioritized safety and predictability over it.

For example, if you allocate an array in C, its contents are indeterminate. In Java, all elements must be initialized to 0 (or some other specified value). This means the runtime must pass over the array (an O(n) operation), while C can perform the allocation in an instant. So C will always be faster for such operations.

If the code using the array is going to populate it anyway before reading, this is basically wasted effort for Java. But in the case where the code reads first, you get predictable results in Java but unpredictable results in C.
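
To make the trade-off concrete, here is a minimal C++ sketch of the scenario described above (the function name and sizes are hypothetical; the same applies to `malloc` in C). The allocation itself does no initialization work, but the read-before-write is undefined behaviour, whereas Java would zero-fill the array and make the read well-defined:

```cpp
#include <cstddef>

// Hypothetical helper: sums the first n elements of a freshly
// allocated 1000-element array (assumes n <= 1000).
long long sum_before_writing(std::size_t n) {
    int* a = new int[1000];   // elements left uninitialized: no O(n) zeroing pass
    long long total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += a[i];        // read-before-write: undefined behaviour in C++
    delete[] a;
    return total;             // Java's `new int[1000]` would guarantee zeroes here
}
```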

JacquesB
  • 20
    Excellent presentation of the HLL dilemma: safety and ease of use vs. performance. There is no silver bullet: there are use cases for each side. – Christophe Sep 21 '19 at 15:15
  • 9
    @Christophe To be fair, there are much better approaches to a problem than letting UB go totally uncontested like C and C++. You could have a safe, managed language, with escape hatches into unsafe territory, for you to apply where beneficial. TBH, it'd be really nice to just be able to compile my C/C++ program with a flag that says "insert whatever expensive runtime machinery you need, I don't care, but just tell me about ALL of the UB that occurs." – Alexander Sep 22 '19 at 00:42
  • 4
    A good example of a data structure that _deliberately_ reads uninitialized locations is Briggs and Torczon's sparse set representation (e.g. see http://codingplayground.blogspot.com/2009/03/sparse-sets-with-o1-insert-delete.html ) Initialization of such a set is O(1) in C, but O(n) with Java's forced initialization. – Arch D. Robison Sep 22 '19 at 03:33
  • 1
    @ArchD.Robison: It's not legal to read uninitialized locations in either language. In Java, it's not possible, because all locations are initialized before the programmer gets to access them, and in C, reading from those locations produces undefined behavior which may break Briggs and Torczon's design. – Ben Voigt Sep 22 '19 at 18:52
  • 5
    Strictly, the runtime doesn't *have* to initialize the array when it's created. It merely has to ensure that reading the array gives zeroes. This is difficult if the variable is not local to some very small scope, but in principle, the runtime is not constrained to in-order execution and may lazily initialize things if it can prove that it is safe to do so. – Kevin Sep 22 '19 at 20:00
  • 9
    While it is true that forcing initialization of data makes broken programs much more predictable, it does not guarantee intended behavior: If the algorithm expects to be reading meaningful data while erroneously reading the implicitly initialized zero, that is as much a bug as if it had read some garbage. With a C/C++ program such a bug would be visible by running the process under `valgrind`, which would show exactly where the uninitialized value was used. You can't use `valgrind` on java code because the runtime does the initialization, making `valgrind`s checks useless. – cmaster - reinstate monica Sep 22 '19 at 20:31
  • @Alexander Yes, but that's *another* trade-off. Languages that can be safe by default but unsafe when required are even more complex than either in isolation. C# gives you a lot more "low level" capability than Java, but it does introduce new possibilities of shooting yourself in the foot - and there's no real way to isolate the safe code from the unsafe code; if you write the unsafe code wrong, the whole process is potentially broken. – Luaan Sep 23 '19 at 07:02
  • 5
    @cmaster Which is why the C# compiler doesn't allow you to read from uninitialized locals. No need for runtime checks, no need for initialization, just compile-time analysis. It's still a trade-off, though - there are some cases where you don't have a good way to handle branching around potentially unassigned locals. In practice, I haven't found any cases where this wasn't a bad design in the first place and better solved through rethinking the code to avoid the complicated branching (which is hard for humans to parse), but it's at least possible. – Luaan Sep 23 '19 at 07:05
  • @Alexander There's nothing stopping a compiler from offering an option that does that, and you can also use tools like `valgrind` to detect memory management errors at runtime. In C this has typically been the purview of add-on tools like `lint`. – Barmar Sep 23 '19 at 14:22
  • @Barmar Actually, no. A C++ compiler can't catch all UB. Say, for instance, I have an array of `struct`s containing an array of `int`s. I take a pointer to one of the `int`s and pass it to some function. The function does some pointer arithmetic and accesses an `int` that is out of bounds to the local `struct`, but happens to lie within the `int` array within another `struct`. The access is of type `int` on data that's initialized as `int`, but it's UB nevertheless. In order to check for all UB, you must remove pointer arithmetic from the language. – cmaster - reinstate monica Sep 23 '19 at 14:38
  • 4
    @cmaster I don't think so. The object is the array, not the struct. You can't tell whether the behavior is defined statically, but it can be detected dynamically, and the compiler could generate valgrind-like code to do this. – Barmar Sep 23 '19 at 14:42
  • @Barmar For this to work, a pointer would need to additionally store the range of allowed pointer arithmetic. Now consider this example: `struct Foo { int a; }; Foo bar[42]; int* baz = &bar[7].a;` The `int` is the first element of the struct, so I can legally derive the object pointer `Foo* object = (Foo*)baz;`. And I can legally do pointer arithmetic on this pointer `object[6].a = 42;`. Since `object` was derived from `baz`, the `int` pointer would have to carry information about the array of `Foo` objects called `bar`. Good luck trying to attach such info to a plain `int*`. – cmaster - reinstate monica Sep 23 '19 at 15:05
  • @cmaster Consider the Symbolics C implementation for Lisp Machines, it made use of the underlying hardware/firmware support for detecting array-out-of-bounds exceptions. C pointers were implemented as a cons of array and index, pointer arithmetic was performed on the index. – Barmar Sep 23 '19 at 15:08
  • 1
    @Barmar I won't deny that you can write tools that work most of the time. My point is, that you cannot catch *all* UB in C++ without also changing the language. The `array + index` approach fails exactly at the example I gave in my last comment. – cmaster - reinstate monica Sep 23 '19 at 15:12
  • 1
    @cmaster I'm not saying that all UB can be caught, but many (if not most) instances of UB are technically detectable. Performance is the reason such detection is not required, so they're left undefined. – Barmar Sep 23 '19 at 15:15
  • @Alexander there _is_ such a flag: `-fsanitize=undefined` in GCC and Clang. May be (or may not be) something else on other compilers. – Ruslan Sep 23 '19 at 16:45
79

Undefined behaviour is one of those things that were recognized as a very bad idea only in retrospect.

The first compilers were great achievements and jubilantly welcomed improvements over the alternative - machine language or assembly language programming. The problems with that were well-known, and high-level languages were invented specifically to solve those known problems. (The enthusiasm at the time was so great that HLLs were sometimes hailed as "the end of programming" - as if from now on we would only have to trivially write down what we wanted and the compiler would do all the real work.)

It wasn't until later that we realized the newer problems that came with the newer approach. Being remote from the actual machine that code runs on means there is more possibility of things silently not doing what we expected them to do. For instance, allocating a variable would typically leave the initial value undefined; this wasn't considered a problem, because you wouldn't allocate a variable if you didn't want to hold a value in it, right? Surely it wasn't too much to expect that professional programmers wouldn't forget to assign the initial value, was it?

It turned out that with the larger code bases and more complicated structures that became possible with more powerful programming systems, yes, many programmers would indeed commit such oversights from time to time, and the resulting undefined behaviour became a major problem. Even today, the majority of security leaks from tiny to horrible are the result of undefined behaviour in one form or another. (The reason is that usually, undefined behaviour is in fact very much defined by things at the next lower level of computing, and attackers who understand that level can use that wiggle room to make a program do not only unintended things, but exactly the things they intend.)

Since we recognised this, there has been a general drive to banish undefined behaviour from high-level languages, and Java was particularly thorough about this (which was comparatively easy since it was designed to run on its own specifically designed virtual machine anyway). Older languages like C can't easily be retrofitted like that without losing compatibility with the huge amount of existing code.

Edit: As pointed out, efficiency is another reason. Undefined behaviour means that compiler writers have a lot of leeway for exploiting the target architecture, so that each implementation gets away with the fastest possible implementation of each feature. This was more important on yesterday's underpowered machines than it is today, when programmer salary is often the bottleneck for software development.

Kilian Foth
  • 62
    I don’t think that a lot of people in the C community would agree with this statement. If you were to retrofit C and define undefined behavior (e.g. default-initialize everything, choose an order of evaluation for function parameters, etc.), the large base of well-behaved code would continue to work perfectly well. Only code that would not be well defined today would be disrupted. On the other side, if you leave it undefined as today, compilers would continue to be free to exploit new advances in CPU architectures and code optimisation. – Christophe Sep 21 '19 at 15:11
  • My take on this is that it was only in retrospect that we realize there's a huge gray area sandwiched between "safe" and "unsafe", and new languages can be constructed to benefit from the middle ground. C++ has a legacy in-production code base, as well as several legacy compiler implementations (and their legacy ecosystem) to require backward compatibility. Some safety mechanisms in other languages are new innovations and require significant research to lay the foundation. C++ inherits the C "pay only for what you use" which was good in the past but not so good today. – rwong Sep 21 '19 at 17:53
  • 13
    The main part of the answer does not really sound convincing to me. I mean, it's basically impossible to write a function that safely adds two numbers (as in `int32_t add(int32_t x, int32_t y)`) in C++. The usual arguments around that one are related to efficiency, but often interspersed with some portability arguments (as in "Write once, run... on the platform where you wrote it ... and nowhere else ;-)"). Roughly, one argument could therefore be: Some things are undefined because you don't know whether you're on a 16-bit microcontroller or a 64-bit server (a weak one, but still an argument) – Marco13 Sep 21 '19 at 22:46
  • 13
    @Marco13 Agreed - and getting rid of the "undefined behaviour" issue by making something "defined behaviour, but not necessarily what the user wanted and with no warning when it happens" instead of "undefined behaviour" is just playing code-lawyer games IMO. – alephzero Sep 22 '19 at 00:09
  • 3
    The answer could be improved by leaving out the judgmental "a very bad idea". It suffices to say that the environment changed. For example, implementing Java's int arithmetic on a one's-complement machine would be expensive, and such machines were still around in the '80s. Now two's complement is universal. – Arch D. Robison Sep 22 '19 at 03:23
  • 6
    If those "undefined behavior" things are such a terrible problem, why is there so much working C code? Indeed, I would go so far as to say that anyone writing performance-critical code who chooses Java or similar over C is at best ill-informed. – jamesqf Sep 22 '19 at 03:34
  • 9
    "Even today, the majority of security leaks from tiny to horrible are the result of undefined behaviour in one form or another." Citation needed. I thought most of them were XYZ injection now. – Joshua Sep 22 '19 at 04:26
  • 1
    @Joshua: many large systems involve a large number of languages. https://www.cvedetails.com/top-50-products.php A similar question on Security has an answer that suggests classifying by language-specific APIs (including OS interfaces) https://security.stackexchange.com/questions/121645/cves-aggregated-by-programming-language – rwong Sep 22 '19 at 04:41
  • 1
    @Joshua Btw, more bugs found is not necessarily a bad thing: it is the (not generally knowable) number of undiscovered exploitable bugs, a.k.a. zero-days, that describes the risk of a piece of software. – rwong Sep 22 '19 at 04:42
  • 8
    Floating point math is one of these areas where everything is defined. Compiler writers reacted by adding `-ffast-math` and `/fp:fast` because the IEEE standard doesn't map well onto most CPUs. – Simon Richter Sep 22 '19 at 07:27
  • 3
    @jamesqf There are enough languages that are usually just as fast as C that get away with little to no undefined behavior. This is not a choice between Java and C. For a while, not correctly ending a comment was UB. That's just insane. – Voo Sep 22 '19 at 08:51
  • @SimonRichter Isn't it more that with a more relaxed view, some additional compiler-transformations are possible? Because there is generally appropriate hardware floating-point available, especially for desktop and bigger. – Deduplicator Sep 22 '19 at 11:13
  • 1
    @Voo: Did I say otherwise? I was just giving an example. But it seems fairly obvious that eliminating some types of potential undefined behavior will impose performance penalties, for instance having the system initialize large arrays to zero (or some other value) on allocation, when the program is going to fill them with input or computed values. – jamesqf Sep 22 '19 at 16:25
  • 39
    _"Undefined behaviour is one of those things that were recognized as a very bad idea only in retrospect."_ That's your opinion. Many (myself included) do not share it. – Lightness Races in Orbit Sep 22 '19 at 18:17
  • 2
    @Christophe "If you would retrofit C and define undefined behavior [...] the large base of well-behaved code would continue to work perfectly well." – no it wouldn't, because e.g. 100% correct runtime checking of every pointer dereference for out-of-bounds and use-after-free is extremely expensive, so much so that a lot of code that was written in C for speed would probably become unusably slow. It's also impossible to do it without changing the ABI. Between the performance hit and the incompatibility, you might as well just use Java. – benrg Sep 22 '19 at 21:37
  • @benrg I understand that you have a strong opinion on that point. But can you in all objectivity exclude that one day someone comes out with a CVM that could handle it as decently as a JVM? Would you believe me if I told you that in the 90s I was using Rational InstantC, a C interpreter that could run at an acceptable performance on 80386 and do some extra checks on the top? By the way, my comment was not about performance, nor about "*my language is better than yours*", but simply about the reasoning error on backwards compatibility in this answer. – Christophe Sep 22 '19 at 22:07
  • 2
    @benrg: Many simple C compilers could 100% define the behavior of everything at the language level. In many cases, the behavioral specification would be something like "Load a pointer from address 8 above the stack frame, as well as a word from the address given by external symbol _x, and store the latter word to the address given in the former, *with whatever consequences result*," but it would be the platform, not the C implementation, that would responsible for determining what those consequences might be. – supercat Sep 23 '19 at 05:46
  • 1
    @benrg: In practice, very few applications would need semantics quite that precise with regard to automatic variables, and thus few applications would need implementations to define the behavior of absolutely everything, but in cases where some parts of the Standard and an implementation's documentation describe the behavior of some action, performing that action in that manner without regard for whether some other part of the Standard would characterize it as UB would greatly expand the range of programs a compiler could usefully process. – supercat Sep 23 '19 at 05:49
  • 1
    @Christophe A CVM would sort of look like having UBSan, ASan in deployment. A way to validate all pointers (type and bound) or else stop executing. (Sorry for keeping this discussion long.) – rwong Sep 23 '19 at 08:04
  • 1
    @rwong you seem to claim that this cannot be done. When java was launched in 1995, a lot of people were skeptical that JVM could be used for heavy load processing in real world situations. This proved to be wrong. So it’s not a question of technical possibility (controlling pointer is not an issue with the CPU control over virtual memory). It’s a question of interest of the language community. All this is just about the tradeoff btw performance vs. security and ease of use. Debating of what is better makes no sense: there are use cases for both sides. – Christophe Sep 23 '19 at 09:11
  • 1
    Relevant: as well as undefined behaviour, C [also has](https://stackoverflow.com/questions/2397984/undefined-unspecified-and-implementation-defined-behavior) unspecified and implementation-defined kinds. – rlms Sep 23 '19 at 13:58
  • 5
    *This was more important on yesterday's underpowered machines than it is today* Efficiency is no less important today than it used to be. In fact it might be even more important. If Amazon can increase efficiency by even 1% that is a heck of a dollar amount of money saved on energy and hardware needed. Also in today's mobile world we are dealing with devices with finite power so every little bit of more work we can do per the same watt is a win. – NathanOliver Sep 23 '19 at 14:51
  • @LightnessRacesinOrbit could you elaborate on that? What is a good side of _undefined behavior_? – Mayou36 Sep 23 '19 at 16:52
  • @Christophe: "Acceptable performance" is not a constant, it's an application-dependent variable. What might be acceptable for consumer apps is not really appreciated in fields like computational modelling. Running in 10 ms rather than 9 on your phone is seldom an issue; taking 10 weeks rather than 9 on a large cluster is another matter. The fact that we don't have the "underpowered" machines of the past just means we can run bigger problems :-) (And security is seldom a real issue for such programs.) – jamesqf Sep 23 '19 at 17:30
  • 2
    @Mayou36 (a) Don't pay for what you don't use; (b) freedom for implementations (they use assumptions-that-you-didn't-cause-UB far more than you might imagine, when optimising) – Lightness Races in Orbit Sep 23 '19 at 17:46
  • 1
    @Joshua In this list, 4 out of the top 25 are C undefined behaviour, including the top 2: https://cwe.mitre.org/top25/archive/2019/2019_cwe_top25.html – Nacht Sep 23 '19 at 23:19
  • As a C programmer for decades, this is an excellent answer. Upvoted! – Emanuel Landeholm Oct 08 '19 at 11:25
  • @Nacht: Which four are you counting as Undefined Behavior? I count six: Improper restriction of operations within buffer, out-of-bounds read, out-of-bounds write, null dereference, use after free, and integer overflow. – supercat Jul 13 '20 at 17:16
43

Undefined behavior enables significant optimization by giving the compiler latitude to do something odd or unexpected (or even normal) at boundary conditions and in other corner cases.

See http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html

Use of an uninitialized variable: This is commonly known as source of problems in C programs and there are many tools to catch these: from compiler warnings to static and dynamic analyzers. This improves performance by not requiring that all variables be zero initialized when they come into scope (as Java does). For most scalar variables, this would cause little overhead, but stack arrays and malloc'd memory would incur a memset of the storage, which could be quite costly, particularly since the storage is usually completely overwritten.


Signed integer overflow: If arithmetic on an 'int' type (for example) overflows, the result is undefined. One example is that "INT_MAX+1" is not guaranteed to be INT_MIN. This behavior enables certain classes of optimizations that are important for some code. For example, knowing that INT_MAX+1 is undefined allows optimizing "X+1 > X" to "true". Knowing the multiplication "cannot" overflow (because doing so would be undefined) allows optimizing "X*2/2" to "X". While these may seem trivial, these sorts of things are commonly exposed by inlining and macro expansion. A more important optimization that this allows is for "<=" loops like this:

for (i = 0; i <= N; ++i) { ... }

In this loop, the compiler can assume that the loop will iterate exactly N+1 times if "i" is undefined on overflow, which allows a broad range of loop optimizations to kick in. On the other hand, if the variable is defined to wrap around on overflow, then the compiler must assume that the loop is possibly infinite (which happens if N is INT_MAX) - which then disables these important loop optimizations. This particularly affects 64-bit platforms since so much code uses "int" as induction variables.
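
A hedged sketch of what these assumptions buy in practice. Whether a given compiler actually performs these folds depends on the vendor and flags, so the comments describe typical, not guaranteed, behaviour:

```cpp
// Because signed overflow is undefined, the compiler may assume
// `x + 1` never wraps, so this typically folds to `return true;`.
bool plus_one_is_greater(int x) {
    return x + 1 > x;
}

// With `i <= n` and no legal way for `i` to wrap around, the trip
// count is exactly n + 1, which unlocks unrolling and vectorization.
// If overflow were defined to wrap, n == INT_MAX would make the loop
// infinite, and those optimizations would be invalid.
long long sum_upto(int n) {
    long long total = 0;
    for (int i = 0; i <= n; ++i)
        total += i;
    return total;
}
```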

Erik Eidt
  • 29
    Of course, the real reason why signed-integer overflow is undefined is that when C was developed, there were at least three different representations of signed integers in use (one's-complement, two's-complement, sign-magnitude, and perhaps offset binary), and each gives a different result for INT_MAX+1. Making overflow undefined permits `a + b` to be compiled to the native `add b a` instruction in every situation, rather than potentially requiring a compiler to simulate some other form of signed integer arithmetic. – Mark Sep 22 '19 at 02:41
  • 2
    Allowing integer overflows to behave in *loosely defined* fashion allows significant optimizations *in cases where all possible behaviors would meet application requirements*. Most of those optimizations will be forfeit, however, if programmers are required to avoid integer overflows at all costs. – supercat Sep 22 '19 at 14:20
  • 6
    @supercat Which is another reason why avoiding undefined behaviour is more common in more recent languages - programmer time is valued a lot more than CPU time. The kind of optimizations C is allowed to do thanks to UB are essentially pointless on modern desktop computers, and make reasoning about code much harder (not to mention the security implications). Even in performance critical code, you can benefit from high-level optimizations that would be somewhat harder (or even much harder) to do in C. I have my own software 3D renderer in C#, and being able to use e.g. a `HashSet` is wonderful. – Luaan Sep 23 '19 at 07:10
  • 3
    @supercat: W.r.t. _loosely defined_, the logical choice for integer overflow would be to require _Implementation Defined_ Behavior. That is an existing concept, and it's not an undue burden on implementations. Most would get away with "it's 2's complement with wrap-around", I suspect. `<<` might be the difficult case. – MSalters Sep 23 '19 at 10:06
  • @MSalters There is an simple and well studied solution that is neither undefined behavior or implementation defined behavior: nondeterministic behavior. That is, you can say "`x << y` evaluates to some valid value of the type `int32_t` but we won't say which". This allows implementers to use the fast solution, but does not act as a false precondition allowing time-travel style optimizations because the nondeterminism is constrained to the output of this one operation - the spec guarantees that memory, volatile variables, etc are not visibly affected by the expression evaluation. ... – Mario Carneiro Sep 24 '19 at 01:20
  • ... Most modern languages that use "undefined behavior" mean this kind of behavior, with constrained nondeterminism. Only C/C++ take UB to mean the "nasal demon" style UB AFAIK. – Mario Carneiro Sep 24 '19 at 01:22
  • @MarioCarneiro: Requiring an unspecified choice would impede some optimizations compared with non-deterministic behavior, but for the latter to really be useful there must be a way of coercing it to defined behavior. For example, if a compiler knows that `x` will be positive when it executes `int y=x+x;`, and `z` will be zero, a non-determinisitic model would allow the compiler to evaluate `y<0` as false but `y – supercat Sep 24 '19 at 14:29
  • @MarioCarneiro: Making trapping behavior implementation-defined would add additional complications. If a compiler determines that `int x` and `int y` are loop invariant, and `x/y` is computed within the loop, should a compiler be required to ensure that any side-effects that precede the computation will be executed even if `y` is zero, or should it be allowed to hoist the computation to the start of the first iteration of the loop so it can have subsequent iterations branch past it? – supercat Sep 24 '19 at 14:53
  • @supercat When I say a nondeterministic value, I mean it's as if the arch rolled the dice and assigned `y` to some actual value. It's not a special "other" value, it's an actual value of the type, and so if `z = 0` then `y<0` and `y – Mario Carneiro Sep 24 '19 at 18:09
  • As for trapping behavior, you are right that this adds complication, and my nondeterministic stores imply that no trap is triggered. (If some arch decides to trap then the compiler would have to work around that.) I would much rather see compiler visible asserts available to create preconditions rather than inferred preconditions from usage. In the absence of a precondition, x/y is a side effecting call so you can't move it out of the loop. But I think it is pretty important for a spec to say where control flow ends up after a call to x/0. – Mario Carneiro Sep 24 '19 at 18:17
  • @MarioCarneiro: When I say "non-deterministic" I mean in the sense of a non-deterministic finite automaton. If applying a cast operator to a non-deterministic value selects a possibility in Unspecified fashion, and `x` is a non-deterministic value, then `(int)((x & 3)+(x & 9))` would yield an arbitrarily-chosen value from the set {0,1,2,3,4,8,9,10,11,12}. – supercat Sep 24 '19 at 18:41
  • I think we are saying the same things, but just to clarify: I agree on your definition of nondeterminism, but in any individual run `x` has a particular value, and `((x & 3)+(x & 9))` computes a function of that value, which is why it would always lie among those 10 values. But if you compute the same function of `x` twice (in the same run) you will always get the same result twice. ... – Mario Carneiro Sep 24 '19 at 18:50
  • A "nondeterministic value" is an abstraction of the values that a variable receives across multiple runs (resolutions of the nondeterminism), so you wouldn't say what an operation does when given a nondeterministic value because operations only receive particular values; the nondeterminism is only globally visible. – Mario Carneiro Sep 24 '19 at 18:51
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/99060/discussion-between-supercat-and-mario-carneiro). – supercat Sep 24 '19 at 19:08
20

In C's early days, there was a lot of chaos. Different compilers treated the language differently. When there was interest in writing a specification for the language, that specification needed to be fairly backwards-compatible with the C that programmers were already relying on with their compilers. But some of those details are non-portable and do not make sense in general, for example assuming a particular endianness or data layout. The C standard therefore reserves a lot of details as undefined or implementation-specified behaviour, which leaves a lot of flexibility to compiler writers. C++ builds upon C and also features undefined behaviour.

Java tried to be a much safer and much simpler language than C++. Java defines the language semantics in terms of a thoroughly specified virtual machine. This leaves little room for undefined behaviour; on the other hand, it imposes requirements that can be difficult for a Java implementation to meet (e.g. that reference assignments must be atomic, or exactly how integers behave). Where Java supports potentially unsafe operations, they are usually checked by the virtual machine at runtime (for example, some casts).
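
For contrast, here is a small C++ sketch (the class names are hypothetical): the unchecked downcast is C++'s "trust the programmer" approach, while `dynamic_cast` performs the kind of runtime check a JVM applies to every reference cast:

```cpp
struct Base    { virtual ~Base() = default; };
struct Derived : Base { int payload = 42; };

// C++ style: no runtime check. If *b is not actually a Derived,
// the cast (and the access through it) is undefined behaviour.
int unchecked_downcast(Base* b) {
    return static_cast<Derived*>(b)->payload;
}

// JVM style: the cast is verified at run time. A failed Java cast
// throws ClassCastException; dynamic_cast yields nullptr instead.
int checked_downcast(Base* b) {
    if (Derived* d = dynamic_cast<Derived*>(b))
        return d->payload;
    return -1;  // hypothetical fallback for the mismatch case
}
```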

amon
  • So are you saying backwards compatibility is the only reason why C and C++ are not getting rid of undefined behaviours? – Sisir Sep 21 '19 at 12:41
  • 3
    It's definitely one of the bigger ones, @Sisir. Even among experienced programmers, you'd be surprised how much stuff that shouldn't break _does_ break when a compiler changes how it handles undefined behaviour. (Case in point, there was a bit of chaos when GCC started optimising out "is `this` null?" checks a while back, on the grounds that `this` being `nullptr` is UB, and thus can never actually happen.) – Justin Time - Reinstate Monica Sep 21 '19 at 23:11
  • 10
    @Sisir, another big one is speed. In C's early days, hardware was far more heterogeneous than it is today. By simply not specifying what happens when you add 1 to INT_MAX, you can let the compiler do whatever is fastest for the architecture (eg. a one's-complement system will produce -INT_MAX, while a two's-complement system will produce INT_MIN). Similarly, by not specifying what happens when you read past the end of an array, you can have a system with memory protection terminate the program, while one without won't need to implement expensive runtime bounds-checking. – Mark Sep 22 '19 at 02:47
14

JVM and .NET languages have it easy:

  1. They don't have to be able to work directly with hardware.
  2. They only have to work with modern desktop and server systems or reasonably similar devices, or at least devices designed for them.
  3. They can impose garbage-collection for all memory, and forced initialization, thus getting pointer-safety.
  4. They got specified by a single actor who also provided the single definitive implementation.
  5. They get to choose safety over performance.

There are good points in favour of the opposite choices, though:

  1. Systems programming is a whole different ballgame, and uncompromisingly optimising for application programming instead is reasonable.
  2. Admittedly, there is less exotic hardware all the time, but small embedded systems are here to stay.
  3. GC is ill-suited for non-fungible resources, and trades much more space for good performance. And most (but not nearly all) forced initializations can be optimized away.
  4. There are advantages to more competition, but committees mean compromise.
  5. All those bounds-checks do add up, even though most can be optimized away. Null pointer checks can mostly be done by trapping access for zero overhead thanks to virtual address space, though optimisation is still inhibited (see the sketch below).

Where escape-hatches are provided, those invite full-blown undefined behavior back in. But at least they are generally only used in a few very short stretches, which are thus easier to verify manually.
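
As a minimal illustration of the bounds-check trade-off from point 5, both accessors below are standard C++: the checked one pays for a branch (unless the optimizer can prove it away), while the unchecked one makes an out-of-range index undefined behaviour:

```cpp
#include <cstddef>
#include <vector>

// Checked access, Java-style: a failed check throws.
int checked(const std::vector<int>& v, std::size_t i) {
    return v.at(i);  // throws std::out_of_range when i >= v.size()
}

// Unchecked access, C-style: no branch on the hot path,
// but an out-of-range index is undefined behaviour.
int unchecked(const std::vector<int>& v, std::size_t i) {
    return v[i];
}
```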

Deduplicator
  • 3
    Indeed. I program in C# for my job. Every once in awhile I reach for one of the unsafe-hammers (`unsafe` keyword or attributes in `System.Runtime.InteropServices`). By keeping this stuff to the few programmers who know how to debug unmanaged stuff and again as little of it as practical, we keep issues down. It's been more than 10 years since the last performance-related unsafe-hammer but sometimes you gotta do it because there's literally no other solution. – Joshua Sep 22 '19 at 04:32
  • 21
    I frequently work on a platform from analog devices where sizeof (char) == sizeof (short) == sizeof (int) == sizeof (float) == 1. It also does saturating addition (so INT_MAX+1 == INT_MAX), and the nice thing about C is that I can have a conforming compiler that generates reasonable code. If the language mandated say twos complement with wrap around then every addition would end up with a test and a branch, something of a non starter in a DSP focused part. This is a current production part. – Dan Mills Sep 22 '19 at 14:08
  • 1
    "only have to work with modern desktop and server systems or reasonably similar devices" is definitely not true, Java runs on a lot of low resource phones and other embedded systems (they're too small for JIT compilation and too slow for software interpretation, so the interpreter has hardware acceleration, typically "Jazelle") – Ben Voigt Sep 22 '19 at 18:55
  • @BenVoigt And you really wouldn't classify that hardware acceleration as making the device "reasonably similar"? – Deduplicator Sep 22 '19 at 19:00
  • 5
    @BenVoigt Some of us live in a world where a small computer is maybe 4k of code space, a fixed 8 level call/return stack, 64 bytes of RAM, a 1MHz clock and costs <$0.20 in quantity 1,000. A modern mobile phone is a small PC with pretty much unlimited storage for all intents and purposes, and can be pretty much treated as a PC. Not all the world is multicore and lacks hard realtime constraints. – Dan Mills Sep 22 '19 at 23:17
  • 2
    @DanMills: Not talking about modern mobile phones here with Arm Cortex A processors, talking about "feature phones" circa 2002. Yes 192kB of SRAM is a lot more than 64 bytes (which is not "small" but "tiny"), but 192kB also hasn't been accurately called "modern" desktop or server for 30 years. Also these days 20 cents will get you an MSP430 with a lot more than 64 bytes of SRAM. – Ben Voigt Sep 22 '19 at 23:37
  • 1
    @Deduplicator: Nope, Jazelle is about the most dissimilar Java execution environment from a desktop or server JVM as you can get. – Ben Voigt Sep 22 '19 at 23:37
  • @BenVoigt At least it's designed for it. Changed the answer. – Deduplicator Sep 23 '19 at 02:38
  • 1
    You might want to have a look at some experimental managed memory OSes. It's not likely they'll ever be used for desktops or servers (since they just aren't compatible with native applications, whatever the ABI/CPU), but having *everything* managed makes it possible to do optimizations that native code couldn't ever do. The point about tiny embedded systems still stands, but once you get to something like a hundred kiB of RAM, you're fine. Managed doesn't strictly require a GC either - it's just a relatively simple solution to having managed memory. It's all about the trade-offs. – Luaan Sep 23 '19 at 07:19
  • 3
    @BenVoigt 192kB might not be a desktop in the last 30 years, but I can assure you that it is entirely sufficient to be serving web pages, which I would argue makes such a thing a server by the very definition of the word. Fact is that that is an entirely reasonable (generous, even) amount of ram for a LOT of embedded applications that often include configuration web servers. Sure, I probably am not running amazon on it, but I just might be running a fridge complete with IOT crapware on such a core (With time and space to spare). Don't nobody need interpreted or JIT languages for that! – Dan Mills Sep 23 '19 at 16:02
  • @Luaan: Some managed execution environments for embedded platforms are experimental, and others are in full production. But there's no truth to "having everything managed makes it possible to do optimizations that native code couldn't ever do". The advantages are ease of use, and ability to guarantee fault isolation. But native code can (and on small devices often does) play all the same performance tricks that managed environments can, such run in a single address space allowing direct sharing of memory between tasks. – Ben Voigt Sep 23 '19 at 16:26
  • @DanMills: Sorry but "server" in "modern desktop and server systems" does not mean "anything capable of accepting a TCP connection". And only someone totally ignoring computer security concerns would think that resource level is acceptable for operation on the Internet (which is after all a defining factor in IoT). On a physically secure local network, why not? But on the Internet? Such a device is guaranteed not to survive even a low-bandwidth DoS attack. Having one incoming TCP connection does not make something a server platform. – Ben Voigt Sep 23 '19 at 16:31
  • @BenVoigt It's the software isolation part that's impossible. The point is that you don't lose any of the safety guarantees, but avoid the runtime cost of hardware process isolation. – Luaan Sep 23 '19 at 17:44
  • @BenVoigt Oh sure IOT sucks, but you over estimate the resources required to support TCP/IP. You can pretty much merge the Ethernet, TCP & IP layers into a state machine and some timers, and somewhere between 16 and 64 bytes of state per connection (Plus one 1500 byte shared buffer)? You have to be explicit about how many connections you can support, but that holds anywhere. The elephant in the security room is that I never know what IP address my box will have, and most CA are reluctant to issue me a wildcard for * with an expiry far enough in the future to suit my 20 year design life. – Dan Mills Sep 23 '19 at 20:23
  • @DanMills: 1500 byte buffer is only enough if you ignore fragmentation... according to the standards you need 64k, and even "support for most existing peers" you should have 9k. And with just one buffer your throughput will be miserable. You need cryptographically unpredictable initial sequence numbers (to prevent your one valid connection from getting killed by rogue TCP RST). You need SYN cookies (to prevent half-open connections from filling your tiny connections table). You need to handle arbitrary TCP options (like large sliding window) in some way even if you're going to reject them. – Ben Voigt Sep 23 '19 at 20:29
  • @BenVoigt, Given that the raw ethernet frame is limited to 1500 bytes, (Ignoring jumbo frames, seems reasonable), and I can process the bytes as they come in with a state machine rather then waiting to assemble a whole packet), I only need buffering to deal with out of order frames, (and you can just nak them...). Randomness is actually not that hard, think zener diode, or reverse breakdown in a be junction and an ADC (All little processors have ADCs these days).... Now making it crypographically **good** is hard, but good enough to beat syn and rst flooding? Not so much. – Dan Mills Sep 23 '19 at 21:48
9

First a quick aside: for the purposes of this answer, I'm going to lump "undefined behavior" and "implementation defined behavior" together as all being "undefined behavior". The primary difference between the two is that an implementation needs to document implementation defined behavior. At least to me this seems like a small enough difference that it doesn't matter much.

The real reason comes down to a fundamental difference in intent between C and C++ on one hand, and Java and C# (for only a couple of examples) on the other. For historical reasons, much of the discussion here talks about C rather than C++, but (as you probably already know) C++ is a fairly direct descendant of C, so what it says about C applies equally to C++.

Although they're largely forgotten (and their existence sometimes even denied), the very first versions of UNIX were written in assembly language. Much (if not all) of the original purpose of C was to port UNIX from assembly language to a higher-level language. Part of the intent was to write as much of the operating system as possible in a higher-level language--or looking at it from the other direction, to minimize the amount that had to be written in assembly language.

To accomplish that, C needed to provide nearly the same level of access to the hardware as assembly language did. One of the stated goals of C++ has always been that it should continue to provide the same low-level access to hardware as C does.

The PDP-11 (for one example) mapped I/O registers to specific addresses. For example, you'd read one memory location to check whether a key had been pressed on the system console. One bit was set in that location when there was data waiting to be read. You'd then read a byte from another specified location to retrieve the ASCII code of the key that had been pressed.

Likewise, if you wanted to print some data, you'd check another specified location, and when the output device was ready, you'd write your data to yet another specified location.

To support writing drivers for such devices, C allowed you to specify an arbitrary location using some integer type, convert it to a pointer, and read or write that location in memory.
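
A sketch of what such a driver fragment might look like in C++ terms. The register addresses and the status bit here are purely illustrative, not the real PDP-11 ones:

```cpp
#include <cstdint>

// Hypothetical device register addresses, for illustration only.
constexpr std::uintptr_t KBD_STATUS = 0xFF70;  // bit 7: data ready
constexpr std::uintptr_t KBD_DATA   = 0xFF72;  // ASCII code of the key

char read_key() {
    // Turn integers into pointers to device registers. The language
    // allows the conversion, but what an access through such a pointer
    // actually does depends entirely on the hardware, which is exactly
    // why the standard cannot define it.
    volatile std::uint8_t* status =
        reinterpret_cast<volatile std::uint8_t*>(KBD_STATUS);
    volatile std::uint8_t* data =
        reinterpret_cast<volatile std::uint8_t*>(KBD_DATA);

    while ((*status & 0x80) == 0) { /* busy-wait for a keypress */ }
    return static_cast<char>(*data);
}
```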

Of course, this has a pretty serious problem: not every machine on earth has its memory laid out identically to a PDP-11 from the early 1970's. So, when you take that integer, convert it to a pointer, and then read or write via that pointer, nobody can provide any reasonable guarantee about what you're going to get. Just for an obvious example, reading and writing may map to separate registers in the hardware, so (unlike normal memory) if you write something, then try to read it back, what you read may not match what you wrote.

I can see a few possibilities that this leaves:

  1. Define an interface to all possible hardware--specify the absolute addresses of all the locations you might want to read or write to interact with hardware in any way.
  2. Prohibit that level of access, and decree that anybody who wants to do such things needs to use assembly language.
  3. Allow people to do that, but leave it up to them to read (for example) the manuals for the hardware they're targeting, and write the code to fit the hardware they're using.

Of these, 1 seems sufficiently preposterous that it's hardly worth further discussion. 2 is basically throwing away the basic intent of the language. That leaves the third option as essentially the only one they could reasonably consider at all.

Another point that comes up fairly frequently is the sizes of integer types. C takes the "position" that int should be the natural size suggested by the architecture. So, if I'm programming a 32-bit VAX, int should probably be 32 bits, but if I'm programming a 36-bit Univac, int should probably be 36 bits (and so on). It's probably not reasonable (and might not even be possible) to write an operating system for a 36-bit computer using only types that are guaranteed to be multiples of 8 bits in size. Maybe I'm just being superficial, but it seems to me that if I were writing an OS for a 36-bit machine, I'd probably want to use a language that supported a 36-bit type.

From a language viewpoint, this leads to still more undefined behavior. If I take the largest value that will fit into 32 bits, what will happen when I add 1? On typical 32-bit hardware, it's going to roll over (or possibly throw some sort of hardware fault). On the other hand, if it's running on 36-bit hardware, it'll just...add one. If the language is going to support writing operating systems, you can't guarantee either behavior--you just about have to allow both the sizes of types and the behavior of overflow to vary from one to another.

Java and C# can ignore all of that. They aren't intended to support writing operating systems. With them, you have a couple of choices. One is to make the hardware support what they demand--since they demand types that are 8, 16, 32 and 64 bits, just build hardware that supports those sizes. The other obvious possibility is for the language to only run on top of other software that provides the environment they want, regardless of what the underlying hardware might want.

In most cases, this isn't really an either/or choice. Rather, many implementations do a little of both. You normally run Java on a JVM running on an operating system. More often than not, the OS is written in C, and the JVM in C++. If the JVM is running on an ARM CPU, chances are pretty good that the CPU includes ARM's Jazelle extensions, to tailor the hardware more closely to Java's needs, so less needs to be done in software, and the Java code runs faster (or less slowly, anyway).

Summary

C and C++ have undefined behavior, because nobody's defined an acceptable alternative that allows them to do what they're intended to do. C# and Java take a different approach, but that approach fits poorly (if at all) with the goals of C and C++. In particular, neither seems to provide a reasonable way to write system software (such as an operating system) on most arbitrarily chosen hardware. Both typically depend on facilities provided by existing system software (usually written in C or C++) to do their jobs.

Jerry Coffin
  • Very nice answer, which I hadn't noticed previously. Unfortunately, nobody's "officially" defined a language that provides practical means by which programmers can reliably accomplish the things that could be done in pre-standard C without the compiler "optimizations" getting in the way. – supercat Jul 13 '20 at 17:18
8

Java and C# are characterized by a dominant vendor, at least early in their development (Sun and Microsoft, respectively). C and C++ are different; they've had multiple competing implementations from early on. C especially ran on exotic hardware platforms, too. As a result, there was variation between implementations. The ISO committees that standardized C and C++ could agree on a large common denominator, but at the edges where implementations differ, the standards left room for the implementation.

This is also because choosing one behavior might be expensive on hardware architectures that are biased towards another choice - endianness is the obvious example.

MSalters
  • What does a “large common denominator” mean **literally**? Are you talking about subsets or supersets? Do you realy mean enough factors in common? Is this like the least common multiple or the greatest common factor? This is very confusing for us robots who don't speak street lingo, just maths. :) – tchrist Sep 23 '19 at 02:31
  • @tchrist: The common behavior is a subset, but this subset is pretty abstract. In many areas left unspecified by the common standard, real implementations must make a choice. Now some of those choices are pretty clear and therefore implementation-defined, but others are more vague. Memory layout at runtime is an example: there has to be _a_ choice, but it's not clear how you'd document it. – MSalters Sep 23 '19 at 06:57
  • 2
    The original C was made by one guy. It already had plenty of UB, by design. Things certainly got worse as C became popular, but UB was there from the very beginning. Pascal and Smalltalk had far less UB and were developed at pretty much the same time. The main advantage C had was that it was extremely easy to port - all the portability issues were delegated to the application programmer :P I've even ported a simple C compiler to my (virtual) CPU; doing something like LISP or Smalltalk would have been far greater effort (though I did have a limited prototype for a .NET runtime :). – Luaan Sep 23 '19 at 07:29
  • @Luaan: Would that be Kernighan or Ritchie? And no, it didn't have Undefined Behavior. I know, I have had the original AT&T stenciled compiler documentation on my desk. The implementation did what it did. There was no distinction between unspecified and undefined behavior. – MSalters Sep 23 '19 at 10:00
  • 5
    @MSalters Ritchie was the first guy. Kernighan only joined (not much) later. Well, it didn't have "Undefined Behaviour", because that term didn't exist yet. But it did have the same behaviour that would today be called undefined. Since C didn't have a specification, even "unspecified" is a stretch :) It was just something the compiler didn't care about, and the details were up to application programmers. It wasn't designed to produce portable *applications*, only the compiler was meant to be easy to port. – Luaan Sep 23 '19 at 11:00
  • 1
    @Luaan: Many actions that today's "Standard" characterizes as "Undefined Behavior" had unambiguous semantics when targeting systems similar to the PDP-11 for which C was designed. On that system, adding two integers would yield the bottom 16 bits of the result, interpreted as a two's-complement number, and there was no doubt but that implementations for 16-bit quiet-wraparound two's-complement platforms should behave the same way, though implementations for other kinds of platforms should behave differently. The Standard deliberately waived jurisdiction over "non-portable" constructs... – supercat Feb 28 '23 at 00:03
  • 1
    ...but failed to make sufficiently abundantly clear that such waiver of jurisdiction was not in any way, shape, or form intended to prohibit the use of such constructs within programs *that weren't intended to be portable*, and that support for such constructs was a *quality of implementation* issue. – supercat Feb 28 '23 at 00:05
5

The authors of the C Standard expected their readers to recognize something they thought was obvious, and alluded to in the published Rationale, but didn't say outright: the Committee shouldn't need to order compiler writers to meet their customers' needs, since the customers should know better than the Committee what their needs are. If it's obvious that compilers for certain kinds of platforms are expected to process a construct a certain way, nobody should care whether the Standard says that construct invokes Undefined Behavior. The Standard's failure to mandate that conforming compilers process a piece of code usefully in no way implies that programmers should be willing to buy compilers that don't.

This approach to language design works very well in a world where compiler writers need to sell their wares to paying customers. It completely falls apart in a world where compiler writers are isolated from the effects of the marketplace. It's doubtful the proper market conditions will ever exist to steer a language the way they had steered the one that became popular in the 1990s, and even more doubtful that any sane language designer would want to rely upon such market conditions.

supercat
  • I feel that you have described something important here, but it escapes me. Could you clarify your answer? Especially the second paragraph: it says the conditions now and the conditions earlier are different, but I don't get it; what exactly changed? Also, the "way" is now different than earlier; maybe explain this too? – anatolyg Sep 22 '19 at 16:24
  • 4
    Seems your campaign to replace all undefined behavior with unspecified behavior or something more constrained is still going strong. – Deduplicator Sep 22 '19 at 16:56
  • 2
    @anatolyg: If you haven't already, read the published C Rationale document (type C99 Rationale in Google). Page 11 lines 23-29 talk about the "marketplace", and page 13 lines 5-8 talk about what is intended with regard to portability. How do you think a boss at a commercial compiler company would react if a compiler writer told programmers who complained that the optimizer broke code that every other compiler handled usefully that their code was "broken" because it performs actions not defined by the Standard, and refused to support it because that would promote the continued... – supercat Sep 23 '19 at 05:02
  • 2
    ...use of such constructs? Such a viewpoint is readily apparent on the support boards of clang and gcc, and has served to impede the development of intrinsics that could facilitate optimization far more easily and safely than the broken language gcc and clang want to support. – supercat Sep 23 '19 at 05:04
  • 1
    @supercat: You're wasting your breath complaining to the compiler vendors. Why not direct your concerns to the language committees? If they agree with you, an errata will be issued which you can use to beat the compiler teams over the head. And that process is much quicker than the development of a new version of the language. But if they disagree, you're at least going to get actual reasons, whereas the compiler writers are just going to repeat (over and over) "We didn't designate that code broken, that decision was made by the language committee and we follow their decision." – Ben Voigt Sep 23 '19 at 21:23
  • 1
    @BenVoigt: My goal is not to complain to compiler vendors, but rather to make people aware that some compiler writers openly disregard the stated intentions of the authors of the C Standard. – supercat Sep 23 '19 at 23:37
  • In C, it isn’t possible to check whether a char* points to an array element of an array of char without either undefined behaviour or iterating through all array elements. That’s because calculating the difference between unrelated pointers is undefined behaviour instead of producing an unspecified result. – gnasher729 Feb 27 '23 at 21:11
  • @gnasher729: Both clang and gcc will assume that an access to a pointer which is coincidentally equal to a "just past" pointer for one object will not interact with the following object, even in cases where the pointer was in fact formed by taking the address of the latter object. – supercat Feb 27 '23 at 21:25
4

C++ and C both have descriptive standards (the ISO versions, anyway).

These standards exist only to explain how the languages work and to provide a single reference for what the language is. Typically, compiler vendors and library writers lead the way, and some suggestions get included in the main ISO standard.

Java and C# (or Visual C#, which I assume you mean) have prescriptive standards. They tell you what's in the language definitively ahead of time, how it works, and what's considered permitted behavior.

More important than that, Java actually has a "reference implementation" in OpenJDK. (I think Roslyn counts as the Visual C# reference implementation, but I couldn't find a source for that.)

In Java's case, if there's any ambiguity in the standard and OpenJDK does it a certain way, then the way OpenJDK does it is the standard.

bobsburner
  • 1
    The situation is worse than that: I don't think the Committee has ever achieved consensus about whether it's supposed to be descriptive or prescriptive. – supercat Oct 24 '19 at 22:16
1

Undefined behaviour allows the compiler to generate very efficient code on a variety of architectures. Erik's answer mentions optimization, but it goes beyond that.

For example, signed overflows are undefined behaviour in C. In practice the compiler was expected to generate a simple signed addition opcode for the CPU to execute, and the behaviour would be whatever that particular CPU did.

That allowed C to perform very well and produce very compact code on most architectures. If the standard had specified that signed integers had to overflow in a certain way, then CPUs that behaved differently would have needed much more generated code for a simple signed addition.

That's the reason for much of the undefined behaviour in C, and why things like the size of int vary between systems. Int is architecture dependent and generally selected to be the fastest, most efficient data type that is larger than a char.

Back when C was new these considerations were important. Computers were less powerful, often having limited processing speed and memory. C was used where performance really mattered, and developers were expected to understand how computers worked well enough to know what these undefined behaviours would actually be on their particular systems.

Later languages such as Java and C# preferred eliminating undefined behaviour over raw performance.

user
-5

In a sense, Java also has it. Suppose you give an incorrect comparator to Arrays.sort. It can throw an exception if it detects this. Otherwise it will sort the array in some way that is not guaranteed to be any particular order.

Similarly, if you modify a variable from several threads, the results are also unpredictable.

C++ just went further in making more situations undefined (or rather, Java decided to define more operations), and it gave the concept a name.
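
For comparison, the closest C++ analogue of the comparator case is genuinely undefined rather than merely unpredictable: handing `std::sort` a comparator that is not a strict weak ordering is undefined behaviour, and real implementations have been observed to read out of bounds or fail to terminate. A minimal sketch:

```cpp
#include <algorithm>
#include <vector>

int main() {
    std::vector<int> v{3, 1, 4, 1, 5, 9, 2, 6};

    // `<=` is not a strict weak ordering (it reports a < a as true).
    // Java's Arrays.sort may throw or produce some unspecified order;
    // in C++ this call is undefined behaviour outright.
    std::sort(v.begin(), v.end(),
              [](int a, int b) { return a <= b; });
}
```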

RiaD
  • 4
    That's not undefined behavior of the sort we're talking about here. "Incorrect comparators" come in two types: ones that define a total ordering, and ones that don't. If you provide a comparator that consistently defines the relative ordering of items, the behavior is well-defined, it's just not the behavior that the programmer wanted. If you provide a comparator that isn't consistent about the relative ordering, the behavior is still well-defined: the sort function will throw an exception (which also probably isn't the behavior the programmer wanted). – Mark Sep 22 '19 at 02:57
  • 2
    As for modifying variables, race conditions generally aren't considered undefined behavior. I don't know the details of how Java handles assignments to shared data, but knowing the general philosophy of the language, I'm pretty sure it's required to be atomic. Simultaneously assigning 53 and 71 to `a` would be undefined behavior if you could get 51 or 73 out of it, but if you can only get 53 or 71, it's well-defined. – Mark Sep 22 '19 at 03:01
  • @Mark With chunks of data larger than the native word size of the system (for example, a 32 bit variable on a 16-bit word size system), it is possible to have an architecture that requires storing each 16-bit portion separately. (SIMD is another potential such situation.) In that case, even a simple source-code-level assignment is not necessarily atomic unless special care is taken by the compiler to ensure that it is executed atomically. – user Sep 23 '19 at 08:29
  • On a 68000 processor, the add instruction set flags to indicate whether the result is mathematically >, = or < 0. Even in the presence of overflow. So intmax + 1 sets the “greater” flag even though the result is negative. If you don’t allow undefined behaviour you need an additional compare with zero to find the result is negative. – gnasher729 Feb 27 '23 at 21:17
  • On intel processors, shift by 64 bit positions is the same as shift by 0 positions. On ARM, shift by n does the same as shift by (n modulo 256). As a result, shift by the number of bits in the type is undefined behaviour. Which is very inconvenient. Making it defined on intel would make all shifting code slow. – gnasher729 Feb 27 '23 at 21:22