I am pretty sure (though I don't have a proof) that you can turn any static error into a syntax error by making the grammar arbitrarily more complex.
Historically, we have distinguished between lexical analysis (lexing/scanning), syntactic analysis (parsing), and semantic analysis (type checking, type inference, name resolution, etc.) for performance reasons: it was not possible to fit an entire compiler into the memory of a computer, so we broke it down into several components:
- The reader turns a stream of bits into a stream of characters (this is usually a simple mapping table, but it might be more complex in the case of variable-length encodings such as UTF-8)
- The lexer turns a stream of characters into a stream of tokens (typically, a lexer recognizes a regular language, i.e. it is a finite automaton; see the sketch after this list)
- The parser turns a stream of tokens into a parse tree, also known as a concrete syntax tree (typically, a parser recognizes a context-free language)
- The semantic analyzer first turns a parse tree into an abstract syntax tree (using arbitrary Turing-capable computation)
- The various further components of the compiler annotate the AST with additional information, transform it into a different representation, etc.
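To make the lexer stage a bit more concrete, here is a minimal sketch in Java (the names are mine, not taken from any real compiler) of a finite-automaton-style lexer that recognizes just integer literals and identifiers:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal, hypothetical lexer: it classifies character sequences into integer
// literals and identifiers and skips whitespace. It knows nothing about types
// or value ranges; it only recognizes a regular language.
final class ToyLexer {
    enum Kind { INT_LITERAL, IDENTIFIER }
    record Token(Kind kind, String text) {}

    static List<Token> lex(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                tokens.add(new Token(Kind.INT_LITERAL, input.substring(start, i)));
            } else if (Character.isLetter(c)) {
                int start = i;
                while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) i++;
                tokens.add(new Token(Kind.IDENTIFIER, input.substring(start, i)));
            } else {
                throw new IllegalArgumentException("unexpected character: " + c);
            }
        }
        return tokens;
    }
}
```

Note that this lexer happily produces an INT_LITERAL token for 300 or 99999; whether that value is acceptable anywhere is simply not its concern.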
Nowadays, this is no longer strictly necessary. The reader is typically supplied by the operating system or the language's standard library. Scannerless parsers are a thing, i.e. parsers that integrate scanning into the parsing process.
And, if we want to, we can build arbitrarily complex parsers that recognize much more than just context-free languages. Modern lexer and parser frameworks both allow you to build lexers and parsers that are more powerful than the traditional regular / context-free ones, and they allow you to attach arbitrary code in a Turing-complete language. E.g. when you generate C code with GNU Bison, you can embed arbitrary C code of your own into the generated parser, which can interoperate with and influence the generated part of the code.
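To illustrate what attaching arbitrary code can buy you, here is a hand-written sketch in Java (not a generated Bison/C parser; the tiny grammar and all names are made up): the parser maintains a symbol table while parsing and immediately rejects uses of undeclared names, a declared-before-use property that a purely context-free grammar cannot express:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a parser with embedded actions. Informal grammar:
//   program   := statement*
//   statement := "decl" NAME | "use" NAME
// The embedded code updates and consults a symbol table during parsing, so the
// parser only accepts programs whose names are declared before use, which is
// not a context-free property.
final class DeclUseParser {
    private final Set<String> declared = new HashSet<>();

    void parse(String[] tokens) {
        for (int i = 0; i + 1 < tokens.length; i += 2) {
            String keyword = tokens[i];
            String name = tokens[i + 1];
            switch (keyword) {
                case "decl" -> declared.add(name);
                case "use" -> {
                    if (!declared.contains(name)) {
                        throw new RuntimeException("parse error: '" + name + "' used before declaration");
                    }
                }
                default -> throw new RuntimeException("parse error: unexpected token '" + keyword + "'");
            }
        }
    }
}
```

Calling `new DeclUseParser().parse(new String[] {"decl", "x", "use", "x"})` succeeds, while `parse(new String[] {"use", "y"})` fails during parsing.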
In your example, it would be easy to modify the grammar to only allow integers from 0–255. In a language like Java, where you can use `byte`s in some places and `long`s in others, expressing this as a grammar would be much more complex. So what the designers of Java did was to define just one integer literal, and then define semantic rules for how this literal is to be interpreted in different contexts.
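For example (the class name is arbitrary), the Java compiler treats the very same decimal literal differently depending on the context it appears in:

```java
class LiteralContexts {
    byte a = 100;       // OK: the int constant 100 fits into byte, so the
                        //     implicit narrowing conversion is allowed
    // byte b = 200;    // compile-time error: 200 does not fit into byte
                        //     ("possible lossy conversion from int to byte")
    long c = 200;       // OK: the same literal, implicitly widened to long
    long d = 200L;      // OK: an explicitly long-typed literal
}
```

The token 200 is the same in all of these places; what differs is the semantic rule (assignment conversion) that is applied to it.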
However, I do believe it would be possible to check this in the parsing stage and thus make it a syntax error … it would just make the parser more complex; in fact, it would make the parser as complex as the entire compiler (or at least as the analysis part of the compiler, minus the optimizer and code generator).
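For the simple 0–255 case, here is a sketch of what folding the check into the syntax could look like (hypothetical names, with the "grammar" reduced to a single token rule written as a regular expression):

```java
// Hypothetical sketch: the range restriction 0..255 is part of the token
// grammar itself, so an out-of-range literal is rejected as a syntax error
// rather than being flagged later by semantic analysis.
final class RangeCheckedParser {
    // Matches exactly the decimal spellings of 0..255 (no leading zeros):
    // already noticeably more complicated than "one or more digits".
    private static final String BYTE_LITERAL =
            "25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]";

    static int parseByteLiteral(String token) {
        if (!token.matches(BYTE_LITERAL)) {
            throw new RuntimeException("syntax error: '" + token + "' is not a byte literal (0..255)");
        }
        return Integer.parseInt(token);
    }
}
```

`parseByteLiteral("200")` returns 200, while `parseByteLiteral("300")` fails with a syntax error. Scaling this idea up to every context-dependent rule of a language like Java is exactly what would make the parser as complex as the rest of the analysis.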