18

I am writing a limited C/C++ code parser. The multiplication and pointer symbols are giving me a really tough time, as both are the same. For example,

int main ()
{
  int foo(X * p); // forward declaration
  bar(x * y);  // function call
}

I need to apply special rules to sort out whether * is indeed a pointer. In the above code, I have to find out that foo() is a forward declaration and bar() is a function call. Real-world code can be a lot more complex. Had there been a different symbol for pointers, like @, it would have been straightforward.

Pointers were introduced in C, so why wasn't a different symbol chosen for them? Was the keyboard so limited?

[It would be a bonus if someone could also shed light on how modern-day parsers deal with this. Keep in mind that in one scope X can be a type name while in another scope it can be a variable name, at the same time.]

iammilind
  • 2,232
  • 4
  • 24
  • 35
  • 2
    In addition to what others have said, it's important to remember that you cannot rely on whitespace. `X*p`, `X * p`, `X* p` and `X *p` all mean exactly the same thing. Hence the interesting mistakes like `char* p, q` where you expect `q` to be `char*` but it really is just plain old `char`. – user Dec 12 '11 at 12:28
  • 2
    It is just a matter of legacy and habits. Remember that C is a very old programming language. Look into Rust or Go for something newer! – Basile Starynkevitch Jun 08 '15 at 12:44
  • Also, when C was invented, computing resources were a thousand times less than today! – Basile Starynkevitch Jun 08 '15 at 12:48
  • At last, C syntax is sufficiently widespread to be easily understood by many developers. Remember that [semantics](https://en.wikipedia.org/wiki/Semantics_%28computer_science%29) is much more difficult than syntax! – Basile Starynkevitch Jun 08 '15 at 12:56
  • @BasileStarynkevitch a thousand! In my day I'd have loved a computer with a thousand times less power than my current PC. When C was invented, I think you're looking at a million times less, with an 18MHz CPU and 64Kb RAM! Word processing was still as responsive as it is today however :-) – gbjbaanb Jun 08 '15 at 13:11
  • 7
    Anyone past the earliest beginner stage would write int ans = (**a) * 3; which isn't confusing at all. Don't complain about the language when it is entirely in your hands to write clear code. – gnasher729 Jun 08 '15 at 13:12
  • There is another _really good_ question/answer aside from the dupe link I posted that goes into more depth on the keyboard and digraphs used in the early C days. I am not sure if it is at Prog.SE or SO though, and I cannot seem to find it. –  Jun 08 '15 at 13:21
  • 2
    I think @MichaelT might have found it: [Why does C use the asterisk for pointers?](http://programmers.stackexchange.com/q/252023/22815) –  Jun 08 '15 at 14:14
  • 1
    To be honest, I really can't tell if the example is ugly *because of the * notation* or because of not using enough/consistent whitespace between operators overall. If you don't like `int ans = **a * 5` then you can always say, `int x = **a; int ans = x * 5;` – Brandin Jun 09 '15 at 12:46
  • @AndresF., your linked question was asked 3 years after this question. Now decide which is the duplicate. Interesting that the accepted answers of both are different. – iammilind Jun 12 '15 at 01:20
  • Syntax is important, and avoiding confusion with other contemporary languages was important -- Pascal used ^ for pointer dereferencing. – ChuckCottrill Jun 12 '15 at 03:38
  • @iammilind Don't take it the wrong way! This isn't something negative about your question. Also, relative age of both questions isn't relevant. I think the accepted answer here is less accurate than one of the answers in the other question ([which comes "straight from the horse's mouth"](http://programmers.stackexchange.com/a/273268/16247)). – Andres F. Jun 12 '15 at 13:40
  • @iammilind A possible compromise would be to close the other question but copy the answer quoting Ken Thompson here (you could post it yourself and accept it) – Andres F. Jun 12 '15 at 15:46
  • @AndresF., your point is valid. Let moderators take a call. – iammilind Jun 12 '15 at 16:37

6 Answers

19

I do not know for sure, but I would be willing to bet that the answer is that the inventors of C ran out (or nearly ran out) of characters.

The original intention was that the language should use symbols for operators, and also that the language must be expressible within the so-called "lower" ASCII table, which consists of ASCII values 0 through 127.

The first 32 values (0 through 31) are unprintable control characters, as is 127 (DEL), so we are left with the space character and these:

!"#$%&'()*+,-./0123456789:;<=>?

@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_

`abcdefghijklmnopqrstuvwxyz{|}~

Remove the space and all characters needed for identifiers (all letters, all digits, and the underscore), and we are left with this:

!"#$%&'()*+,-./:;<=>?@[\]^{|}~

Remove parentheses, braces, and brackets, as well as the backtick (`), which is not sufficiently dissimilar from the single quote, and we are left with this:

!"#$%&'*+,-./:;<=>?@\^|~

Remove the punctuation that is critical to the syntax of the language ("#',.:;?), and we are left with this:

!$%&*+-/<=>@\^|~

Remove the ampersand (&), which stands for "taking the address of" something and must therefore be distinctly different from anything that has to do with pointers, as well as the comparison characters (<, =, >), exclamation (!), plus (+), and minus (-), which are valid operations to perform on pointers, and we are left with this:

$%*/@\^|~

I am not sure why the dollar sign and the at-sign were not considered; perhaps they were thought of as being a bit too bulky to make frequent use of in code. So, take them out, and we are left with these:

%*/\^|~

As you can see, almost all of the above characters are used as operators for various arithmetic and logical operations in C (the backslash being taken by escape sequences and line continuation), which means that one of them had to be reused as the character for declaring and dereferencing pointers.

They chose the asterisk (*), which is a decent choice because it somewhat bears the notion of a "point", and because multiplication is not one of the operations which are valid to perform on pointers.

They could also have chosen the caret (^), which stands for exclusive OR, but I do not think it is such a huge deal, and clearly their options were very limited.

Mike Nakis
  • 32,003
  • 7
  • 76
  • 111
  • 6
    Pascal uses `^` for [pointers](http://www.freepascal.org/docs-html/ref/refse15.html). Kernighan had a [few gripes with that language](http://www.lysator.liu.se/c/bwk-on-pascal.html) and may have wanted to do things differently. –  Jun 08 '15 at 13:40
  • 3
    Many older languages (such as BASIC) used the `$` in variable declarations. Also, it's usually only available on **US** keyboards, although I don't know if this consideration would have been important to the original designers. – Clockwork-Muse Jun 08 '15 at 13:44
  • @Clockwork-Muse The `$` symbol is available on many keyboards because it is the currency symbol of many [popular currencies](http://en.wikipedia.org/wiki/Currency_symbol). –  Jun 08 '15 at 14:09
  • 3
    You might also tweak your set of characters given the [response from Ken Thompson](http://programmers.stackexchange.com/a/273268/40980) to the query (noting that C got it from B): "b was designed to be run with a teletype model 33 teletype. (5 bit baud-o code) so the use of symbols was restricted." - you can see its character set in [Wikimedia](http://en.wikipedia.org/wiki/File:TTY33ASR.jpg). –  Jun 08 '15 at 14:09
  • 1
    While `*` may somewhat look like a "point" it is also a "splat" which means "squished bug" and we all know pointers are involved in many software bugs, or defects. –  Jun 08 '15 at 14:12
  • 5
    If I could change 5 things in C I would change pointer from "*" to "^", assignment from "=" to ":=", equality from "==" to "=", "||" to "or" and "&&" to "and". – Mandrill Jun 08 '15 at 14:23
  • 4
    @Mandrill although what you just said probably qualifies as ***blasphemy***, I would tend to agree with you. We will probably burn in hell for this. – Mike Nakis Jun 08 '15 at 14:55
  • @Mandrill: Some want symbols, some want (english) words. I'm firmly in the proper camp, let the heretics burn for their ignorance. (The only thing both camps probably agree on, no guarantee though, is that the precedence-rules could have been better.) – Deduplicator Jun 08 '15 at 22:58
  • @Deduplicator: An advantage of words over symbols is that one can have more of them. For example, one could define separate tokens for real division as well as the three kinds of integer division, the three kinds of integer modulus/remainder, as well as integer division/modulus operators which request whichever of the above will be fastest in a given scenario. Having separate operators for different corner cases may be a slight nuisance for the compiler writer, but compilers are vastly outnumbered by programs which would otherwise need to handle those corner cases themselves. – supercat Jun 10 '15 at 20:46
  • 1
    Since Pascal used ^ for pointers, that may have been sufficient reason to avoid its use. – ChuckCottrill Jun 12 '15 at 03:40
  • In my opinion, the at-sign would've been the best choice for pointer syntax. It would've better conveyed how pointers work: `int i = 10;` `int @p = &i;` `@p = 15;` The code makes more sense, especially with dereferencing, as the at-sign gives a meaning like "**at** this address". – Nergal Aug 29 '19 at 21:20
18

Yes, the same symbols are being reused, because there was no UTF-32 back then. So you have * in a pointer type, * as the dereference operator, and * as the multiplication operator, and that's just in C. You have a similar problem with "&", for example ("&" as address-of, "&" as bitwise AND, and "&" as part of "&&", logical AND), and others.

The parser differentiates between them based on context.

In your example, you have two different paths in your parser: one that starts with a type (a variable or forward declaration) and one that doesn't (a function call). If there's an ambiguity, you get a compilation error.

If you're parsing a subset of C, you need to make sure you pick the right subset of the grammar to handle this issue.
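
To make the "two paths" idea concrete, here is a minimal sketch (with hypothetical helper names, not any particular compiler's code) of how a recursive-descent parser might dispatch a statement: if the leading identifier is known to name a type, take the declaration path, otherwise take the expression/function-call path.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: a real parser would consult the symbol table
   built up from the declarations seen so far. */
static bool is_known_type(const char *name)
{
    static const char *types[] = { "int", "char", "X" };
    for (size_t i = 0; i < sizeof types / sizeof types[0]; i++)
        if (strcmp(name, types[i]) == 0)
            return true;
    return false;
}

/* Dispatch on the identifier that starts a statement such as "X * p;"
   or "bar(x * y);": a type name sends us down the declaration path,
   anything else down the expression path. */
static void parse_statement(const char *first_ident)
{
    if (is_known_type(first_ident))
        printf("%s ... -> declaration, '*' is part of a pointer type\n", first_ident);
    else
        printf("%s ... -> expression, '*' is multiplication\n", first_ident);
}

int main(void)
{
    parse_statement("X");   /* as in "X * p;"      */
    parse_statement("bar"); /* as in "bar(x * y);" */
    return 0;
}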

littleadv
  • 4,704
  • 27
  • 26
  • 4
    Also the keyboards on many systems were much more limited back then - witness [digraphs and trigraphs](http://en.wikipedia.org/wiki/Digraph_%28computing%29#C). – Péter Török Dec 12 '11 at 10:53
  • Don't forget the useless but possible variant `(int) foo(X * p); // cast without effect on a function call` – user281377 Dec 12 '11 at 11:33
  • 2
    @PéterTörök: I think di- and tri-graphs were introduced to deal with 7-bit character sets that replace various symbols with national symbols (like, say, the Swedish 7-bit set that replaces [\] with ÄÖÅ and {|} with äåö). The actual Swedish keyboards had no fewer keys than US keyboards of that day. – Vatine Dec 12 '11 at 13:03
7

There are two rules in the C BNF that make writing a parser difficult:

if-statement: "if" expression statement |
              "if" expression statement "else" statement

leads to the dangling else problem, where

if a if b c else d

can be parsed as either of

(if-statement (a) (if-statement (b) (c) (d)))
(if-statement (a) (if-statement (b) (c)) (d))

The typical resolution for this conflict is to prefer shift over reduce, attaching the "else" branch to the inner if-statement.

The other, more difficult problem is

typedef-name: identifier

which makes the language context sensitive. This conflict is resolved by omitting the rule in the parser and creating a separate token for typedef-names; for this to work, the scanner must have a table of names that have been declared as typedef.

For C++, the rules are a lot more complex, and it is usually simpler to write an integrated scanner/parser that resolves all identifiers.
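
To make that separate-token idea concrete, here is a minimal sketch of the usual "lexer hack" (hypothetical names, not any particular compiler's code): the scanner reports the very same spelling as a different token once a typedef has registered it.

#include <stdio.h>
#include <string.h>

/* Token kinds the parser distinguishes; TOK_TYPEDEF_NAME exists only
   because the scanner consults the table below. */
enum token_kind { TOK_IDENTIFIER, TOK_TYPEDEF_NAME };

static const char *typedef_names[64];
static int typedef_count = 0;

/* Called once a "typedef ... name;" declaration has been parsed. */
static void register_typedef(const char *name)
{
    if (typedef_count < 64)
        typedef_names[typedef_count++] = name;
}

/* Called by the scanner after it has read an identifier. */
static enum token_kind classify_identifier(const char *name)
{
    for (int i = 0; i < typedef_count; i++)
        if (strcmp(typedef_names[i], name) == 0)
            return TOK_TYPEDEF_NAME;
    return TOK_IDENTIFIER;
}

int main(void)
{
    printf("%d\n", classify_identifier("MyInt")); /* 0: plain identifier   */
    register_typedef("MyInt");                    /* typedef int MyInt;    */
    printf("%d\n", classify_identifier("MyInt")); /* 1: typedef-name token */
    return 0;
}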

Simon Richter
  • 1,568
  • 9
  • 10
  • 2
    It is not "difficult" at all, since the C standard explicitly resolves the ambiguity: `else` always belong to the innermost `if`. – SK-logic Dec 13 '11 at 10:26
4

They're resolved by using symbol tables. The symbol table keeps track of the declarations you've seen, so you can sort out whether or not you're calling a function that's already been declared. When the parser encounters an identifier, it looks it up in the symbol table to see what it is. You will find similar situations for type names, especially since C has multiple namespaces and names can shadow one another. These rules are non-trivial.

typedef int MahInt;
MahInt * p; // declaration or multiplication?

It is not possible to parse C without symbol tables.
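
As an illustration, here is a minimal sketch of such a table (hypothetical names, one flat namespace, no scoping), where the lookup is what finally decides whether the MahInt line above is a declaration or a multiplication:

#include <stdio.h>
#include <string.h>

enum sym_kind { SYM_VARIABLE, SYM_TYPE };

struct symbol { const char *name; enum sym_kind kind; };

static struct symbol table[64];
static int nsyms = 0;

static void declare(const char *name, enum sym_kind kind)
{
    table[nsyms].name = name;
    table[nsyms].kind = kind;
    nsyms++;
}

/* Unknown names count as "not a type" in this toy version. */
static enum sym_kind lookup(const char *name)
{
    for (int i = nsyms - 1; i >= 0; i--)
        if (strcmp(table[i].name, name) == 0)
            return table[i].kind;
    return SYM_VARIABLE;
}

int main(void)
{
    declare("MahInt", SYM_TYPE);  /* typedef int MahInt; */
    declare("x", SYM_VARIABLE);   /* int x;              */
    printf("MahInt * p; -> %s\n",
           lookup("MahInt") == SYM_TYPE ? "declaration" : "multiplication");
    printf("x * p;      -> %s\n",
           lookup("x") == SYM_TYPE ? "declaration" : "multiplication");
    return 0;
}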

As for why it is that way: keyboards at the time were very limited, which is also why digraphs and trigraphs exist to spell symbols that were unusual back then. For example, witness the "WTF" operator:

int x, y;
x ??!??! y;

which is really

x || y
DeadMG
  • 36,794
  • 8
  • 70
  • 139
3

The typedef keyword was a relatively late addition to the C language.

In earlier versions of C, the grammar defined what is or is not a type name fairly straightforwardly. Many type names, such as int, char, and double, were (and still are) single keywords. Other type names included keywords or symbols: struct foo, char *, int[42]. A non-keyword identifier could never be a type name.

When the typedef construct was added to the existing language, it wasn't possible to change the grammar to be able to treat a non-keyword identifier as a type name without either creating ambiguities or breaking existing code. For example, in:

int foo() {
    x*y;
}

x*y could be either an expression statement that multiplies x by y and discards the result, or a declaration of y as a pointer to type x.

One way of looking at it is that a typedef creates a new keyword, one that exists only until the end of the scope in which it's defined. That means that the parser has to look at the symbol table to know how to interpret things (something that's not true for a lot of other languages; for example, in Pascal an identifier can be a type name, and this doesn't introduce an ambiguity).

For example:

int x, y;
int foo() {
     x * y; /* x isn't type name, so this is an expression statement */
     {
         typedef int x;
         x *y; /* Now x is a type name (effectively a keyword),
                  so this is a declaration */
     }
     x * y; /* The type name x is now out of scope,
               so it's an expression statement again */
}

(Treating typedef names as keywords is just one way to look at it; I don't mean to suggest that compilers actually do that internally.)
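
For what it's worth, here is a minimal sketch (made-up helper names, tracking typedef names only) of the scope handling this example relies on: a name acts as a type name only while the scope whose typedef introduced it is still open.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Each typedef name remembers the depth of the scope that introduced it;
   leaving a scope discards the entries that belong to it. */
struct entry { const char *name; int depth; };

static struct entry typedefs[64];
static int count = 0;
static int depth = 0;

static void enter_scope(void) { depth++; }

static void leave_scope(void)
{
    while (count > 0 && typedefs[count - 1].depth == depth)
        count--;
    depth--;
}

static void add_typedef(const char *name)
{
    typedefs[count].name = name;
    typedefs[count].depth = depth;
    count++;
}

static bool is_typedef_name(const char *name)
{
    for (int i = count - 1; i >= 0; i--)
        if (strcmp(typedefs[i].name, name) == 0)
            return true;
    return false;
}

int main(void)
{
    enter_scope();                        /* int foo() {         */
    printf("%d\n", is_typedef_name("x")); /* 0: expression       */
    enter_scope();                        /*   {                 */
    add_typedef("x");                     /*     typedef int x;  */
    printf("%d\n", is_typedef_name("x")); /* 1: declaration      */
    leave_scope();                        /*   }                 */
    printf("%d\n", is_typedef_name("x")); /* 0: expression again */
    leave_scope();                        /* }                   */
    return 0;
}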

Keith Thompson
  • 6,402
  • 2
  • 29
  • 35
1

Consider using GLR parsing: it allows you to postpone the choice between the ambiguous syntactic interpretations until you've got type information. That is how parsers like Elsa work.

Another alternative is to collect and expand the types while parsing; this works well with an ad hoc recursive-descent implementation, and it is the approach used in both GCC and Clang.

There are more possibilities available: you can also use a high-level generator that produces a Packrat parser from a PEG specification. That way you can embed more side-effecting logic into the parsing without having to implement the whole thing manually, as with an ad hoc approach.
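
This is not a real GLR or Packrat implementation, but here is a toy sketch of the postponement idea: the parser keeps an ambiguous node for a statement like x * y; and only commits to one reading once the symbol table can say whether x names a type.

#include <stdbool.h>
#include <stdio.h>

/* Both readings of "x * y;" survive until type information lets us
   discard one; that deferral is the essence of the GLR approach here. */
enum reading { READ_DECLARATION, READ_MULTIPLICATION };

struct ambiguous_stmt {
    const char *leading_name; /* the "x" in "x * y;"          */
    bool undecided;           /* still carrying both readings? */
    enum reading resolved;
};

/* Later pass: by now the symbol table knows whether the name is a type. */
static void resolve(struct ambiguous_stmt *s, bool name_is_a_type)
{
    s->resolved = name_is_a_type ? READ_DECLARATION : READ_MULTIPLICATION;
    s->undecided = false;
}

int main(void)
{
    struct ambiguous_stmt s = { "x", true, READ_MULTIPLICATION };
    /* ... more parsing happens, typedefs are collected ... */
    resolve(&s, false); /* suppose "x" turned out to be a variable */
    printf("x * y; -> %s\n",
           s.resolved == READ_DECLARATION ? "declaration" : "multiplication");
    return 0;
}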

SK-logic
  • 8,497
  • 4
  • 25
  • 37