How are comments usually parsed?

Question

How are comments generally treated in programming languages and markup? I am writing a parser for some custom markup language and want to follow the principle of least surprise, so I'm trying to determine the general convention.

For example, should a comment embedded within a token 'interfere' with the token or not? Generally, is something like:

Sys/* comment */tem.out.println()

valid?

Also, if the language is sensitive to newlines and a comment spans a newline, should that newline count or not? For example, should

stuff stuff /* this is comment
this is still comment */more stuff

be treated as

stuff stuff more stuff

or

stuff stuff
more stuff

?

I know what a few specific languages do, and I am not looking for opinions. What I want to know is whether there is a general consensus on what a markup language is expected to do with comments in regard to tokens and newlines.

My particular context is a wiki-like markup.

Does the newline exist inside of the comment? Why would it be treated any differently from any other character in the comment? — , Oct 29 '15 at 16:53
@Snowman there is that perspective, but on the other hand, suppose token 'x' has special meaning when it's the first token on the line, and it appears to be the first token on the line both to the person looking at the source and to a parser reading line-by-line. That seems like a dilemma, so I asked the question. — Sled, Oct 29 '15 at 17:05
I needed to do this exactly to the spec a while ago and found [gcc's docs](https://gcc.gnu.org/onlinedocs/gcc-3.0.1/cpp_1.html#IDX6) to be an excellent resource. There are some weird corner cases you might not have considered. — Karl Bielefeldt, Oct 29 '15 at 20:19

score 40 · Answer 1 · edited Oct 30 '15 at 21:16

Usually comments are scanned (and discarded) as part of the tokenization process, but before parsing. A comment works like a token separator even in the absence of whitespace around it.

As you point out, the C specification explicitly states that comments are replaced by a single space. It is just specification-lingo though, since a real-world parser will not actually replace anything, but will just scan and discard a comment the same way it scans and discards whitespace characters. But it explains in a simple way that a comment separates tokens the same way a space would.
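As a rough illustration of "scan and discard", here is a minimal sketch of a scanner that skips `/* ... */` comments the same way it skips whitespace while still tracking line and column. The function name `tokenize` and the simplistic notion of a token (any run of characters that is neither whitespace nor a comment) are assumptions made for this example, not part of any real lexer.

```python
def tokenize(src):
    """Return (text, line, column) triples; whitespace and /* */ comments separate tokens."""
    tokens = []
    i, line, col = 0, 1, 1
    n = len(src)
    while i < n:
        if src.startswith("/*", i):
            # Comment: find its end and skip it, producing no token.
            end = src.find("*/", i + 2)
            if end == -1:
                raise SyntaxError(f"unterminated comment at line {line}, column {col}")
            stop = end + 2
        elif src[i].isspace():
            # Whitespace: skipped exactly like a comment.
            stop = i + 1
        else:
            # Token: runs until whitespace or the start of a comment.
            stop = i
            while stop < n and not src[stop].isspace() and not src.startswith("/*", stop):
                stop += 1
            tokens.append((src[i:stop], line, col))
        # Advance the position counters over everything consumed,
        # so later tokens keep correct source positions.
        for ch in src[i:stop]:
            if ch == "\n":
                line += 1
                col = 1
            else:
                col += 1
        i = stop
    return tokens
```

With this sketch, `Sys/* comment */tem` scans as the two tokens `Sys` and `tem`, just as `Sys tem` would.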

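The content of a comment is ignored, so in most languages line breaks inside multiline comments have no effect. Languages which are sensitive to line breaks (such as Python and Visual Basic) usually do not have multiline comments. Under that convention,

return /*
       */ 17

is equivalent to

return 17

not

return
17

JavaScript, which has both multiline comments and automatic semicolon insertion, is a notable exception: the ECMAScript specification treats a multiline comment that contains a line terminator as a line terminator, so in JavaScript this example actually behaves like the two-line form.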

Single-line comments preserve the line break, i.e.

return // single line comment
    17

is equivalent to

return
17

not

return 17
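The two conventions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `logical_lines` is made up, string literals are not handled, and it models the convention described here rather than the grammar of any particular language.

```python
import re

# A block comment (possibly spanning lines) or a line comment up to,
# but not including, the newline that ends it.
COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)

def logical_lines(src):
    """Drop comments, then return the remaining non-empty lines with whitespace normalized."""
    # Block comment -> one space (its inner newlines vanish);
    # line comment -> nothing (the newline after it stays in the text).
    stripped = COMMENT.sub(lambda m: " " if m.group(0).startswith("/*") else "", src)
    return [" ".join(line.split()) for line in stripped.splitlines() if line.strip()]

print(logical_lines("return /*\n       */ 17"))                # one logical line
print(logical_lines("return // single line comment\n    17"))  # two logical lines
```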

Since comments are scanned but not parsed, they tend not to nest. So

 /*  /* nested comment */ */

is a syntax error, since the comment is opened by the first /* and closed by the first */, which leaves the trailing */ outside the comment.
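The difference between non-nesting and nesting comments comes down to whether the scanner keeps a depth counter. The sketch below shows both behaviours; `skip_block_comment` and its `nested` flag are hypothetical names chosen for this example.

```python
def skip_block_comment(src, start, nested=False):
    """Return the index just past the block comment that opens at src[start]."""
    if not src.startswith("/*", start):
        raise ValueError("no comment opener at this position")
    depth = 1
    i = start + 2
    while i < len(src):
        if src.startswith("*/", i):
            depth -= 1
            i += 2
            if depth == 0:
                return i
        elif nested and src.startswith("/*", i):
            # Only a nesting-aware scanner counts inner openers.
            depth += 1
            i += 2
        else:
            i += 1
    raise SyntaxError("unterminated comment")

src = "/*  /* nested comment */ */"
print(repr(src[skip_block_comment(src, 0):]))               # leftover text after a non-nesting scan
print(repr(src[skip_block_comment(src, 0, nested=True):]))  # leftover text after a nesting scan
```

In the non-nesting case the leftover ` */` is what the parser later rejects as a syntax error.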

edited Oct 30 '15 at 21:16 by Sled
answered Oct 29 '15 at 17:07 by JacquesB

In most languages in-line comments (`/* like this */`) are considered equal to a single whitespace, and an EOL-terminated comment (`// like this`) to a blank line. – 9000 Oct 29 '15 at 17:28
@JacquesB so I'm thinking of treating comments as being replaced in their entirety from the source as a [zero-width space](https://en.wikipedia.org/wiki/Zero-width_space), that seems to be equivalent to what you are suggesting. – Sled Oct 29 '15 at 18:02
@artb an ordinary space should work just fine, and lies in the ASCII code page. – John Dvorak Oct 29 '15 at 18:43
@JanDvorak a space will affect appearance; a zero-width space is closer to the semantics of "a comment isn't really there". The primary rendering output will be HTML, so in my case ASCII is not an issue, as browsers support Unicode. That said, I believe the C standard mandates that comments are replaced with a single space. – Sled Oct 29 '15 at 18:53
@artb if you find the spec, you can quote it in an answer. – John Dvorak Oct 29 '15 at 18:54
@ArtB: Yeah it makes sense to treat a comment as semantically equivalent to a space character. You won't literally replace it in the stream though, as you still want the scanner to track line and column position correctly for all tokens, so you can provide error messages etc. with correct source position further in the process. – JacquesB Oct 29 '15 at 19:03
@JanDvorak From the [C89 draft](http://port70.net/~nsz/c/c89/c89-draft.txt) in `2.1.1.2` item #3 _"Each comment is replaced by one space character."_ But this is C-specific and this question is for the most common convention across languages and format. – Sled Oct 29 '15 at 19:05
In pre-Standard C, some implementations treated `ident/* comment */ifier` as a single token. – dan04 Oct 29 '15 at 20:39
Some languages, notably Racket, do have nested multi-line comments: `(define x #| this is #| a sub-comment |# the main comment |# 3) x` yields `3`. – wchargin Oct 30 '15 at 05:13
[There](http://dlang.org/lex.html#Comment) [_are_](http://doc.rust-lang.org/reference.html#comments) [languages](https://www.haskell.org/onlinereport/lexemes.html) [with](http://scala-lang.org/files/archive/spec/2.11/01-lexical-syntax.html#whitespace-and-comments) [nestable](http://rigaux.org/language-study/syntax-across-languages/Vrs.html#VrsCmmnt) [comments](https://en.wikipedia.org/wiki/Comparison_of_programming_languages_%28syntax%29#Block_comments). – Michał Politowski Oct 30 '15 at 13:41
VB(A) *does* allow multiline comments; an *instruction* lives in a *logical line*, which corresponds to one or more *actual* lines of code in the code file. It's ugly as hell though. – Mathieu Guindon Oct 31 '15 at 03:00

score 9 · Answer 2 · edited Oct 29 '15 at 19:18

To answer the question:

is there a general consensus what is generally expected by a mark up?

I would say no one would expect a comment embedded inside a token to be legal.

As a general rule of thumb, comments should be treated the same as whitespace. Any place where extraneous whitespace is valid should also allow an embedded comment. The only exception would be strings:

trace("Hello /*world*/") // should print Hello /*world*/

It would be quite odd to support comments inside strings, and would make escaping them tedious!
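One common way to respect this exception is to match string literals and comments in the same pass, so that whichever starts first wins. The sketch below assumes double-quoted strings with backslash escapes and `/* ... */` comments only; `blank_comments` is a hypothetical helper name.

```python
import re

# Either a double-quoted string (with backslash escapes) or a block comment.
# The regex engine takes whichever construct begins first in the text.
STRING_OR_COMMENT = re.compile(r'"(?:\\.|[^"\\])*"|/\*.*?\*/', re.DOTALL)

def blank_comments(src):
    """Replace each comment with one space, leaving string literals untouched."""
    def repl(match):
        text = match.group(0)
        # A match starting with a quote is a string: keep it verbatim.
        return text if text.startswith('"') else " "
    return STRING_OR_COMMENT.sub(repl, src)

print(blank_comments('trace("Hello /*world*/") /* a real comment */'))
print(blank_comments('a/* "quoted" text */b'))
```

A plain search for comment delimiters, without the string alternative, would wrongly cut `/*world*/` out of the string.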

edited Oct 29 '15 at 19:18 by 8bittree
answered Oct 29 '15 at 17:12 by Connor Clark

Never thought about strings, that's a good edge case. My current thought was doing simple regex between comment start and end and replacing it with a single space. That would have tripped up your case. – Sled Oct 29 '15 at 17:39
+1 for that bit about escaping strings. Although, in your example, I would generally expect it to print `Hello /* world*/!` rather than suppressing the comment delimiters. Also, welcome to Programmers! – 8bittree Oct 29 '15 at 17:42
Thanks 8bittree! And that's totally what I meant. Funnily enough, I also need to escape the \*\* in my answer.... – Connor Clark Oct 29 '15 at 18:52
@ArtB in general, "parsing by substitution" gets very tricky down the road with edge cases and interaction with other features, and is best avoided from the beginning. – hobbs Oct 30 '15 at 01:08

score 7 · Answer 3 · answered Oct 29 '15 at 17:25

In whitespace-insensitive languages, ignored characters (i.e. whitespace or characters that are part of a comment) delimit tokens.

So, for example, Sys tem is two tokens, while System is one. The usefulness of this might be more apparent if you compare new Foo() and newFoo(), one of which will construct an instance of Foo while the other calls newFoo.

Comments can play the same role as a run of whitespace, e.g. new/**/Foo() works the same as new Foo(). Of course this can be more complex, e.g. new /**/ /**/ Foo() or whatnot.

Technically, it should be possible to allow comments within identifiers, but I doubt it is particularly practical.

Now, what of white-space sensitive languages?

Python comes to mind and it has a very simple answer: no block comments. You start a comment with # and then the parser works exactly as though the rest of the line didn't exist but were just a newline instead.

In contrast to that, jade allows for block comments, where the block ends when you get back to the same indentation level. Example:

body
  //-
    As much text as you want
    can go here.
  p this is no longer part of the comment
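An indentation-delimited comment like this can be skipped line by line by remembering the indentation of the comment marker. The following is only a sketch of the idea, not how Jade itself is implemented; the function name `strip_indented_comments` and the space-only indentation are assumptions.

```python
def strip_indented_comments(lines, marker="//-"):
    """Drop each marker line plus all following lines indented deeper than it."""
    out = []
    comment_indent = None  # indentation of the marker while inside a comment block
    for line in lines:
        indent = len(line) - len(line.lstrip(" "))
        if comment_indent is not None:
            if not line.strip() or indent > comment_indent:
                continue  # blank or deeper-indented line: still inside the comment
            comment_indent = None  # back at the marker's level: the comment has ended
        if line.strip().startswith(marker):
            comment_indent = indent
            continue
        out.append(line)
    return out

source = [
    "body",
    "  //-",
    "    As much text as you want",
    "    can go here.",
    "  p this is no longer part of the comment",
]
print(strip_indented_comments(source))
```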

So in this realm, I don't think there is a single usual way of handling things. One commonality seems to be that a comment always ends at an end-of-line, which means that all comments act exactly like newlines.

Hmm, the newline is the real issue since we are using HTML/XML syntax for comments, so they will be multi-line. — Sled, Oct 29 '15 at 17:40
@ArtB If you're using HTML/XML syntax, it might be wise to simply use their behavior. — 8bittree, Oct 29 '15 at 17:46
@8bittree makes sense, should have thought of that. I'll leave the question as is since it will be more useful this way. — Sled, Oct 29 '15 at 18:00
Early C used `/**/` to concatenate identifiers within macros, but later the `##` operator was added. — yyny, Jan 29 '21 at 12:57

score 3 · Answer 4 · answered Oct 30 '15 at 05:34

In the past I've turned comments into a single token as a part of the lexical analysis. The same goes for strings. From there, life is easy.

In the specific case of the last parser I built, an escape rule is passed to the top level parse routine. The escape rule is used to handle tokens such as comment tokens inline with the core grammar. In general, these tokens were discarded.

A consequence of doing it this way is that, in the example you posted with a comment in the middle of an identifier, the identifier would not be a single identifier. This is the expected behaviour in all languages (from memory) that I've worked with.

The case of a comment within a string should be implicitly handled by the lexical analysis. The rules to handle a string have no interest in comments, and as such the comment is treated as the contents of the string. The same applies to a string (or quoted literal) within a comment - the string is a part of a comment, which is explicitly a single token; the rules for processing a comment have no interest in strings.
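A minimal sketch of this "comments and strings are each a single token" approach follows, using one regular expression with named alternatives. The token kinds, the pattern for words and the function name `lex` are assumptions made for the example, not the parser described in this answer.

```python
import re

# Each alternative is one token kind. At any position only one rule can start,
# so a quote inside a comment (or comment syntax inside a string) is simply
# consumed as part of the token already being matched.
TOKEN = re.compile(r'''
    (?P<COMMENT>/\*.*?\*/)
  | (?P<STRING>"(?:\\.|[^"\\])*")
  | (?P<WORD>[A-Za-z_][A-Za-z0-9_.]*)
  | (?P<PUNCT>[^\sA-Za-z_])
  | (?P<SPACE>\s+)
''', re.DOTALL | re.VERBOSE)

def lex(src, keep_comments=False):
    """Return (kind, text) pairs; whitespace is dropped, comments optionally kept."""
    tokens = []
    for match in TOKEN.finditer(src):
        kind = match.lastgroup
        if kind == "SPACE" or (kind == "COMMENT" and not keep_comments):
            continue
        tokens.append((kind, match.group(0)))
    return tokens

code = ('console.log(/*a comment containing "quotes" is possible*/ '
        '"and a string containing /*slash-star, star-slash*/ is possible")')
for token in lex(code, keep_comments=True):
    print(token)
```

The lexer never has to decide whether it is "inside" a comment or a string: whichever construct begins first consumes everything up to its own terminator.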

I hope that makes sense/helps.

So if you have code such as `console.log(/*a comment containing "quotes" is possible*/ "and a string containing /*slash-star, star-slash*/ is possible")`, where there’s quotes in a comment and comment syntax in a string, how would the lexer know to tokenize it correctly? Can you please edit your answer, providing a general description of those cases? — chharvey, May 27 '19 at 01:41

score 2 · Answer 5 · answered Dec 26 '17 at 10:02

It depends on the purpose of your parser. If you write a parser to build a parse tree for compiling, then a comment has no semantic value besides potentially separating tokens (e.g. `method/*comment*/(/*comment*/)`). In this case, it is treated like whitespace.

If your parser is part of a transpiler translating one source language into another, or if your parser is a preprocessor that takes a compilation unit in a source language, parses it, modifies it, and writes the modified version back in the same source language, then comments, like anything else, become very important.

Also, if you have meta information in comments, or you care specifically about comments, as when generating API documentation the way JavaDoc does, comments are all of a sudden very important.

Here comments are often attached to the tokens themselves: when you find a comment, you attach it to a token. Since a token can have multiple comments before and after it, how to handle those comments again depends on the purpose.

The idea of annotating non-comment tokens with comments is to remove comments from the grammar altogether.
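One possible shape for this is sketched below: each token carries the comments that precede it, so the grammar only ever sees non-comment tokens. The `Token` class, the `leading_comments` field and the choice to attach comments to the following token are all assumptions for illustration; real tools differ in how they assign comments.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Token:
    text: str
    leading_comments: list = field(default_factory=list)

# A block comment, a run of whitespace, an identifier, or any other single character.
PIECE = re.compile(r"/\*.*?\*/|\s+|[A-Za-z_]\w*|.", re.DOTALL)

def lex_with_comments(src):
    """Return (tokens, trailing_comments); each token holds the comments before it."""
    tokens, pending = [], []
    for match in PIECE.finditer(src):
        text = match.group(0)
        if text.startswith("/*"):
            pending.append(text)  # hold the comment until the next real token
        elif text.isspace():
            continue
        else:
            tokens.append(Token(text, pending))
            pending = []
    return tokens, pending  # comments after the last token are returned separately

tokens, trailing = lex_with_comments("method/*comment*/(/*comment*/)")
for token in tokens:
    print(token)
```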

Once you have the parse tree, some AST implementations unpack the comments, representing each comment as its own AST element that is attached to another AST element in addition to the usual contains-relationship. It is a good idea to look at the parser/AST implementations for source languages in open-source IDEs.

One very good implementation is the Eclipse compiler infrastructure for the Java language. They preserve comments during tokenization and represent comments within the AST - as far as I remember. Also, this parser/AST implementation preserves formatting.
