26

A few do, but not any of the popular ones as far as I know. Is there something bad about nesting comments?

I plan to have block comments nest in the (small) language I'm working on, but I would like to know if this is a bad idea.

Craige
amara
  • re a few answers: ohh, that makes sense =) I'm totally doing nested block comments then; although I do have a separate lexing stage, it's not the limiting sort SK-logic described. –  Jun 02 '11 at 11:05
  • @Vuntic: If you have a separate lexing stage that uses stuff more complicated than regular expressions, you may have performance issues. REs are fast and easy to use by implementing DFAs. – David Thornley Jun 02 '11 at 16:18
  • It catches more errors earlier to not allow nesting –  Jun 02 '11 at 16:21
  • 4
    @David: ...not at all. It's actually really fast. – amara Jun 02 '11 at 20:28
  • I would suggest that if you want to allow nested comments, you allow start-comment tags to be marked with a token, and require that if a start-comment tag is thus marked, its end-comment tag must be marked identically. That would allow unbalanced start/end tags to be quickly identified, and avoid the possibility of bugs caused by undetected unbalanced tags. – supercat May 16 '14 at 20:40

11 Answers

20

Because most implementations use separate lexing and parsing stages, and for lexing they use plain old regular expressions. Comments are treated as whitespace, i.e., ignored tokens, and thus must be resolved entirely in the lexing pass. The only advantage of this approach is parsing speed. Its numerous disadvantages include severe limitations on syntax (e.g., the need to maintain a fixed, context-independent set of keywords).

SK-logic
  • 3
    I would disagree on 'most' nowadays. Certainly that's the traditional way, but I know that for C, EDG combines the preprocessor, lexing and parsing, and I suspect that both GCC and Microsoft do too. The benefit is that it allows you to implement them separately if you need to. – Andrew Aylett Jun 02 '11 at 09:56
  • Clang is doing the same, too. But that's still only a tiny proportion of the existing popular languages compilers. – SK-logic Jun 02 '11 at 10:00
  • @Neil Butterworth, take a look at mcs, javac, gcc (yep, it back-patches a lexer, but still it is a dedicated lexing pass), clang (same as gcc), dmd, fpc, and many, many more. – SK-logic Jun 04 '11 at 23:09
  • No one is using regular expressions in their lexing for any non-trivial compiler. – Nuoji Jan 10 '19 at 11:47
  • @Nuoji - for the non-trivial - sure. But those who rely on flex and similar tools do. – SK-logic Jan 11 '19 at 15:36
  • The question was why "most programming languages" don't nest. That would preclude the most trivial. And even with something like lex it's straightforward to do. – Nuoji Jan 11 '19 at 20:06
  • @Nuoji, read specs (where they actually do exist) for most programming languages - the tokens *are* defined using a regular grammar. Like it or not, that's what people do (for no good reason whatsoever). In fact, the very idea of a dedicated lexing pass cannot be justified. – SK-logic Jan 13 '19 at 13:56
  • @SK-logic Here in this answer there is an example of lexing nested comments for lex: https://stackoverflow.com/questions/34493467/how-to-handle-nested-comment-in-flex. It's exceedingly easy. Even easier if you write your own lexer – which most people do, even though you claim otherwise. – Nuoji Jan 13 '19 at 19:13
  • @Nuoji, many compilers actually have a dedicated lexing pass, and that's a fact. Whether most compilers behave like that or only a minority is irrelevant, I think. We're not here to discuss how many language compilers are implemented a certain way, but SK-logic has provided the best reason here why comments cannot nest in many languages. Other than that, if other languages disallow nested comments for design reasons, I'm okay with that. But the reason provided here is meaningful and I don't understand why you're wasting your time arguing against it. – Davide Cannizzo Jul 08 '21 at 19:33
  • @DavideCannizzo a dedicated lexing pass does not imply that the lexer is using pure regular expressions for parsing. Yes, flex etc uses regex to describe certain elements (e.g. numbers), but for comments - even non nesting - state variables are typically used to track the state. This is all very simple and fundamental as you know. Nesting comments are perfectly easy to implement this way. Consequently neither handwritten nor generated lexers have any problem with nested comments contrary to SK's claim which is simply incorrect. – Nuoji Aug 10 '21 at 13:50
  • 1
    @Nuoji that'd be a dirty hack - constructing some special kind of a state machine for only certain lexemes - all for what? To keep doing things the way Dragon Book prescribed? There is no point in it when you can just go lexerless and use exactly the same mechanism for parsing tokens and token streams. If you're lazy you'll use flex and alike and be restricted with it, otherwise you won't have a lexer at all. – SK-logic Aug 10 '21 at 16:02
  • @Nuoji, I happen to agree with SK-logic. It's true, regex cannot handle nesting without overcomplicated recursions, and many lexers that are actually used leverage regular expressions. However, there's no reason why you'd have to write a lexer at all, which should be avoided both for performance reasons and to avoid keeping state in both of them, which makes the compiler also less maintainable as more changes will need modifications on both the lexer and the rest. – Davide Cannizzo Aug 10 '21 at 16:30
  • @SK-logic and Davide, I have already provided a link of how to do it in flex. Why do you insist pretending that flex can't handle it? – Nuoji Aug 11 '21 at 11:58
  • Also, your comments make me wonder what sort of compilers you've written (if any). And one final amusing tidbit: in Crafting Interpreters, the first chapter homework is implementing nested comments: http://craftinginterpreters.com/scanning.html too bad that this is horribly hard, right? – Nuoji Aug 11 '21 at 12:15
  • @Nuoji I suggest that you read about *modern* parsing techniques. Forget all that flex crap and dirty hacks. We're in 21st century, we must use PEG or GLR, not stupid lexers with hacks. Also, I would not use "Crafting Interpreters" as an authoritative source on a state of the art. It's a nice introduction for beginners, but it's awfully outdated. As for what you can do when you ditch lexers, see this: https://github.com/combinatorylogic/clike – SK-logic Aug 11 '21 at 13:32
  • Again, your claim is that languages do not have nested comments because it's too hard. I wonder what "the state of the art of lexing" is going to prove. Either it cannot handle nested comments (which handcrafted lexers can and thus "state of the art" is ludicrous) or it can, and then it doesn't support your claim. Your choice. – Nuoji Aug 20 '21 at 18:43
  • I've run into combinatorylogic from Reddit on several occasions. Is that you? – Nuoji Aug 20 '21 at 19:02
  • In reality few compilers actually use what @SK-logic terms "*modern* parsing techniques" and simply use handwritten parsers and lexers: https://notes.eatonphil.com/parser-generators-vs-handwritten-parsers-survey-2021.html – Nuoji Aug 21 '21 at 21:13
  • @Nuoji they do indeed, since most of the compilers are older than the first paper on PEGs. It's worth mentioning, though, that quite a few are lexerless, and that PEG is, essentially, just syntactic sugar for describing recursive descent parsers, i.e., exactly what most of those hand-written parsers are. – SK-logic Aug 30 '21 at 15:07
  • But you're saying PEG parsers can't handle nested comments? Since that is your explanation why programming languages don't nest comments? – Nuoji Aug 30 '21 at 20:55
  • @Nuoji no, I'm saying that in recursive descent *lexerless* parsing, handling comments (and any other language context switching in general) is trivial. Sadly, a lot of older compilers are written as a combination of a regexp-based lexer and a recursive descent parser. Only recently have people started shifting to using the same recursive descent for both lexing and parsing (rendering it, effectively, lexerless). And when you use PEG to express recursive descent parsing, a lexerless implementation is the most natural. – SK-logic Aug 31 '21 at 07:37
  • @SK-logic what older compilers would that be? You are just making this up as you go along aren't you? It is established that lex/flex can handle nested comments fine, a handwritten lexer can do it as well. Please provide examples of multiple popular languages written in the last 40 years that had implementations with lexers that wouldn't be able to handle nested comments. – Nuoji Aug 31 '21 at 11:53
  • @Nuoji look at any language specification and see that tokens are defined as regular grammars. For pretty much all the languages that have such a spec. And this is exactly for this reason - when they were defined, people only used regular expressions for lexing. – SK-logic Aug 31 '21 at 12:10
  • Sigh. With that argument C would not even exist since it cannot be defined as a context free grammar. And let's not get started with Fortran which can't even be expressed with lex. You are just spouting made-up facts at this point. – Nuoji Aug 31 '21 at 17:20
  • Google "fortran lexer problem" for an introduction to the issues. Most compiler books tend to mention it if you bother to read them. – Nuoji Aug 31 '21 at 17:27
  • All C *lexemes* can be represented as regular expressions. Pay attention, we're not talking about the whole grammar here. Same thing for Fortran. Nobody is forcing you to parse numeric literal as a SINGLE token. – SK-logic Sep 01 '21 at 06:54
12

It's perfectly possible to make a lexer that can handle nested comments. When it's eating whitespace, when it sees /* it can increment a depth counter, and decrement it when it sees */, and stop when the depth is zero. That said, I've done many parsers, and never found a good reason for comments to nest.
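
As a rough sketch of the depth-counter approach described above (written in Python for brevity; the function name `skip_nested_comment` and the choice to raise `SyntaxError` on an unterminated comment are illustrative assumptions, not taken from any particular lexer):

```python
def skip_nested_comment(src: str, start: int) -> int:
    """Return the index just past the nested comment that opens at `start`.

    Hypothetical helper: assumes C-style /* ... */ delimiters that nest.
    """
    if not src.startswith("/*", start):
        raise ValueError("no comment opener at this position")
    depth = 0
    i = start
    while i < len(src):
        if src.startswith("/*", i):
            depth += 1          # one more level opened
            i += 2
        elif src.startswith("*/", i):
            depth -= 1          # one level closed
            i += 2
            if depth == 0:      # back at the top level: comment is done
                return i
        else:
            i += 1              # ordinary comment text, skip it
    raise SyntaxError(f"unterminated comment ({depth} level(s) still open)")


src = "a /* x /* y */ z */ b"
end = skip_nested_comment(src, 2)
print(repr(src[end:]))  # the text after the whole nested comment
```

The only state carried here is a single integer, so a full stack is not needed as long as every opener and closer looks the same.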

If comments can nest, then a downside is it's easy to get their ends unbalanced, and unless you have a fancy editor, it can invisibly hide code you assume is there.

An upside of comments that don't nest is something like this:

/*
some code
more code
blah blah blah
/**/

where you can easily comment the code in or out by removing or adding the first line - a 1-line edit. Of course, if that code itself contains a comment, this would break, unless you also allow C++-style // comments in there. So that's what I tend to do.

Mike Dunlavey
  • 1
    `//` comments are also C99-style. – JAB Jun 02 '11 at 20:58
  • 1
    Alternatively, a language could specify that a start-of-comment is `/*$token`, where `token` is any alphanumeric token, and an end-of-comment is `token$*/`. It would be relatively simple for a tokenizer to include code to verify that every end-comment mark contains the proper token for its matching start-comment block. – supercat May 16 '14 at 20:36
  • A lexer which keeps a stack is not really a lexer anymore, it is a parser. – JacquesB Aug 11 '21 at 13:16
  • The counter argument (easy to create unbalanced comments) makes no sense. The comments are part of the language, the compiler can and should bark on unbalanced comments just like on unbalanced braces or parentheses. – Martin Maat Aug 22 '21 at 06:53
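
The tagged-delimiter idea from the comments above could be checked along these lines. This is only a sketch: the delimiter shapes `/*$name` and `name$*/` follow supercat's suggestion, while the function name `check_tagged_comments`, the regular expression, and the error messages are assumptions made for illustration.

```python
import re

# Hypothetical delimiters: "/*$name" opens a comment, "name$*/" closes it.
DELIM = re.compile(r"/\*\$(\w+)|(\w+)\$\*/")


def check_tagged_comments(src: str) -> None:
    """Raise SyntaxError if tagged comment delimiters are unbalanced or mismatched."""
    stack = []  # tags of the comments that are currently open
    for m in DELIM.finditer(src):
        opener, closer = m.group(1), m.group(2)
        if opener is not None:
            stack.append(opener)
        elif not stack:
            raise SyntaxError(f"end-comment '{closer}' has no matching start")
        elif stack[-1] != closer:
            raise SyntaxError(
                f"end-comment '{closer}' does not match start-comment '{stack[-1]}'"
            )
        else:
            stack.pop()
    if stack:
        raise SyntaxError(f"comment '{stack[-1]}' is never closed")


check_tagged_comments("/*$a x /*$b y b$*/ z a$*/")  # balanced: no error
```

Because each closer names the opener it belongs to, a mismatch is reported where it happens instead of at the end of the file.
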
6

Since nobody else mentioned it, I'll list a few languages that do support nested comments: Rexx, Modula-2, Modula-3, Oberon. Despite all the complaints here about difficulty and speed issues, none of those seem to have any huge problems.

Rugxulo
6

A good point of nesting block comments is that you can comment out large portions of code easily (well, almost, unless you have the block comment end sequence in a string constant).

An alternative method is to prepend the line comment start sequence to a bunch of lines, if you have an editor that supports it.

Haskell has nested block comments, but most people don't seem to notice or complain about it. I guess this is because people who do not expect nested comments tend to avoid them, as this would be a lexical error in other languages.

Ingo
5

One thing nobody's mentioned yet, so I'll mention it: The desire to nest comments often indicates that the programmer is Doing It Wrong.

First, let's agree that the only time "nesting" or "not nesting" is visible to the programmer is when the programmer writes something structurally like this:

do_something();
/* comment /* nested comment */ more comment */
do_something_else();

Now, when does such a thing come up in practice? Certainly the programmer isn't going to be writing nested comments that literally look like the above snippet! No, in practice when we nest comments (or wish we could nest them), it's because we want to write something like this:

do_something();  /* do a thing */
/* [ajo] 2017-12-03 this turned out to be unnecessary
do_something_else(); /* do another thing */
*/

And this is BAD. This is not a pattern we (as language designers) want to encourage! The correct way to write the above snippet is:

do_something();  /* do a thing */

That "wrong" code, that false start or whatever it was, doesn't belong in the codebase. It belongs, at best, in the source-control history. Ideally, you'd never even write the wrong code to begin with, right? And if the wrong code was serving a purpose there, by warning maintainers not to reinstate it for some reason, well, that's probably a job for a well-written and intentional code comment. Trying to express "don't do X" by just leaving in some old code that does X, but commented out, is not the most readable or effective way to keep people from doing X.

This all boils down to a simple rule of thumb that you may have heard before: Don't comment out code. (Searching for this phrase will turn up a lot of opinions in agreement.)

Before you ask: yes, languages such as C, C#, and C++ already give the programmer another tool to "comment out" large blocks of code: #if 0. But this is just a particular application of the C preprocessor, which is a large and useful tool in its own right. It would actually be extremely difficult and special-casey for a language to support conditional compilation with #if and yet not support #if 0.


So, we've established that nested comments are relevant only when the programmer is commenting out code; and we've established (via consensus of a lot of experienced programmers) that commenting out code is a Bad Thing.

To complete the syllogism, we must accept that language designers have an interest in promoting Good Things and discouraging Bad Things (assuming that all else is equal).

In the case of nested comments, all else is equal — you can safely ignore the low-voted answers that claim that parsing nested /* would somehow be "difficult" for the parser. (Nested /* are no harder than nested (, which just about every parser in the world already needs to handle.)

So, all else being equal, should a language designer make it easy to nest comments (i.e., to comment out code), or difficult? Recall that commenting out code is a Bad Thing.

Q.E.D.


Footnote. Notice that if you don't allow nested comments, then

hello /* foo*/bar.txt */ world

is a misleading "comment" — it's equivalent to

hello bar.txt */ world

(which is likely a syntax error). But if you do allow nested comments, then

hello /* foo/*.txt */ world

is a misleading "comment" — it's equivalent to

hello

but leaves the comment open all the way to the end of the file (which again is almost certainly a syntax error). So neither way is particularly less prone to unintentional syntax errors. The only difference is in how they handle the intentional antipattern of commented-out code.
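
The two cases in the footnote can be reproduced with a small comment stripper that supports both policies. This is a sketch under stated assumptions: `strip_comments` and its `nested` flag are hypothetical names, and string literals and `//` comments are deliberately ignored.

```python
def strip_comments(src: str, nested: bool) -> str:
    """Remove /* ... */ comments, treating them as nesting or non-nesting.

    Illustrative only: no string literals, no // comments.
    """
    out = []
    depth = 0
    i = 0
    while i < len(src):
        # An opener counts at top level always, and inside a comment only if nesting.
        if src.startswith("/*", i) and (depth == 0 or nested):
            depth += 1
            i += 2
        elif depth > 0 and src.startswith("*/", i):
            depth -= 1
            i += 2
        elif depth > 0:
            i += 1                  # comment text is discarded
        else:
            out.append(src[i])      # ordinary source text is kept
            i += 1
    if depth > 0:
        raise SyntaxError("comment still open at end of input")
    return "".join(out)


# Non-nesting: the comment ends early and a stray "*/" is left in the code.
print(strip_comments("hello /* foo*/bar.txt */ world", nested=False))

# Nesting: the inner "/*" opens a second level that is never closed.
try:
    strip_comments("hello /* foo/*.txt */ world", nested=True)
except SyntaxError as err:
    print("error:", err)
```

Either policy turns one of the two inputs into an error, which is the footnote's point.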

Quuxplusone
  • 9
    I have a different opinion based on a simple fact -- I haven't seen everything (and neither have you). So while those golden rules like "Don't comment out code" look nice, life has its own paths. In this particular case, I do it very often as a switch: when I test some new feature and have to incrementally introduce some code, I comment out the code, then less, less, less, and finally I have a working piece and I can remove all the comments (over code). My perfect language will of course support nested comments :-). – greenoldman Dec 04 '17 at 16:07
  • @greenoldman: Most languages don't have nestable comments, but they'll have some actual feature for "remove a block of code" which is less-used than the "leave a comment" feature. C's `#if DEAD` is the canonical and best-designed example. In many languages you can just wrap the dead code in the equivalent of `if (DEAD)`. And in many IDEs, you can *actually remove the dead code* and rely on Ctrl+Z and/or version control to get it back if you want it. Leaving a comment, docstring, whatever, whose text is a bunch of dead code, is still the *worst* option for readability. – Quuxplusone Dec 04 '17 at 19:10
  • 4
    Believe it or not, most languages are **not** this opinionated. – user253751 Aug 13 '21 at 12:42
  • 1
    I also disagree about not commenting code. It's very useful in development and has occasional uses in production. (Mostly of the this-obvious-solution-doesn't-work as a warning not to clean up what looks like bad code.) In C# it's not a problem--just use // type comments. – Loren Pechtel Aug 22 '21 at 00:57
  • 2
    It might be a bad idea to comment out code, but that is not the *reason* comments don't nest in C-derived languages. The reason is that comments are discarded in the lexing stage and lexers were traditionally implemented as a finite state machine, which does not support nested structures. This was to keep the scanner and compiler simple and fast, which was important on the limited hardware at the time. Probably not important today, but many languages with C-derived syntax have kept the syntax for familiarity. OTOH several newer languages do support nested comments. – JacquesB Aug 22 '21 at 10:54
  • Nested parentheses are matched at the *parsing* stage according to the grammar. You can't do that with nested comments, since the content of the comment is not valid syntax in the language. You *will* have to special case parsing of comments. – JacquesB Aug 22 '21 at 11:09
  • This answer is terrible because "first let's agree on ..." and then something ridiculous that I definitely don't agree on. It's a strawman argument. I do agree that the terribly formed comments written above are bad, but I don't agree that's what programmers structurally use nested block comments for. Nested block comments are useful when you have some code for debugging or other experimentation. You comment out a whole chunk of code spanning many lines, which itself may contain a comment. That's what it's for. Not the silly strawman described above. – Edward Ned Harvey Aug 21 '23 at 15:06
2

Supporting nested block comments complicates the parser, which means more work and could increase compile time. I guess it is not a much-needed feature for a language, so it is better to spend the time and effort on other improvements and optimizations.

In my opinion simplicity is always a good thing in designing anything. Keep in mind that it is easier to add a feature than to remove it. Once you allow nested comments and there are programs out there using it, you won't be able to take them out without breaking compatibility.

alexrs
  • 1
    +1 for "easier to add a feature than to remove it". – R.. GitHub STOP HELPING ICE Jun 08 '11 at 00:49
  • 4
    once you disallow nested comments you can't allow them as well because it'll break such comments: `/*/**/` – RiaD Jun 19 '14 at 20:23
  • @alexrs: it's not that much work to keep a counter of occurrences of comment start tokens and decrease it whenever a comment end token is encountered. Speed is not an issue here; compile time wouldn't be affected. That being said, I agree with simplicity in design, but personally find it useful to be able to comment out code *while still coding* (commented out code would stay there only temporarily and be removed as soon as possible, without making it into releases). But design-wise, in this specific case, adding nested comments later as a feature would break compatibility as well. – Davide Cannizzo Jul 08 '21 at 19:46
1

One probable reason is that nested comments must be handled by the parser, since the flavor of regular expressions commonly used in lexers doesn't support recursion. Non-nested comments can be eliminated as whitespace by the lexer, so they're simpler to implement that way.
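
To illustrate the limitation, here is a sketch of the kind of non-recursive pattern a regex-based lexer might use for block comments. The pattern and the name `strip_flat_comments` are assumptions for demonstration, not taken from any specific lexer generator.

```python
import re

# Non-recursive pattern: an opener, then the shortest run of text up to the first closer.
FLAT_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)


def strip_flat_comments(src: str) -> str:
    """Remove non-nesting /* ... */ comments using a plain regular expression."""
    return FLAT_COMMENT.sub("", src)


print(repr(strip_flat_comments("a /* x */ b")))            # flat comment: removed cleanly
print(repr(strip_flat_comments("a /* x /* y */ z */ b")))  # nested input: stops at the first "*/"
```

On the nested input, the match ends at the first `*/`, so ` z */` is left behind as if it were code. Matching the right closer requires counting, which a pattern without recursion cannot do for arbitrary depth.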

hammar
  • 3
    It's not the "flavor". The word "regular" in regular expression inherently excludes recursion. – R.. GitHub STOP HELPING ICE Jun 08 '11 at 00:48
  • 4
    @R: In mathematics, sure. But in programming, we have things that we call regexes that do support recursion. – amara Jun 08 '11 at 19:33
  • The question is: Is this even an issue? Most languages already have to deal with nesting parentheses. To name some: Lisp, C, Java, Python, Ruby, Perl. – Thomas Eding Apr 23 '14 at 20:34
  • Nested parentheses are fine, because the things inside the parentheses are the same as the stuff outside: normal tokens. In comments, you don't have tokens, you just have text. You need to be able to match the start and end comment tokens so that you know whether 'int' is a type or just a word in a comment. (Especially if you eliminate comments in the lexer.) – Alan Shutko Jun 19 '14 at 19:12
  • @R.. Are you sure "regular" means "non-recursive"? To me "regular" is just the adjectival form of "rule": regular expression = an expression of a rule about how to match. – ThePopMachine Jun 19 '14 at 20:48
  • 2
    @ThePopMachine: I'm sure of what I stated, that regular has a defined formal meaning, not the meaning you're using, and that the "regular" in "regular expression" was chosen for this meaning. Being non-recursive is one result of its definition. – R.. GitHub STOP HELPING ICE Jun 19 '14 at 20:57
  • 1
    @R.. : Maybe you're right about the origin of the term, but you're the one who picked on the word "flavour". In a programming context, where regular expression clearly doesn't (always) mean what you say, why would you pick on what is effectively just a clarifying term? – ThePopMachine Jun 20 '14 at 14:45
0

Languages with nested comments: D, Haskell, Raku, Swift, Julia, Dart, Kotlin - to pick a few common ones. C has #if 0 ... #endif, which works similarly to nested comments, which means C++ also has it.

So it seems like the modern style is to include them, as they're generally seen as a good idea. And comment syntax evolves: remember that C didn't originally have //; it came from C++.

There are some claims elsewhere that lexing would be harder with nested comments. Given that it's simple enough to do it manually AND it's straightforward to do in a tool like flex (for a solution, see this Stack Overflow question: https://stackoverflow.com/questions/34493467/how-to-handle-nested-comment-in-flex) it seems simple hard facts contradict that idea.

Not to mention that most popular languages use handwritten parsers and lexers: https://notes.eatonphil.com/parser-generators-vs-handwritten-parsers-survey-2021.html.

Nuoji
  • If a lexer keeps track of nested structures then it is by definition not a lexer anymore. – JacquesB Aug 11 '21 at 14:36
  • A lexer is a program that takes characters and transforms them into tokens. I have no idea what you got your definition from, but it is the first time I ever heard of that idea. – Nuoji Aug 20 '21 at 18:46
0

Once you define precisely what is and what isn’t a comment, it’s easy enough to implement. Nested comments would be slightly harder, but not so much that it would be a big deal.

So what’s the problem? Describing exactly what is and what isn’t a comment. And giving a description where you don’t have unexpected nested comments, when the user didn’t intend it.

gnasher729
0

Who knows? I would guess because supporting nested comments is more work - you would have to maintain a stack of some sort, and because it complicates the language grammar.

Neil Butterworth
  • Not sure why this got downvoted. This is indeed at least part of the answer. At least historically, compilers have often been implemented as a "lexer" such as lex or flex (which typically understands a regular grammar, i.e., roughly, regexes *without* enhancements such as nesting) followed by a "parser" such as yacc or bison (which typically understands a LALR(1) grammar). A comment is typically recognized as a token in the lexer stage, which can't do nesting. So non-nestability of comments naturally falls out of these historical design constraints. – Don Hatch Jan 16 '20 at 10:30
-1

Nested comments mean extra work for the parser. Usually when you see the start of a comment you ignore everything until the end comment marker. In order to support nested comments you have to parse the text in the comments as well. The biggest issue, though, is that a programmer has to be careful to close all nested comments correctly or it will lead to compilation errors. Correctly implementing a compiler is something that can be done but keeping track of nested comments as a programmer is quite error-prone and irritating.

nobody
Gus