Why C++ to write a compiler?

Question

I was wondering why C++ is a good choice for writing a compiler. Of course C is good for this purpose too, since many compilers are written in either C or C++, but I am more interested in C++ this time. Are there any good reasons? I searched the Internet but could not find any.

"Many compilers are written [...] in C++" - any references? Which ones? What makes you think C++ is more often used for compiler construction than other popular languages? — Doc Brown, May 26 '12 at 05:41
@DocBrown Well, Clang and MSVC are written mostly in C++, GCC has a bit of C++ in it now, and the Java JVM is written in C++: http://stackoverflow.com/questions/410320/what-is-java-written-in and also http://superuser.com/questions/136136/what-language-are-compilers-written-with — Klaim, May 26 '12 at 07:52
@DocBrown DMD the reference compiler for D is written in C++ — ratchet freak, May 26 '12 at 17:36
@Phil Do you think they made this choice without any consideration of alternatives? It's not a "good" choice, it's an "efficient" choice. — Klaim, May 27 '12 at 11:34
The reason for this is mostly historical. The original lingua franca of Unix is C, and a C compiler is almost always one of the first things to be ported to a new operating system. This makes compilers written in C much more easily portable to new platforms than compilers written in other languages. C++ is usually the next language to be ported because of its similarity to C. — Lie Ryan, Jan 29 '20 at 05:15

score 28 · Accepted Answer · answered May 26 '12 at 05:27

C++ has two sides to it. It has a low-level side, which makes it seem like a natural language for low-level tasks like code generation. It also has a high-level side (which C does not) that lets you structure a complex application (like a compiler) in a logical, object-oriented way while still maintaining performance. Because it has both low-level and high-level aspects, it's a good choice for large applications that require low-level features or performance.
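As a rough illustration of the two sides working together, here is a minimal sketch. Everything in it is an assumption made for the example: the opcodes belong to an imaginary stack machine (not a real instruction set), and `Emitter`, `Expr`, `Num`, `BinOp` and `compile_example` are hypothetical names. The AST is an ordinary class hierarchy, while the emitter just pushes raw bytes into a buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical opcodes for an imaginary stack machine (not a real ISA).
enum : std::uint8_t { OP_PUSH = 0x01, OP_ADD = 0x02, OP_MUL = 0x03 };

// Low-level side: a thin wrapper around a raw byte buffer.
class Emitter {
public:
    void push(std::uint8_t value) {
        code_.push_back(OP_PUSH);
        code_.push_back(value);
    }
    void add() { code_.push_back(OP_ADD); }
    void mul() { code_.push_back(OP_MUL); }
    const std::vector<std::uint8_t>& code() const { return code_; }

private:
    std::vector<std::uint8_t> code_;
};

// High-level side: AST nodes as a small class hierarchy.
struct Expr {
    virtual ~Expr() = default;
    virtual void emit(Emitter& out) const = 0;
};

struct Num : Expr {
    std::uint8_t value;
    explicit Num(std::uint8_t v) : value(v) {}
    void emit(Emitter& out) const override { out.push(value); }
};

struct BinOp : Expr {
    char op;  // '+' or '*'
    std::unique_ptr<Expr> lhs, rhs;
    BinOp(char o, std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
        : op(o), lhs(std::move(l)), rhs(std::move(r)) {}
    void emit(Emitter& out) const override {
        assert(op == '+' || op == '*');
        lhs->emit(out);  // operands first (postfix order for a stack machine)
        rhs->emit(out);
        if (op == '+') out.add(); else out.mul();
    }
};

// Builds the tree for (2 + 3) * 4 and returns the emitted bytes.
std::vector<std::uint8_t> compile_example() {
    auto sum = std::make_unique<BinOp>('+', std::make_unique<Num>(2),
                                       std::make_unique<Num>(3));
    BinOp product('*', std::move(sum), std::make_unique<Num>(4));
    Emitter out;
    product.emit(out);
    return out.code();
}
```

Calling `compile_example()` yields the byte sequence PUSH 2, PUSH 3, ADD, PUSH 4, MUL. The same structure could be written in C with tagged unions and function pointers; the sketch only shows that C++ lets both levels sit side by side.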

Oleksi

As far as I know a lot of the logic inside a compiler is of functional nature (transforming complex data structures into other data structures) so I am not sure if object-oriented facilities (which are more targeted to programming-in-the-large, architectural aspects) bring a real advantage to compiler construction with respect to a procedural programming style. Just my 2 cents. – Giorgio May 26 '12 at 05:36
@Giorgio Having objects helps in a lot of other aspects of compiler writing. For example, there's a lot of state a compiler has to deal with when optimizing and that kind of stuff lends itself well to OOP. Also, OOP and Functional programming can be quite complementary, so just because the algorithms might be mostly functional, doesn't mean that objects won't help. – Oleksi May 26 '12 at 05:39
I really do not have enough experience to give an authoritative answer (see my final comment: just my 2 cents). By _complex data structures_ I mean something similar to _having a lot of state_ that needs to be transformed. On the other hand, I cannot imagine that there is a lot of _behavior_, _polymorphism_, _inheritance_ in a compiler, but rather data structures, function application, function composition. Again, my very humble 2 cents, I really have little experience with compilers. – Giorgio May 26 '12 at 05:45
@Giorgio Really, you can write a compiler with mostly OOP or mostly Functional concepts, but the best solution would probably be to use the best of both worlds. I wrote a toy compiler for a C-like language in Scheme once, and I wished that it had some Object Oriented features. Scheme handled some things very naturally (parsing and lexing come to mind), but with others (specifically code generation and optimization), I found Scheme a little awkward. I thought that some object oriented concepts would have really helped. – Oleksi May 26 '12 at 05:48
Makes me want to try out new stuff. I wish I had the time to work on a toy compiler. :-) – Giorgio May 26 '12 at 05:53
@Giorgio I did it as part of my degree. It's good fun. :) – Oleksi May 26 '12 at 05:54
@Giorgio and Oleksi: I can confirm what both of you say. I wrote a compiler in Haskell for a real-world language. It was a really good fit. But sometimes I missed some OO around. If I had to write another compiler I'd definitely choose Haskell, but this is really a special case. I would not choose Haskell without hesitation for other types of projects. – scarfridge May 26 '12 at 07:19
@scarfridge: Great comment! I think there is no programming language that is good for all kinds of projects and the best thing we can do is to learn as much as we can and choose each time what we consider the most appropriate solution for a specific problem. – Giorgio May 26 '12 at 08:28
Why do you need to have a language with a "low-level side" to do code generation? I can't see how these two are connected in any way. – phant0m May 26 '12 at 08:52
You don't need a "low-level side" to do code generation any more than you need Unicode *identifiers* to be able to write Japanese text to a file. – dan04 May 26 '12 at 16:05
@phant0m: The "low-level" abilities make it possible to rewrite your hotspots (after you've found them with a profiler) for higher performance, at the cost of being less abstract and harder to maintain. – Ben Voigt May 27 '12 at 03:26
@BenVoigt That makes no sense. While you may be able to optimize the heck out of the time and/or memory requirements for generating the code, this does not explain why low-level programming helps with code generation. *Any* piece of code can be optimized somehow, code generation is almost universally a pretty simple step that's no bottleneck (*perhaps* in one-pass compilers, but even then I'd look into the parser first), and the speed of the compiler itself does not affect the speed of the generated code (which is usually more important). – May 27 '12 at 10:09
@delnan "Any piece of code can be optimized somehow": actually, no, and that's the point of having both low-level AND high-level abstractions: you can optimize at whichever level allows the most effective optimizations (which can be any level). You cannot have access to all optimization opportunities without access to all levels of abstraction. Also, C++ templates allow a higher level of abstraction than most languages, and that again allows more optimizations (see C++ `std::sort` vs C's `qsort`). – Klaim May 27 '12 at 11:31
@dan04 Well in theory you don't need it, in practice you want it to be fast. – Klaim May 27 '12 at 11:35
@Klaim Yes, I'm sorry I didn't make that clear. Yes, you are right. But that applies to *any* task, it has nothing to do with code generation so it seems very strange to talk exclusively about code generation when discussing it. This answer speaks of low-level programming being a natural fit for "low-level thing[s]" like code generation. Since you seem to agree, could you please explain how this is the case? – May 27 '12 at 11:48
@delnan Maybe what is implied but not said explicitly is that most languages are not extremely fast to parse when the compiler is written in most languages. For example, I worked on domain-specific scripts, and writing the compiler in C# was a good choice because the script language was really, really simple. But for general-purpose programming languages, the complexity, even for languages whose syntax is designed to be easy to parse, is far higher than for small domain-specific languages. So you need the compiler to be as fast as possible. So you need both high- and low-level abstractions to allow optimization of the compiler. – Klaim May 27 '12 at 12:50
@Klaim Reading the answer again, I can't find anything which remotely hints at this intent - the *only* pass that's mentioned is code generation, and the performance of that has nothing to do with the performance of parsing. I want to understand why Oleksi seems to state that code generation is a task for low-level languages. None of your comments help with that; please think twice before commenting further. – May 27 '12 at 13:19
@delnan That is not what I read in his answer. His answer talks about two parts, high- and low-level abstractions. What I think it tries to say is that having both together gives you an advantage: a complex but manageable structure (through high-level abstraction) and optimization opportunities (through both high- and low-level abstractions). – Klaim May 27 '12 at 14:45
All that I meant to imply was that it felt more natural to do something like pushing out machine instruction bytes (which felt like a low-level operation) in a low-level language. When I did it with Scheme it felt awkward. Because manipulating bytes and pushing them out is a lot more common in C than in Scheme, there were better libraries available that made the code much cleaner. Perhaps the code would not have been as awkward in another high-level language, or perhaps I just didn't know Scheme well at the time. – Oleksi May 27 '12 at 16:29
@Oleksi Thanks for answering. I tend to agree with that observation, though a thin abstraction layer over the actual bytes (which is a good idea in any language as it's DRY and aids readability) can work wonders. `emit_ADD(reg1, reg2); emit_JUMP(label1)` reads pretty high-level to me and is pretty close to how at least PyPy's Python compiler reads. – May 28 '12 at 10:33
@delnan There's probably plenty more that I could've done to make it readable. I guess I'll know for next time. :) – Oleksi May 28 '12 at 13:46
Please try to avoid extended discussions in comments. To discuss this further please visit the chat room. – maple_shaft May 28 '12 at 18:38

score 18 · Answer 2 · answered May 26 '12 at 10:09

My experience does not agree with your premise here. In fact, for high-level general-purpose languages, it is a very common practice to write the compiler in the same language as the source language (the language being compiled). For example:

Sun's Java compiler is written in Java
The Scala compiler is written in Scala
Mono's C# compiler is written in C#
Squeak's Smalltalk compiler is written in Smalltalk
... and many more

An exception is compiler front-ends written for existing compiler frameworks, such as GCC, LLVM or Polyglot, which are then written in the framework's language, or compilers that rely on existing parser generators such as Yacc. Since GCC, LLVM and Yacc are common, established tools written in C and C++, it gives an incentive to compiler writers to use them, which might lead to C and C++ getting a large share in the compiler implementation language distribution.

Oak

I think that has much more to do with the people writing the compiler knowing the language they are compiling well and liking it a lot than with objective technical reasons. – Andreas Bonini May 26 '12 at 18:40
@Krelp I agree it's not about an objective technical reason, but it's not really "liking", either - it's just considered some rite of passage for a language - "is it mature enough to be able to serve as the implementation language of its own compiler". – Oak May 26 '12 at 18:44
Sun's Java compiler is written in C++: http://stackoverflow.com/questions/410320/what-is-java-written-in – Klaim May 27 '12 at 11:36
@Klaim you are confusing two products here. One is Sun's Java compiler (`javac` command-line), which compiles Java to Java Bytecode. It is written in Java - I have modified it many times myself and you can [browse its Java sources online](http://hg.openjdk.java.net/jdk7/jdk7-gate/langtools/file/ce654f4ecfd8/src/share/classes/com/sun/tools/javac/). The other is the just-in-time compiler embedded in the Hotspot JVM, which compiles Java *Bytecode* to native machine code. Like most of the JVM it's written in C++, but it's **not a Java compiler** - in fact, it knows nothing about the Java language. – Oak May 27 '12 at 12:17
@Oak, absolutely correct! In other words, JVM != javac – Paul Draper Feb 21 '15 at 22:17
gcc and Clang both compile many languages. Yet they are written in one language only. – gnasher729 Jan 30 '20 at 19:05
It's not simply knowing the language well and liking it - there is a good software engineering advantage to bootstrapping a compiler in its own language - the compiler writers themselves _use_ the compiler _every day to get their work done_ - "eating their own dogfood" - therefore there is great incentive to get it correct first and highly performant second. Otherwise, compilers being very complex software, there can be a tendency to leave some bugs untouched while you work on something more "interesting" (and mislead yourself into believing it is more "important"). Seen it happen many times. – davidbak Sep 02 '20 at 02:13

score 6 · Answer 3 · answered May 26 '12 at 05:26

To compile what to what? A compiler transforms source code from one language (the source language) to another (the destination language), which says nothing about how low-level the destination language is.

CoffeeScript compiles to JavaScript, the compiler being written in CoffeeScript.
Script# compiles C# into JavaScript, the compiler being written in, if I remember well, C#.
etc.

The language you pick to write a compiler depends on the context. For example, working on a project which compiles a language derived from PHP to native PHP code, I used a mix of PHP and C# to write the compiler, because it made the most sense for me given my skills. Another person would pick Python, or Java and PHP, or C++ with a bit of JavaScript, or whatever.

C or C++ is a popular choice because of the support of compiler related tools (see the answer by Telastyn), and because those two languages allow you to go really native. But there is nothing wrong in choosing another language.

Note that, to be more geeky, you may write the compiler in the source language itself. That's what happened with the CoffeeScript compiler and many other compilers. This is also popular with IDEs: one of the first versions of Visual Studio was built using Visual Studio itself.

Self-hosting is not geeky, it is an important property for porting a compiler. — , May 26 '12 at 08:49
The reason is, that it immediately enables the compiler itself to be a test program. It will most likely also be the largest program for that compiler for quite a while. — , May 26 '12 at 10:27

score 6 · Answer 4 · answered May 26 '12 at 06:01

I tend to question the basic premise here. While C and C++ work perfectly well for writing compilers, quite a few other languages seem to work perfectly well for the task as well.

A bit depends on the language you're compiling though. For small, simple languages, C and Pascal work quite nicely. If you're going to compile something big and complex, your compiler gets big and complex too -- in which case, C++'s extra features for organizing and working with larger programs obviously come in handy. That isn't really very specific to compiling though, just features useful for larger programs in general.

I think it's also worth mentioning one other point. Beginners (seem to) think of compilers as mostly doing text manipulation, so they think something like Perl will be a massive help in writing compilers. In reality, most of the interesting parts of compilation don't really start until after you've built your AST. While I'm sure Perl can do the job perfectly well, its text manipulation capability doesn't really give it a huge advantage either (text manipulation is mostly in the lexer, and lexer generators for things like C all support REs anyway).
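To make the point about work starting after the AST a bit more concrete, here is a small sketch of one typical post-parsing pass, constant folding. It is only an illustration under stated assumptions: the `Node` layout and the helpers `literal`, `variable`, `binary` and `fold` are hypothetical, and a real compiler would have many more node kinds and checks (overflow, for instance). Note that no text manipulation is involved at all.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// A deliberately tiny, hypothetical AST: literals, variables, binary operators.
struct Node {
    enum Kind { Literal, Variable, Binary };
    Kind kind = Literal;
    long value = 0;                  // used by Literal
    std::string name;                // used by Variable
    char op = 0;                     // used by Binary: '+' or '*'
    std::unique_ptr<Node> lhs, rhs;  // used by Binary
};

std::unique_ptr<Node> literal(long v) {
    auto n = std::make_unique<Node>();
    n->kind = Node::Literal;
    n->value = v;
    return n;
}

std::unique_ptr<Node> variable(std::string name) {
    auto n = std::make_unique<Node>();
    n->kind = Node::Variable;
    n->name = std::move(name);
    return n;
}

std::unique_ptr<Node> binary(char op, std::unique_ptr<Node> l,
                             std::unique_ptr<Node> r) {
    assert(op == '+' || op == '*');
    auto n = std::make_unique<Node>();
    n->kind = Node::Binary;
    n->op = op;
    n->lhs = std::move(l);
    n->rhs = std::move(r);
    return n;
}

// Constant folding: replace any operator whose operands are both literals
// with a single literal, working bottom-up through the tree.
std::unique_ptr<Node> fold(std::unique_ptr<Node> n) {
    if (n->kind != Node::Binary) return n;
    n->lhs = fold(std::move(n->lhs));
    n->rhs = fold(std::move(n->rhs));
    if (n->lhs->kind == Node::Literal && n->rhs->kind == Node::Literal) {
        long l = n->lhs->value;
        long r = n->rhs->value;
        return literal(n->op == '+' ? l + r : l * r);
    }
    return n;
}
```

For example, folding the tree for `x + 2 * 3` leaves the addition in place (because `x` is unknown) but replaces the multiplication with the literal `6`.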

AST = Abstract Syntax Tree, RE = Regular Expressions — chaotic3quilibrium, May 29 '12 at 14:50

score 5 · Answer 5 · answered May 26 '12 at 05:25

Compilers may be implemented in any modern language. However, one of the most important requirements for a compiler is to be fast.

C++ has a clear advantage here. Optimization in C++ does not come cheap. However, due to the low-level nature of this language, it is possible to manually optimize C++ code more than in any other language (except Assembly, which is not portable).

edited May 26 '12 at 19:04

Lior Kogan

Another important requirement is for the code generated to be correct - I'd rather have a slow compiler I can trust than a fast one which generates incorrect code. – May 26 '12 at 10:24
While it is certainly _possible_ to optimize C++ very heavily, there's a lot of rather… well… less than optimal C++ code out there. – Donal Fellows May 26 '12 at 18:29
@DonalFellows Turn it the other way around: it is possible to write less than optimal code in any language, but there are optimizations that are impossible to enable in other languages than C++ (other than Assembler. I don't include C because of lack of high level structures allowing stronger inlining). – Klaim May 27 '12 at 15:30
@user1249 - There is no reason that the speed of C++ code would make it any more buggy. I'd rather have a fast, correct compiler than a slow, correct compiler. – gnasher729 Jan 30 '20 at 19:07

score 3 · Answer 6 · answered May 26 '12 at 04:25

I suspect that the prime motivator for their use is that Lex/Yacc/Bison output is (primarily) in C. Since that's been the standard for so long, it has momentum.

Not that those are particularly good reasons...

Telastyn

Actually it doesn't satisfy me, but thanks for trying. – terenaa May 26 '12 at 05:23
That does not answer the question "why to choose C++ over C for compiler construction". – Doc Brown May 26 '12 at 05:43
It's not a good reason at all. Analogous tools to Lex and Yacc exist for many platforms. PLY and ANTLR, for example. – user16764 May 26 '12 at 05:58
Moreover, most popular real-world compilers (I'm pretty certain about Clang and GCC, for instance) use hand-written parsers. – May 26 '12 at 08:55
@delnan: Yes but they probably started off using a generated one to get things off the ground. The hand generation of the parser is an optimization step you don't really want to do until you can prove other things are working. – Martin York May 27 '12 at 01:43
@LokiAstari You have a point (certainly with GCC, though I think Clang was hand-written from the very beginning). I may be the exception, but for me it's easier to hand-write a Pratt parser (basically recursive descent, but it handles associativity and precedence very easily) than to spend days (I'm no good at this) transforming the grammar I want into one a parser generator accepts. – May 27 '12 at 09:53
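
For readers who have not seen one, the Pratt approach mentioned in the last comment can be sketched in a few dozen lines. This is a hypothetical, minimal evaluator written for this thread (it is not how GCC or Clang are structured): it handles non-negative integers, parentheses, left-associative `+`, `-`, `*`, and a right-associative `^`. The names `PrattParser`, `binding_power` and `evaluate` are made up for the example.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

class PrattParser {
public:
    explicit PrattParser(std::string text) : text_(std::move(text)) {}

    long parse() {
        long value = expression(0);
        skip_spaces();
        if (pos_ != text_.size()) throw std::runtime_error("unexpected trailing input");
        return value;
    }

private:
    // Higher binding power binds tighter; 0 means "not a binary operator".
    static int binding_power(char op) {
        switch (op) {
            case '+': case '-': return 10;
            case '*': return 20;
            case '^': return 30;
            default: return 0;
        }
    }
    static bool right_associative(char op) { return op == '^'; }

    static long apply(char op, long lhs, long rhs) {
        switch (op) {
            case '+': return lhs + rhs;
            case '-': return lhs - rhs;
            case '*': return lhs * rhs;
            default: {
                assert(op == '^');
                if (rhs < 0) throw std::runtime_error("negative exponent");
                long result = 1;
                for (long i = 0; i < rhs; ++i) result *= lhs;
                return result;
            }
        }
    }

    void skip_spaces() {
        while (pos_ < text_.size() &&
               std::isspace(static_cast<unsigned char>(text_[pos_]))) ++pos_;
    }

    // Parses a number or a parenthesized sub-expression.
    long primary() {
        skip_spaces();
        if (pos_ >= text_.size()) throw std::runtime_error("unexpected end of input");
        if (text_[pos_] == '(') {
            ++pos_;
            long value = expression(0);
            skip_spaces();
            if (pos_ >= text_.size() || text_[pos_] != ')')
                throw std::runtime_error("expected ')'");
            ++pos_;
            return value;
        }
        if (!std::isdigit(static_cast<unsigned char>(text_[pos_])))
            throw std::runtime_error("expected a number");
        long value = 0;
        while (pos_ < text_.size() &&
               std::isdigit(static_cast<unsigned char>(text_[pos_]))) {
            value = value * 10 + (text_[pos_] - '0');
            ++pos_;
        }
        return value;
    }

    // Core loop: keep consuming operators that bind at least as tightly
    // as min_power; the recursion depth encodes precedence.
    long expression(int min_power) {
        long lhs = primary();
        for (;;) {
            skip_spaces();
            if (pos_ >= text_.size()) break;
            char op = text_[pos_];
            int power = binding_power(op);
            if (power == 0 || power < min_power) break;
            ++pos_;
            // Left-associative: the right operand must bind strictly tighter.
            long rhs = expression(right_associative(op) ? power : power + 1);
            lhs = apply(op, lhs, rhs);
        }
        return lhs;
    }

    std::string text_;
    std::size_t pos_ = 0;
};

long evaluate(const std::string& source) { return PrattParser(source).parse(); }
```

Precedence and associativity live entirely in `binding_power` and `right_associative`, so adding an operator means adding a table entry rather than rewriting a grammar, which is the convenience the comment describes.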

score 1 · Answer 7 · answered Jan 29 '20 at 04:53

I have experience with this matter: I have written compilers in both C and C++. The main difference between C and C++ here is that C has no automatic dynamic memory management. All memory management in C has to be done explicitly. Writing a compiler involves a lot of string processing and array management. In C you are forced to think about the size of every string and every array you declare, and to check indexes when you access those objects (if you want your code to be safe and stable). You can have dynamic memory management in C, of course, but nothing is automatic: you have to explicitly allocate and free memory using malloc() and free(), and keep the sizes of your dynamic objects in separate variables to be sure you do not access them out of bounds.

In C++ you have the same mechanisms, but development is much more time-efficient because all your memory management can be encapsulated within constructors and destructors, which you do not have to call explicitly. So the compiler allocates and frees resources for you. The size of your dynamic objects can be encapsulated as well if you create your own classes, and indexes can be bounds-checked by overloading operator[]. These abstractions make your code cleaner and easier to understand and debug, and definitely make development faster.
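
A minimal sketch of what that encapsulation might look like, assuming a hypothetical `IntBuffer` class written only for this illustration: the constructor allocates, the destructor frees, the size travels with the object, and `operator[]` rejects out-of-range indexes.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical helper class: owns a heap array and remembers its size.
class IntBuffer {
public:
    explicit IntBuffer(std::size_t size)
        : data_(new int[size]()), size_(size) {}  // allocate and zero-fill
    ~IntBuffer() { delete[] data_; }              // freed automatically

    // Copying is disabled to keep the sketch short and avoid double frees.
    IntBuffer(const IntBuffer&) = delete;
    IntBuffer& operator=(const IntBuffer&) = delete;

    std::size_t size() const { return size_; }

    // Bounds-checked element access.
    int& operator[](std::size_t index) {
        if (index >= size_) throw std::out_of_range("IntBuffer index out of range");
        return data_[index];
    }

private:
    int* data_;
    std::size_t size_;
};

// No malloc()/free() and no separate length variable at the call site.
int sum_of_squares(std::size_t count) {
    IntBuffer squares(count);
    for (std::size_t i = 0; i < squares.size(); ++i)
        squares[i] = static_cast<int>(i * i);
    int total = 0;
    for (std::size_t i = 0; i < squares.size(); ++i)
        total += squares[i];
    return total;
}  // squares is released here, even if an exception had been thrown
```

In everyday code you would rarely write such a class yourself, since `std::vector` and `std::string` already provide the same guarantees (with `at()` for checked access); the sketch only shows the mechanism the answer describes.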

If you create a compiler in C, it will surely take you more time; C++ will let you finish your project sooner. C and C++ have the same performance, but C++ has a lot of advantages that C does not have.

Basile Starynkevitch · Answer 8 · 2020-01-29T14:25:33.697

The CompCert project is a research C compiler which is not written in C or in C++, but mostly in OCaml and Coq.

Observe that C++ used to be translated to C (in Cfront). Now you may use the GCC front-end to GIMPLE, then dump the GIMPLE to some database, then write a translator from GIMPLE to your assembler. But legal reasons (the GCC runtime library exception) require such a compiler to be open source. Ask your lawyer for details; I am not a lawyer. Old variants of GCC have been written in C (+ several domain specific languages) with a front-end for some variant of C++. OpenWatcom could be a C++ compiler written in C (I leave you to check that).

The sources of CompCert are freely available for academic and research purposes. If you want to use it industrially (and legally), you need to get a license from AbsInt.

See also this and that answers to two related questions.

If I were tasked in 2020 to write a C (or C++) compiler from scratch (running on Linux, maybe some cross-compiler) I probably wouldn't write it in C++. I would consider writing it using OCaml, Go, or Rust. And I could base it on Frama-C if so allowed. If required to code in C or C++ I would first code a garbage collector library for it, probably some persistence layer (very useful for whole program optimization), and then I would consider a metaprogramming approach (generating most of the C or C++ code of the compiler with my ad-hoc tools, maybe Bismon or RefPerSys if so allowed).

You could find some (more or less open source) C compilers coded in Common Lisp or in Python (e.g. ShivyC or nqcc). Look also into ZetaC.

Notice that recent versions of GCC are technically not coded in pure C++: there are a dozen domain-specific languages involved in GCC (several of them Turing-complete). See also my old GCC MELT project.

I wouldn't be surprised if, in future versions of GCC, some Python or Guile interpreter were embedded inside them (for example, as a replacement for the pass manager of GCC).

Look also into the MILEPOST GCC project.
