282

Advanced compilers like gcc compile code into machine readable files according to the language in which the code has been written (e.g. C, C++, etc). In fact, they interpret the meaning of the code according to library and functions of the corresponding languages. Correct me if I'm wrong.

I wish to better understand compilers by writing a very basic compiler (probably in C) to compile a static file (e.g. Hello World in a text file). I tried some tutorials and books, but all of them are for practical cases. They deal with compiling dynamic code with meanings connected with the corresponding language.

How can I write a basic compiler to convert a static text into a machine readable file?

The next step will be introducing variables into the compiler; imagine that we want to write a compiler which compile only some functions of a language.

Introducing practical tutorials and resources is highly appreciated :-)

user438383
  • 103
  • 4
Googlebot
  • 3,173
  • 5
  • 19
  • 14
  • 6
    Did you see http://programmers.stackexchange.com/questions/66485/best-online-resources-to-learn-about-compilers and http://programmers.stackexchange.com/questions/138089/what-would-be-the-best-way-to-learn-about-compilers-and-executable-formats – Mat Sep 20 '12 at 16:06
  • Have you tried lex/flex and yacc/bison? – mouviciel Sep 20 '12 at 18:51
  • 21
    @mouviciel: That's not a good way to learn about building a compiler. Those tools do a significant amount of the hard work for you, so you never actually do it and learn how it's done. – Mason Wheeler Sep 20 '12 at 22:04
  • 21
    @Mat interestingly, first of your links gives 404, while the second is now marked as duplicate of _this_ question. – Ruslan Feb 10 '16 at 09:10
  • Too old answers. New approach - https://tomassetti.me/why-you-should-not-use-flex-yacc-and-bison/ – T.Todua Dec 20 '20 at 15:51

7 Answers7

425

Intro

A typical compiler does the following steps:

  • Parsing: the source text is converted to an abstract syntax tree (AST).
  • Resolution of references to other modules (C postpones this step till linking).
  • Semantic validation: weeding out syntactically correct statements that make no sense, e.g. unreachable code or duplicate declarations.
  • Equivalent transformations and high-level optimization: the AST is transformed to represent a more efficient computation with the same semantics. This includes e.g. early calculation of common subexpressions and constant expressions, eliminating excessive local assignments (see also SSA), etc.
  • Code generation: the AST is transformed into linear low-level code, with jumps, register allocation and the like. Some function calls can be inlined at this stage, some loops unrolled, etc.
  • Peephole optimization: the low-level code is scanned for simple local inefficiencies which are eliminated.

Most modern compilers (for instance, gcc and clang) repeat the last two steps once more. They use an intermediate low-level but platform-independent language for initial code generation. Then that language is converted into platform-specific code (x86, ARM, etc) doing roughly the same thing in a platform-optimized way. This includes e.g. the use of vector instructions when possible, instruction reordering to increase branch prediction efficiency, and so on.

After that, object code is ready for linking. Most native-code compilers know how to call a linker to produce an executable, but it's not a compilation step per se. In languages like Java and C# linking may be totally dynamic, done by the VM at load time.

Remember the basics

  • Make it work
  • Make it beautiful
  • Make it efficient

This classic sequence applies to all software development, but bears repetition.

Concentrate on the first step of the sequence. Create the simplest thing that could possibly work.

Read the books!

Read the Dragon Book by Aho and Ullman. This is classic and is still quite applicable today.

Modern Compiler Design is also praised.

If this stuff is too hard for you right now, read some intros on parsing first; usually parsing libraries include intros and examples.

Make sure you're comfortable working with graphs, especially trees. These things is the stuff programs are made of on the logical level.

Define your language well

Use whatever notation you want, but make sure you have a complete and consistent description of your language. This includes both syntax and semantics.

It's high time to write snippets of code in your new language as test cases for the future compiler.

Use your favorite language

It's totally OK to write a compiler in Python or Ruby or whatever language is easy for you. Use simple algorithms you understand well. The first version does not have to be fast, or efficient, or feature-complete. It only needs to be correct enough and easy to modify.

It's also OK to write different stages of a compiler in different languages, if needed.

Prepare to write a lot of tests

Your entire language should be covered by test cases; effectively it will be defined by them. Get well-acquainted with your preferred testing framework. Write tests from day one. Concentrate on 'positive' tests that accept correct code, as opposed to detection of incorrect code.

Run all the tests regularly. Fix broken tests before proceeding. It would be a shame to end up with an ill-defined language that cannot accept valid code.

Create a good parser

Parser generators are many. Pick whatever you want. You may also write your own parser from scratch, but it only worth it if syntax of your language is dead simple.

The parser should detect and report syntax errors. Write a lot of test cases, both positive and negative; reuse the code you wrote while defining the language.

Output of your parser is an abstract syntax tree.

If your language has modules, the output of the parser may be the simplest representation of 'object code' you generate. There are plenty of simple ways to dump a tree to a file and to quickly load it back.

Create a semantic validator

Most probably your language allows for syntactically correct constructions that may make no sense in certain contexts. An example is a duplicate declaration of the same variable or passing a parameter of a wrong type. The validator will detect such errors looking at the tree.

The validator will also resolve references to other modules written in your language, load these other modules and use in the validation process. For instance, this step will make sure that the number of parameters passed to a function from another module is correct.

Again, write and run a lot of test cases. Trivial cases are as indispensable at troubleshooting as smart and complex.

Generate code

Use the simplest techniques you know. Often it's OK to directly translate a language construct (like an if statement) to a lightly-parametrized code template, not unlike an HTML template.

Again, ignore efficiency and concentrate on correctness.

Target a platform-independent low-level VM

I suppose that you ignore low-level stuff unless you're keenly interested in hardware-specific details. These details are gory and complex.

Your options:

  • LLVM: allows for efficient machine code generation, usually for x86 and ARM.
  • CLR: targets .NET, multiplatform; has a good JIT.
  • JVM: targets Java world, quite multiplatform, has a good JIT.

Ignore optimization

Optimization is hard. Almost always optimization is premature. Generate inefficient but correct code. Implement the whole language before you try to optimize the resulting code.

Of course, trivial optimizations are OK to introduce. But avoid any cunning, hairy stuff before your compiler is stable.

So what?

If all this stuff is not too intimidating for you, please proceed! For a simple language, each of the steps may be simpler than you might think.

Seeing a 'Hello world' from a program that your compiler created might be worth the effort.

Default
  • 103
  • 2
9000
  • 24,162
  • 4
  • 51
  • 79
  • 62
    This is one of the best answers I've seen yet. – gahooa Sep 20 '12 at 21:12
  • 13
    I think you missed a part of the question... The OP wanted to write a *very basic* compiler. I think you go beyond very basic here. – marco-fiset Sep 21 '12 at 12:01
  • 28
    @[marco-fiset](http://programmers.stackexchange.com/users/38148/marco-fiset), on the contrary, I think it's an outstanding answer that does tell the OP how to do a very basic compiler, while pointing out the traps to avoid and defining more advanced phases. – smci Mar 03 '13 at 22:50
  • @9000 All of this is meant to create a language that is compiled to some existing machine code, like the Java Byte Code, correct? – Aviv Cohn Mar 03 '14 at 08:58
  • 6
    This is one of the best answers I have ever seen in the entire Stack Exchange universe. Kudos! – airstrike Dec 08 '15 at 21:12
  • 6
    Seeing a 'Hello world' from a program that your compiler created might be worth the effort. -- INDEED – slier Aug 11 '16 at 02:56
  • I wish you could "favorite" comments.. – Matt C Nov 01 '16 at 13:36
  • 3
    This is perhaps the most complete and well written answer I've seen on the Stack! Especially in response to a very broad and nebulous question. I'm fascinated by the workings of compilers and Iearnt more in 5 minutes reading this than in hours of random googling. Bravo. – El Ronnoco Jun 24 '17 at 23:29
  • 1
    And in response to those saying that it goes beyond basic. A "basic compiler" is like saying a "simple theory of chemistry". It's not a basic topic. This answer lets you know the basics plus the intricacies. Double bravo. – El Ronnoco Jun 24 '17 at 23:32
  • 1
    Best answer on any stackexchange site I've seen so far – alejandrogiron Mar 24 '18 at 15:13
  • 1
    Dear 9000 , this answer is almost 10 years old. Can you please do a favor to community and update the answer, so it covers modern age approaches too (if there was any advance in that field). many thanks in advance! – T.Todua Dec 16 '20 at 11:37
  • 1
    @T.Todua: I don't have a lot to add, but certainly the answer can be improved; I'll try to do so. – 9000 Dec 28 '20 at 04:46
  • 9000 - there is excellent article : https://tomassetti.me/why-you-should-not-use-flex-yacc-and-bison/ – T.Todua Dec 28 '20 at 08:21
39

Jack Crenshaw's Let's Build a Compiler, while unfinished, is an eminently readable introduction and tutorial.

Nicklaus Wirth's Compiler Construction is a very good textbook on the basics of simple compiler construction. He focuses on top-down recursive descent, which, let's face it, is a LOT easier than lex/yacc or flex/bison. The original PASCAL compiler that his group wrote was done this way.

Other people have mentioned the various Dragon books.

John R. Strohm
  • 18,043
  • 5
  • 46
  • 56
  • 1
    One of the nice things about Pascal, is that everything has to be defined or declared before being used. Therefore it can be compiled in one pass. Turbo Pascal 3.0 is one such example, and there is a lot of documentation about the internals [here](http://www.pcengines.ch/tp3.htm). – tcrosley Oct 20 '15 at 02:19
  • 1
    PASCAL was specifically designed with one-pass compilation and linking in mind. Wirth's compiler book mentions multipass compilers, and adds that he knew of a PL/I compiler that took 70 (yes, seventy) passes. – John R. Strohm Oct 20 '15 at 16:24
  • Mandatory declaration before use dates back to ALGOL. Tony Hoare got his ears pinned back by the ALGOL committee when he tried to suggest adding default type rules, similar to what FORTRAN had. They already knew about the problems this could create, with typographical errors in names and default rules creating interesting bugs. – John R. Strohm Oct 20 '15 at 16:25
  • 2
    Here is a more updated and finished version of the book by the original author himself: http://www.stack.nl/~marcov/compiler.pdf Please edit your answer and add this :) –  Oct 27 '16 at 15:33
16

If you really want to write machine readable code only and not targeted to a virtual machine, then you will have to read Intel manuals and understand

  • a. Linking and Loading Executable code

  • b. COFF and PE formats(for windows), alternatively understand ELF format(for Linux)

  • c. Understand .COM file formats(easier than PE)
  • d. Understand assemblers
  • e. Understand compilers and code generation engine in compilers.

Much harder done than said. I suggest you to read Compilers and Interpreters in C++ as a starting point(By Ronald Mak). Alternatively, "lets build a compiler" by Crenshaw is OK.

If you don't want to do that, you could as well write your own VM and write a code generator targeted to that VM.

Tips: Learn Flex and Bison FIRST. Then go on to build your own compiler / VM.

Good Luck!

user16764
  • 3,583
  • 1
  • 25
  • 22
Aniket Inge
  • 916
  • 7
  • 14
  • 10
    I think targeting LLVM and not real machine code is about the best way available today. – 9000 Sep 20 '12 at 16:38
  • I agree, I have been following LLVM for sometime now and I should say it was one of the best things I had seen in years in terms of programmer effort needed to target it! – Aniket Inge Sep 20 '12 at 16:40
  • 2
    What about MIPS and use [spim](http://pages.cs.wisc.edu/~larus/spim.html) to run it? Or [MIX](http://www.gnu.org/software/mdk/)? –  Sep 20 '12 at 16:57
  • @MichaelT I have not used MIPS but I am sure it will be good. – Aniket Inge Sep 20 '12 at 17:04
  • @PrototypeStark RISC instruction set, real world processor that is still in use today (understanding it will be translatable into embedded systems). The full instruction set is at [wikipedia](http://en.wikipedia.org/wiki/MIPS_architecture#MIPS_assembly_language). Looking on the net, there are lots of examples and it is used in many academic classes as a target for machine language programming. There is a bit of activity on it at [SO](http://stackoverflow.com/questions/tagged/mips). –  Sep 20 '12 at 17:15
  • Lex and Yacc are outdated. Now some much more pleasant ways to parse are available. And, actually, it worth skipping the parsing altogether, it is the least important thing anyway. – SK-logic Sep 20 '12 at 18:06
  • A workaround that I found neat when learning is to create a little "loader" stub in a higher-level language like C, and then make up your own executable format, and have the stub load that. Of course that only makes sense for simple uses, but it will get you one step further. COFF, PEF, ELF and Mach-O are still a big chunk to understand, and the docs are often scattered and ... nerdy. – uliwitness Jan 29 '17 at 11:48
14

I'd actually start off with writing a compiler for Brainfuck. It's a fairly obtuse language to program in but it only has 8 instructions to implement. It's about as simple as you can possibly get and there are equivalent C instructions out there for the commands involved if you find the syntax off-putting.

Flimzy
  • 704
  • 4
  • 13
11

DIY approach for simple compiler could look like this (at least that's how my uni project looked like):

  1. Define Grammar of the language. Context-free.
  2. If your grammar isn't LL(1) yet, do it now. Note, that some rules that looked ok in plain CF grammar may turn out ugly. Perhaps your language is too complex...
  3. Write Lexer which cuts stream of text into tokens (words, numbers, literals).
  4. Write top-down recursive descent parser for your grammar, which accepts or rejects input.
  5. Add syntax tree generation into your parser.
  6. Write machine code generator from the syntax tree.
  7. Profit & Beer, alternatively you can start thinking how to do smarter parser or generate better code.

There should be plenty of literature describing each step in detail.

MaR
  • 702
  • 5
  • 8
  • The 7th point is what OP is asking about. – Florian Margaine Sep 20 '12 at 18:50
  • 12
    1-5 are irrelevant and do not deserve such a close attention. 6 is the most interesting part. Unfortunately, most of the books follow the same pattern, after the infamous dragon book, paying too much attention to parsing and leaving code transforms out of scope. – SK-logic Sep 20 '12 at 19:51
5

I wrote a toy compiler. It's simple, though it's in F#, not in C. The repo is here: https://github.com/mykolav/coollang-2020-fs
(Sorry for a shameless plug!)

From my experience:

First. Don't start with classics textbooks like the Dragon Book, or Modern Compiler Implementation. These books are great, but you'll be better prepared to read them if, in the beginning, you go with the simpler ones linked below.

  • I would start with the book Crafting Interpreters by Robert Nystrom. Don’t let the word Interpreters in the book’s title discourage you. Compilers and interpreters have a lot in common and until you get to the topics specific to interpreters, you’ll gain a ton of useful knowledge.

  • Introduction to Compilers and Language Design by Douglas Thain is really accessible, doesn't assume any preexsiting compilers knowledge, and uses C for its code samples. So you might want to check it out.

Both are freely available online.

Second. Don't try to invent your own language just yet. Go with an existing educational language instead, and focus on learning about compilers.

  • ChocoPy is very well documented, has static typing, classes, dynamic dispatch, nested functions, etc. At the same time it manages to be concise and possible to implement in a limited time frame. As ChocoPy is a Python subset, you'll get for "free" tools like code formatters, online sandboxes, and syntax highlighting working on ChocoPy programs.

  • Cool 2020. It's a small — but not too small — Scala subset with classes, inheritance, virtual dispatch, and even a very simple form of pattern matching. Cool 2020 isn't as well documented as ChocoPy, but all the other advantages apply: tools for code formatting, syntax highlighting in editors, online sandboxes, etc.

You might also find exploring my project's source code useful.
Tried to describe it in detail here: https://mykolav.github.io/coollang-2020-fs/

It compiles down to x86-64 assembly. Then invokes GNU as and ld to produce a native executable. To me personally, this is much more rewarding than emitting MIPS asm and using an emulator to run it. I also prefer emitting asm to emitting, for example, C — at the very least, it forces the developer to figure out converting expressions into asm. That really drives home the point how much work high level langs do for us.

Myk
  • 171
  • 1
  • 5
0

Something that would actually be useful is a program that takes a problem description from a very limited problem set and turns it into c or C++ code that you then can copy and use.

I’d be very interested in a compiler that takes a less horrible language and translates it into shell code (just look at if statements in .sh file… absolutely hoorendous. Or when you can or must or must not put spaces around = or ==). I’ll write one when I’m retired.

gnasher729
  • 42,090
  • 4
  • 59
  • 119