GCC or Clang to output bytecode for a VM?

Question

Long story short, I wanted to use C as a scripting language and in a portable manner so I created a register-based, JITless VM to accomplish this.

I've formalized the VM's ISA, script file format, and calling conventions, and even created an assembler that assembles the human-readable bytecode to numeric bytecode. Now I need a compiler that can target my VM.

I have considered whether to create a GCC or an LLVM backend to accomplish this, but I've never undertaken such a project before. Should I create a GCC backend, an LLVM backend, or are there additional options I could choose?

link to vm/runtime env

Is your JITless VM available as some free software project? I could be interested by it. — Basile Starynkevitch, Feb 02 '19 at 17:51
@BasileStarynkevitch it's MIT licensed and it's free of charge. I'm not 100% sure if that's considered free software but if you're interested here's the github link -> https://github.com/assyrianic/Tagha — Nergal, Feb 03 '19 at 20:05
That link should go into the question. BTW, MIT license is an open source license. — Basile Starynkevitch, Feb 03 '19 at 20:07
Creating either a GCC or LLVM backend is a pretty big project. Why not instead write an _interpreter_ for your bytecode. Incorporate that into some real project (or two) and _use_ your language to script that project for awhile. Only then will you see what's working and what's not working with your VM and where it can be improved. Make those improvements and iterate. _Then_ you'll be ready to write a native-code-generating backend for it ... if you still need it! (It may be that for scripting purposes your interpreter is fine ...) — davidbak, Feb 28 '21 at 16:59

valiano · Accepted Answer · 2019-02-08T07:34:31.123

I would strongly advocate for LLVM.

LLVM is the future, and IMHO is preferable to GCC for new compiler implementations.
Here's why:

Modularity:
- LLVM is a modular compiler framework, much more so than GCC, which is a monolithic software project and not nearly as flexible. You can implement components and reuse them as libraries, plugins or "tools". For this reason it's fairly easy to add new tools and modules to LLVM and reuse existing functionality. For example, LLDB, the LLVM debugger, uses Clang's infrastructure to display accurate and rich C++ debugging information instead of implementing it from scratch. Recommended read: Clang tutorial part I: Introduction (Bits, Bytes, Boos).
Maintainability:
- LLVM bitcode and IR should be easier to work with than their GCC counterparts (RTL, GIMPLE and such), which are less friendly and probably have a steeper learning curve.
- GCC has been around for many years, and as such it has huge parts of legacy code which makes it more difficult to improve and less flexible.
- Most of GCC is still written in C and C-like C++, whereas LLVM is written in a modern C++ dialect.
- LLVM is in the final stages of moving its repositories from SVN into a single GitHub mono-repo (https://github.com/llvm/llvm-project). This is good news for Git fans, and GitHub's capabilities may be used to improve collaboration in the future. GCC is still currently hosted on SVN.
Adoption:
- While GCC is still going strong and has a stable developer community, the LLVM community is growing much more rapidly: GCC vs LLVM Q3 2017: Active Developer Counts.
- Most new language compilers are implemented on top of LLVM: Swift, Julia, Rust, Haskell and Kotlin, just to name a few (see LLVM on Wikipedia for more). It is also the most popular choice for implementing static analyzers, optimizers, and many other tools (check out the Projects with LLVM page). One of my favorite projects is Emscripten, which can convert LLVM bitcode compiled from any supported language into JavaScript.
- More and more companies are moving away from their legacy proprietary GCC compilers towards LLVM-based compilers.
License:
- GCC has the stricter GPLv3 license, as opposed to the permissive BSD-style license used by LLVM. Without taking a stand on which is better, the BSD-style license allows more flexibility and does not require you to disclose your source code if you choose not to. This is actually one of the main reasons big companies are moving away from GCC towards LLVM (and what got Apple to sponsor LLVM to begin with).

Plus, there are many resources, documentation and tutorials on LLVM, specifically for backend bring-up, for example: Tutorial: Creating an LLVM Backend for the Cpu0 Architecture.

For all the reasons above, I warmly recommend LLVM for backend development.
I'm not familiar with TCC mentioned above, but if you have serious plans for your framework going forward, then LLVM is probably the safer bet.

Edit:

As mentioned by Alex Reinking, LLVM is a very dynamic project and is moving forward really fast. Keeping up with the latest tree requires constant maintenance, which should be taken into consideration. LLVM is alive and kicking and things are constantly changing, so it is not unlikely for internal APIs to break between versions.

Caveat: the actual LLVM language is a moving target and the developers don't hesitate to break APIs between minor releases. Writing an LLVM backend requires constant maintenance to keep up with new versions. — Alex Reinking, Jan 14 '19 at 08:18

Basile Starynkevitch · Answer 2 · 2019-02-06T20:16:16.617

> Long story short, I wanted to use C as a scripting language and in a portable manner so I created a register-based, JITless VM to accomplish this.

Notice that compiling C code is easy. But making an optimizing compiler is hard. See this answer.

So, you probably don't care about performance (since you want to use C for scripting purposes), and I guess that you'll focus your optimization efforts inside your VM. Then coding a C compiler is in principle easy (look into tinycc, also the TinyCC wikipedia, for a good example; and also into nwcc). But there is still the issue of what undefined behavior means in your implementation.

I believe that using C as a scripting language is a mistake (but see this). You'd better use some higher-level semantics (and you might still keep a C-like syntax). In particular, I believe that a scripting language should have little undefined behavior (ideally none). In practice, you don't want to crash your VM with a faulty script.
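To make the "don't crash your VM with a faulty script" point concrete, here is a minimal sketch in C of how a VM might validate every script memory access before touching host memory. The names `vm_pool`, `vm_in_bounds`, `vm_load_u32` and `vm_store_u32` are hypothetical and are not taken from the asker's VM; the sketch assumes all script memory lives in one contiguous pool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical script memory: one contiguous pool owned by the VM. */
typedef struct {
    uint8_t *base;
    size_t   size;
} vm_pool;

/* True only if [offset, offset + len) lies entirely inside the pool.
 * Written so that the check itself cannot overflow. */
bool vm_in_bounds(const vm_pool *pool, size_t offset, size_t len)
{
    assert(pool != NULL);
    return offset <= pool->size && len <= pool->size - offset;
}

/* Load a 32-bit value; on a bad address, report failure instead of reading. */
bool vm_load_u32(const vm_pool *pool, size_t offset, uint32_t *out)
{
    if (!vm_in_bounds(pool, offset, sizeof *out))
        return false;  /* the caller turns this into a script-level error */
    /* memcpy sidesteps misaligned-access undefined behavior in the host. */
    memcpy(out, pool->base + offset, sizeof *out);
    return true;
}

/* Store a 32-bit value with the same bounds check. */
bool vm_store_u32(vm_pool *pool, size_t offset, uint32_t value)
{
    if (!vm_in_bounds(pool, offset, sizeof value))
        return false;
    memcpy(pool->base + offset, &value, sizeof value);
    return true;
}
```

With this shape, a script's out-of-range pointer becomes a reported failure that the interpreter loop can turn into an error, rather than undefined behavior in the host process. It only covers bad addresses; other sources of undefined behavior (signed overflow, shifts, and so on) would need their own defined semantics.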

You don't need LLVM and you don't need GCC for your particular purpose (both are overkill). You do need some C standard library. Parsing C (or some reasonable subset of it) is quite simple, and writing a naïve C compiler is a simple exercise (what is difficult is making an optimizing compiler, and you probably don't need any).
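As a rough illustration of how small a naive, non-optimizing compiler can be, here is a sketch in C of a recursive-descent compiler for integer expressions with `+`, `*` and parentheses. It targets a made-up three-address register ISA (`OP_LOADI`, `OP_ADD`, `OP_MUL`); that ISA, the `compiler` struct and the `compile` and `run` functions are all assumptions for illustration, not the asker's bytecode. Register allocation is deliberately naive: every instruction writes a fresh register.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical three-address ISA, for illustration only. */
typedef enum { OP_LOADI, OP_ADD, OP_MUL } opcode;
typedef struct { opcode op; int dst, a, b; } insn;  /* OP_LOADI: a is the immediate */

enum { MAX_CODE = 64 };

typedef struct {
    const char *src;      /* remaining input */
    insn code[MAX_CODE];  /* emitted instructions */
    size_t count;
    int error;
} compiler;

/* Append one instruction. Each instruction writes a fresh register,
 * so the destination register number equals the instruction index. */
static int emit(compiler *c, opcode op, int a, int b)
{
    if (c->count == MAX_CODE) { c->error = 1; return 0; }
    int dst = (int)c->count;
    c->code[c->count++] = (insn){ op, dst, a, b };
    return dst;
}

static void skip_ws(compiler *c)
{
    while (isspace((unsigned char)*c->src)) c->src++;
}

static int parse_expr(compiler *c);

/* primary := NUMBER | '(' expr ')' */
static int parse_primary(compiler *c)
{
    skip_ws(c);
    if (*c->src == '(') {
        c->src++;
        int r = parse_expr(c);
        skip_ws(c);
        if (*c->src == ')') c->src++; else c->error = 1;
        return r;
    }
    if (!isdigit((unsigned char)*c->src)) { c->error = 1; return 0; }
    int value = 0;
    while (isdigit((unsigned char)*c->src)) {
        int digit = *c->src++ - '0';
        if (value > (INT_MAX - digit) / 10) { c->error = 1; return 0; }
        value = value * 10 + digit;
    }
    return emit(c, OP_LOADI, value, 0);
}

/* term := primary ('*' primary)* */
static int parse_term(compiler *c)
{
    int lhs = parse_primary(c);
    for (skip_ws(c); *c->src == '*'; skip_ws(c)) {
        c->src++;
        int rhs = parse_primary(c);
        lhs = emit(c, OP_MUL, lhs, rhs);
    }
    return lhs;
}

/* expr := term ('+' term)* */
static int parse_expr(compiler *c)
{
    int lhs = parse_term(c);
    for (skip_ws(c); *c->src == '+'; skip_ws(c)) {
        c->src++;
        int rhs = parse_term(c);
        lhs = emit(c, OP_ADD, lhs, rhs);
    }
    return lhs;
}

/* Compile src; returns the register holding the result, or -1 on error. */
int compile(compiler *c, const char *src)
{
    c->src = src; c->count = 0; c->error = 0;
    int r = parse_expr(c);
    skip_ws(c);
    if (*c->src != '\0') c->error = 1;
    return c->error ? -1 : r;
}

/* Reference interpreter, used only to check the emitted code.
 * Registers are unsigned so that arithmetic wraps instead of overflowing. */
uint32_t run(const insn *code, size_t count, int result_reg)
{
    assert(count <= MAX_CODE && result_reg >= 0 && (size_t)result_reg < count);
    uint32_t regs[MAX_CODE] = {0};
    for (size_t i = 0; i < count; i++) {
        const insn *in = &code[i];
        switch (in->op) {
        case OP_LOADI: regs[in->dst] = (uint32_t)in->a; break;
        case OP_ADD:   regs[in->dst] = regs[in->a] + regs[in->b]; break;
        case OP_MUL:   regs[in->dst] = regs[in->a] * regs[in->b]; break;
        }
    }
    return regs[result_reg];
}
```

Calling `compile(&c, "2 + 3 * (4 + 1)")` fills `c.code` with four loads, an add, a multiply and a final add, and `run` on that code yields 17. A real C front end obviously needs declarations, types, control flow and a preprocessor on top of this, but the structure (parse, emit directly, no optimization passes) stays the same.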

You could however use libgccjit or LLVM (or asmjit or GNU lightning) in your VM implementation for JIT-ing purposes (but that is not the question you are asking).

BTW, if your VM has the same word size, alignment constraints, and endianness as your native machine (or as some target machine supported by an existing GCC cross-compiler), you might simply prototype your C compiler as a GCC plugin working on Gimple. In practice, most of Gimple is quite stable (perhaps more stable than the internals of LLVM).

Thank you Basile. I should say that I don't have any working nor official C front end that targets my VM. As for undefined behavior, I've been doing as much as possible to trap those such as what I've done with bad pointers by using a memory pool to bounds check. I understand that there's more undefined behavior than just bad pointers but I believe those can be handled confidently in the future. — Nergal, Feb 04 '19 at 22:21

aerohammer · Answer 3 · 2019-01-15T20:27:41.943

Between GCC and LLVM, I'd suggest looking at LLVM first. GCC is apparently known to be cumbersome to develop, whereas some of LLVM's code was specifically written in reaction to that very same cumbersomeness.

That having been said, what I suggest is to write a backend for TCC, the Tiny C Compiler. It does originate from an obfuscated C contest entry, but that was in 2001, and improving the clarity of the codebase has been an ongoing project ever since. TCC itself doesn't have very many optimizations, but it does optimize some (or all?) constant expressions away.

TCC doesn't have a high-activity mailing list, but it is somewhat active, is actively (albeit slowly) maintained and used in several projects (there is a Python binding, for example), supports most of C99, and most importantly of all, is fairly simple. It also has some level of support for several targets, so you can get an idea of what's involved from browsing the files used for them.

edited Jan 15 '19 at 20:27

answered Jan 13 '19 at 03:28

aerohammer


An interesting answer to suggest a backend for TCC. I should say that my VM also does closely follow the C abstract machine while also being x86-like so perhaps you might be correct on that. Thing is, how widespread is TCC's use? As far as I know, GCC and Clang/LLVM are widely used and do powerful optimizations. – Nergal Jan 13 '19 at 03:53
TCC probably wouldn't qualify as widely-used, though it's actually hard to be certain. You'd need to go looking for all of the programs that use it somehow, which is complicated by its dynamic library version. The thing is that it should be much easier to add your backend to (or rip the parser out of) due to simplicity... and using it as a library is also common, which I imagine would be useful to you. – aerohammer Jan 13 '19 at 05:30
Also, while GCC and LLVM are widely used, writing new backends isn't common. LLVM "assembly" isn't that hard to work through, but you'd be stepping a bit back from C, and I expect that there are complications that I just never found due to never looking past the intermediate code. – aerohammer Jan 13 '19 at 05:33

-1. GCC's codebase has some issues, but it's far from being deliberately _obfuscated_. It's got more backend hardware support than LLVM at the moment, so it's not like it's impossible to write new backends. – Alex Reinking Jan 14 '19 at 08:16
Revised. Regardless, I have actually read the obfuscation charge somewhere. As I recall, the accusation said it was to restrict commercial operations from clean-rooming their way to a proprietary reimplementation in the 80s or early 90s. Certainly could have been a misunderstanding of the code, though. – aerohammer Jan 15 '19 at 20:31
GCC is not obfuscated code. But some of it is legacy. And GCC, even today, optimizes *slightly* better than LLVM in some cases. OP probably doesn't care about optimization (which is the hard part of any C compiler). – Basile Starynkevitch Feb 02 '19 at 17:56
@Basile Starynkevitch: I edited that last month. – aerohammer Feb 02 '19 at 22:28
Actually, I DO care about optimization but I'm looking towards the compiler creating optimized bytecode or rather optimized for size. – Nergal Feb 05 '19 at 06:03

score 0 · Answer 4 · answered Feb 28 '21 at 13:35

The easiest thing would be to write a webassembly compiler for your VM. If using your VM isn't necessary, you can use an embeddable webassembly runtime.
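WebAssembly is a stack machine while the asker's VM is register-based, so the core of such a translator is mapping operand-stack slots to registers. Below is a minimal sketch of that idea in C. It uses a made-up, simplified stack instruction set (`S_CONST`, `S_ADD`, `S_MUL`) rather than the real WebAssembly binary encoding, and a made-up register ISA (`R_LOADI`, `R_ADD`, `R_MUL`); all of these names, and the `translate` function itself, are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stack-machine input (a stand-in for WebAssembly opcodes). */
typedef enum { S_CONST, S_ADD, S_MUL } stack_op;
typedef struct { stack_op op; int imm; } stack_insn;

/* Hypothetical register-machine output. R_LOADI: a is the immediate. */
typedef enum { R_LOADI, R_ADD, R_MUL } reg_op;
typedef struct { reg_op op; int dst, a, b; } reg_insn;

/* Translate stack code to register code by tracking the stack depth:
 * the value in stack slot i always lives in register i.
 * Returns the number of instructions written, or -1 on malformed input
 * (stack underflow, too many live values, or output buffer too small). */
int translate(const stack_insn *in, size_t n,
              reg_insn *out, size_t out_cap, int max_regs)
{
    assert(in != NULL && out != NULL);
    int depth = 0;     /* current operand-stack depth */
    size_t count = 0;  /* instructions emitted so far */
    for (size_t i = 0; i < n; i++) {
        if (count == out_cap) return -1;
        switch (in[i].op) {
        case S_CONST:  /* push: the new top of stack is register `depth` */
            if (depth == max_regs) return -1;
            out[count++] = (reg_insn){ R_LOADI, depth, in[i].imm, 0 };
            depth++;
            break;
        case S_ADD:
        case S_MUL:    /* pop two, push one: result replaces the lower slot */
            if (depth < 2) return -1;
            out[count++] = (reg_insn){ in[i].op == S_ADD ? R_ADD : R_MUL,
                                       depth - 2, depth - 2, depth - 1 };
            depth--;
            break;
        default:
            return -1;
        }
    }
    return (int)count;
}
```

For the stack program `const 2, const 3, const 4, mul, add`, this emits three loads into r0, r1 and r2, then `mul r1, r1, r2` and `add r0, r0, r1`. Real WebAssembly adds locals, structured control flow and typed values, but validated WebAssembly code has a statically known stack depth at every instruction, which is what makes this slot-to-register mapping workable.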

Mattice Itamanyo


This is a pretty good suggestion! – davidbak Feb 28 '21 at 17:00
