3

I'm really interested in writing my own general-purpose high-level programming language, but I'm somewhat confused.

I know that Python and Ruby were written in C, which makes me wonder: if I want to write my 'own Python', is it preferable to use a source-to-source compiler that translates all the source code of my language to C, or should I target assembly language?

The point is, I know that I need to dig into compiler design and understand the whole process: lexical analysis (turning the source into tokens), parsing, checking for syntax and semantic errors, producing an intermediate representation, and then generating output code.

However, since I'm not an expert in low-level / assembly programming, should I use a source-to-source compiler? What challenges might I face if I attempt to compile my language to assembly? What drawbacks might there be in using a source-to-source compiler? What domain-specific facets of my situation should I consider when making this decision?

durron597
  • 2
    Related: [How do I create my own programming language and a compiler for it](http://programmers.stackexchange.com/q/84278/64132) and [How is it possible to write the compiler of a programming language with that language itself](http://programmers.stackexchange.com/q/167277/64132) – Dan Pichelman May 26 '15 at 23:22
  • 6
    What you should do depends entirely on what you want to achieve, but it's probably best to start with source-to-source just because that's orders of magnitude faster and easier to do decently. Plus, a lot of the assembly-level optimizations are useful for any high-level language, so it's common to [write a new GCC frontend](http://www.tldp.org/HOWTO/GCC-Frontend-HOWTO.html) rather than a whole compiler from scratch. – Ixrec May 26 '15 at 23:25
  • 2
    The very first Haskell compiler was probably *not* written in Haskell. See http://en.wikipedia.org/wiki/Bootstrapping_%28compilers%29 – Robert Harvey May 27 '15 at 00:07
  • 1
    These days, most new language development work seems to be done using LLVM, which operates on the same principle as a new GCC front-end as @ixrec suggests, but is designed from the ground up to be more language-agnostic. – Jules May 27 '15 at 00:16
  • 2
    @Jules: LLVM appears to be awesome. Alas, it also appears to be allergic to Windows. – Robert Harvey May 27 '15 at 00:25
  • LLVM really seems very interesting. I was also checking [JavaCC](https://javacc.java.net/), which seems to be very interesting as well (And because the whole idea of writing a language that has native access to all the wonderful Java default library and that has all its portability by default, sounds awesome). – Ericson Willians May 27 '15 at 00:30
  • I'm also checking [Lex and Yacc](http://dinosaur.compilertools.net/). Although I'm interested in knowing how to write my own lexer / "tokenizer", it really seems very interesting to start playing with ([It also has a Python adaptation](http://www.dabeaz.com/ply/)). – Ericson Willians May 27 '15 at 00:47
  • 1
    @RobertHarvey I've been using LLVM under Windows (via the cygwin port) for some time now, and have had no issue with it. I also understand it can be used with Visual Studio (see http://llvm.org/docs/GettingStartedVS.html) but have never tried this. – Jules May 29 '15 at 09:01
  • @RederickDeathwill I edited the parts of this question that are opinion based, and removed the parts that are duplicate. Feel free to edit it yourself if it no longer asks the question you want to ask, but please leave the opinion based and off-topic story pieces out of it. – durron597 Jun 01 '15 at 15:58
  • @Jules: Did you have to build everything to get it to work on Windows? I don't think they provide Windows binaries, and if you're a beginner to the C++ ecospace, a complete build is a bit... onerous. – Robert Harvey Jun 01 '15 at 16:39
  • @RobertHarvey There's a build included in cygwin, maintained by the cygwin team; that's the one I've been using. I don't know anything about official builds, as I've never seen the need to use one. – Jules Jun 07 '15 at 01:31

6 Answers

13

I am going to focus on your core question, since other stuff has been answered elsewhere.

Should you target a higher level language or assembly?

Getting software done is hard. While making a new language can be pretty easy, you need to stick to simple stuff and avoid the stuff that is a pain to implement. Making your first language has the problem that you don't know what the "pain to implement" stuff is. And let's face it, you're not aiming to make a new language that can only implement console-based calculators: the interesting stuff is going to be non-trivial to implement.

So do yourself a favor and target a language that you already know. Making a new language and a functional compiler is hard enough without adding "learn assembly" to the list of tasks. By setting yourself up for success, you're more likely to have fun and learn from the endeavor.

Telastyn
  • I disagree: if he (the asker) is "really interested in writing" his "own general-purpose high-level programming language", then he would probably be interested in learning assembly, how the CPU works, and CPU-level optimizations. – ALXGTV Jul 02 '15 at 21:33
10

You should consider compiling to C (see this and that answers), or to some other language (Java, Common Lisp, Ocaml, C++, or even a mix of Javascript & C, like HOP does; notice that RefPerSys is generating C++ code, or in 2022 C code here). You could also compile to the textual representation of LLVM IR, or use the LLVM library as your backend, or (if you have GCC 5 or later) use libgccjit (to target GCC internal representations and profit from GCC optimizations). You could also choose some existing bytecode (e.g. JVM, Ocaml, Neko, Parrot, ...) and compile to it. And you might also use a JIT library like libjit, GNU lightning, asmjit, etc.
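To make "compiling to C" concrete, here is a minimal sketch of a back end that walks a toy expression tree and prints C source text. The node classes `Num` and `BinOp` and the functions `emit_expr` and `emit_program` are hypothetical names invented for this illustration; a real language would need statements, types, and a runtime on top of this.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical AST for a toy expression language (illustration only).
@dataclass
class Num:
    value: int

@dataclass
class BinOp:
    op: str          # one of "+", "-", "*"
    left: "Expr"
    right: "Expr"

Expr = Union[Num, BinOp]

def emit_expr(node: Expr) -> str:
    """Return a fully parenthesized C expression for the node."""
    if isinstance(node, Num):
        return str(node.value)
    if isinstance(node, BinOp):
        if node.op not in ("+", "-", "*"):
            raise ValueError(f"unsupported operator: {node.op!r}")
        return f"({emit_expr(node.left)} {node.op} {emit_expr(node.right)})"
    raise TypeError(f"unknown node: {node!r}")

def emit_program(node: Expr) -> str:
    """Wrap the expression in a C main() that prints its value."""
    return (
        "#include <stdio.h>\n"
        "\n"
        "int main(void) {\n"
        f"    printf(\"%d\\n\", {emit_expr(node)});\n"
        "    return 0;\n"
        "}\n"
    )

# Usage: the tree for 1 + 2 * 3 becomes a small C program as a string.
c_source = emit_program(BinOp("+", Num(1), BinOp("*", Num(2), Num(3))))
```

Writing `c_source` to a file and handing it to any C compiler completes the pipeline. Always parenthesizing sub-expressions is a deliberate shortcut: the generator never has to reason about C operator precedence.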

Lexing & parsing are not the main work of a compiler or an interpreter. They are the simple parts. A compiler is mostly transforming (very often in several passes) some internal representations (in particular Abstract Syntax Trees, but not only them) of the source code that it is compiling. An interpreter is often transforming some internal representations, then traversing others (e.g. some bytecode, or some normalized AST). Play with GCC's -fdump-tree-all option, and perhaps with MELT (a Lisp-like DSL to inspect and/or transform GCC internal representations). The semantics of your programming language matters more than syntax.
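As a rough illustration of that split, the sketch below spends a few dozen lines on a tokenizer and a recursive-descent parser, then runs one transformation pass (constant folding) over the resulting tree. The grammar, the node classes, and the names `tokenize`, `parse`, and `fold` are all assumptions made up for this example; a real compiler would chain many such passes.

```python
import re
from dataclasses import dataclass

@dataclass
class Num:
    value: int

@dataclass
class Var:
    name: str

@dataclass
class BinOp:
    op: str
    left: object
    right: object

def tokenize(text):
    """Split the source into number, identifier, and one-character symbol tokens."""
    return re.findall(r"\d+|[A-Za-z_]\w*|\S", text)

class Parser:
    """Grammar: expr := term (('+'|'-') term)* ; term := atom ('*' atom)*"""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return tok

    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            node = BinOp(op, node, self.term())
        return node

    def term(self):
        node = self.atom()
        while self.peek() == "*":
            self.take()
            node = BinOp("*", node, self.atom())
        return node

    def atom(self):
        tok = self.take()
        if tok.isdecimal():
            return Num(int(tok))
        if re.fullmatch(r"[A-Za-z_]\w*", tok):
            return Var(tok)
        if tok == "(":
            node = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {tok!r}")

def parse(text):
    parser = Parser(tokenize(text))
    node = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return node

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}

def fold(node):
    """One pass: replace operators whose operands are both constants by their value."""
    if isinstance(node, BinOp):
        left, right = fold(node.left), fold(node.right)
        if isinstance(left, Num) and isinstance(right, Num):
            return Num(OPS[node.op](left.value, right.value))
        return BinOp(node.op, left, right)
    return node
```

For example, `fold(parse("2 * (3 + 4) + x"))` yields `BinOp("+", Num(14), Var("x"))`: the constant subtree is evaluated at compile time and the variable is left for run time.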

An important part is memory management. Do you want a garbage collector (it is a core part of your semantics)? What about typing (static or dynamic) & type inference? Do you handle tail-calls? Do you want homoiconicity? metaprogramming? Do you want closures (they most often need a GC)? Consider Boehm's conservative GC, and/or read the GC handbook.
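On the tail-call question: if the target you generate code for does not guarantee tail-call elimination, one commonly described workaround is a trampoline, where a function returns a description of its next call instead of making it. The sketch below shows the idea in Python, which does not eliminate tail calls either; `trampoline` and `count_down` are names made up for this example, and the scheme assumes the final result is never itself callable.

```python
def trampoline(step):
    """Keep calling returned thunks until a non-callable result appears."""
    while callable(step):
        step = step()
    return step

def count_down(n, acc=0):
    """Sum the integers 1..n with an accumulator, in tail-call style."""
    if n == 0:
        return acc
    # Instead of calling count_down directly (which would grow the stack),
    # return a zero-argument thunk describing the next call.
    return lambda: count_down(n - 1, acc + n)

# 100000 nested calls would exceed Python's default recursion limit;
# under the trampoline the stack depth stays constant.
total = trampoline(count_down(100000))
```

The cost is an extra allocation and an indirect call per step, which is one reason the choice of target matters if your semantics promise proper tail calls.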

Bootstrapping compilers is important. See also this and the references I gave there. Read also this & that answers explaining technical & practical details (they should heal your headache about "Haskell written in Haskell", "Ocaml written in Ocaml", "MELT written in MELT", "CAIA written in CAIA", "GCC or Clang/LLVM written in C++").

Also, if you know none of them, play with Ocaml, or Common Lisp, or Haskell, or Scheme (see also SICP). Read Scott's book Programming Language Pragmatics, Queinnec's book Lisp In Small Pieces, and Pitrat's book Artificial Beings: The Conscience of a Conscious Machine (and blog here).

Be sure to release your language implementation as free software (on http://github.com/ you'll find many other language implementations, e.g. compilers or interpreters).

Basile Starynkevitch
  • This answer as a whole is excellent - but the books mentioned in the next-to-last paragraph: I have them all, have studied them all, those books are _excellent_ and that's an excellent recommendation. – davidbak Dec 01 '21 at 18:32
  • Great answer with lots of interesting links ! Thanks ! I've signed in to this community, to vote it up.. – Goodies Dec 04 '21 at 10:58
3

I did some professional work three years ago that used reflection to understand a .NET interface and provide some CIL glue to mangle it onto a base class. It was an eye-opener as to the level of additional work required.

In most software development, you focus on the success routes, and if something unexpected happens you catch an exception. I found that the success route was less than 20% of the work. The usual strategy of catching exceptions doesn't work, as the exceptions occur at run time, not when you are generating the code. Instead you have to think of and check for any possible combination that could break your code, and then either support it or fail the compilation. This is very different from application development.
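A small sketch of what "check everything, then either support it or fail the compilation" can look like. This is not the .NET/CIL code I worked on; the toy node classes, `CompileError`, `check`, `emit`, and `compile_expr` are invented for illustration, and the three checks shown are only examples of the kind of case you have to enumerate.

```python
from dataclasses import dataclass

@dataclass
class Num:
    value: int

@dataclass
class Var:
    name: str

@dataclass
class BinOp:
    op: str
    left: object
    right: object

class CompileError(Exception):
    """Raised when the input program must be rejected."""

def check(node, defined, errors):
    """Collect every problem found instead of stopping at the first one."""
    if isinstance(node, Num):
        return
    if isinstance(node, Var):
        if node.name not in defined:
            errors.append(f"undefined variable '{node.name}'")
        return
    if isinstance(node, BinOp):
        if node.op not in ("+", "-", "*", "/"):
            errors.append(f"unsupported operator '{node.op}'")
        if node.op == "/" and node.right == Num(0):
            errors.append("division by constant zero")
        check(node.left, defined, errors)
        check(node.right, defined, errors)
        return
    errors.append(f"unknown node {node!r}")

def emit(node):
    """Generate target text; only called on a tree that passed check()."""
    if isinstance(node, Num):
        return str(node.value)
    if isinstance(node, Var):
        return node.name
    return f"({emit(node.left)} {node.op} {emit(node.right)})"

def compile_expr(node, defined):
    """Reject the program at generation time rather than fail at run time."""
    errors = []
    check(node, defined, errors)
    if errors:
        raise CompileError("; ".join(errors))
    return emit(node)
```

With `defined={"x"}`, compiling `x + 1` produces `(x + 1)`, while `y / 0` is rejected with both problems reported before any code is generated.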

If you want to create a new programming language and want it to be adopted by others, your best bet is to find a problem that cannot currently be solved easily. If your language provides a clean and simple approach to a complex problem, then people with that problem have a good reason to adopt it.

Michael Shaw
2

"I'm really interested in writing my own general-purpose high-level programming language" Are you really? If you are truly interested and want to create your own language complete with a (native) compiler for the experience, you would probably be interested in reading up on assembly and how the CPU works. You can examine the assembly output of existing high-level languages to give you an idea.

A source-to-native-machine-code compiler gives you a lot more freedom to define your language's semantics. If you compile to C you are basically limited to the C way of doing things, so in the end features you may want to include, such as proper tail calls, will be out of your reach. Building a native compiler is also likely to be a more rewarding and enlightening experience, whereas when compiling to C you are basically using something somebody else built. C is also a very complicated language for something to output to (its syntax is designed to be good for humans to write, not easy for machines to generate): you have to worry about the nesting order of declarations, and about specifying enough information to allow the C compiler to optimize your code effectively (e.g. using restrict when there is no aliasing of pointers).

If you're pragmatic and want to build a language which is going to be used in the near future, and you're not as interested in building the compiler as in having a working compiler for your language, then source-to-source is the way to go. However, be aware that your language effectively becomes a preprocessor (in fact, a preprocessor is a good place to start learning about lexical analysis and parsing). Either way, if your goal is to distribute your language, try to avoid it becoming a JAP language (Just Another Programming language), that is, a language which offers nothing new and isn't a major improvement over existing languages.

ALXGTV
1

If you translate whatever you have to C, then you have a working system wherever a C compiler is available. Which is basically everywhere. You don't have to worry about different processors, different operating systems, and all the rest.

You may instead compile to C++, which today is almost as common as C, and gives you the advantage that you can hide work that you need to do in a class instead of having to repeat yourself. This is important if your language has objects that cannot be translated to C or C++ primitives.

gnasher729
  • 1
    Not always true: generated C might not be portable (e.g. to other endianness, other word-size), and most compilers to C generate non-portable code. And generated C usually depends on libraries outside of the C99 or C11 standards (e.g. POSIX, etc...) – Basile Starynkevitch May 27 '15 at 10:43
-2

If you target LLVM, you still need to write the front end that reads your language and compiles it to LLVM's intermediate representation, but the output can then be any of the targets LLVM supports, be it assembly or JavaScript.
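As a rough sketch of what the front end's output can look like, the snippet below prints textual LLVM IR for a toy integer expression, one instruction per operator. The node classes and `emit_function` are hypothetical names for this example, and the emitted text follows my understanding of the textual IR syntax; it has not been checked against any particular LLVM version, so run it through the LLVM tools before relying on it.

```python
from dataclasses import dataclass
from itertools import count

@dataclass
class Num:
    value: int

@dataclass
class BinOp:
    op: str          # one of "+", "-", "*"
    left: object
    right: object

# Source operators mapped to LLVM integer instructions.
OPCODES = {"+": "add", "-": "sub", "*": "mul"}

def emit_function(name, node):
    """Return textual IR for a function that computes the expression as an i32."""
    lines = []
    temps = count()  # numbering for fresh temporaries: %t0, %t1, ...

    def gen(n):
        # Constants are used inline; every operator gets its own temporary.
        if isinstance(n, Num):
            return str(n.value)
        left = gen(n.left)
        right = gen(n.right)
        reg = f"%t{next(temps)}"
        lines.append(f"  {reg} = {OPCODES[n.op]} i32 {left}, {right}")
        return reg

    result = gen(node)
    body = "\n".join(lines + [f"  ret i32 {result}"])
    return f"define i32 @{name}() {{\nentry:\n{body}\n}}\n"

# Usage: IR text for (1 + 2) * 3.
ir_text = emit_function("answer", BinOp("*", BinOp("+", Num(1), Num(2)), Num(3)))
```

Each temporary is assigned exactly once, which is the single-assignment form LLVM expects; everything after this point (optimization, register allocation, choice of target) is LLVM's job rather than yours.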

herby