135

If something can be generated, then that thing is data, not code.

Given that, isn't this whole idea of source code generation a misunderstanding? That is, if there is a code generator for something, then why not make that something a proper function which can receive the required parameters and do the right action that the "would-be generated" code would have done?

If it is being done for performance reasons, then that sounds like a shortcoming of the compiler.

If it is being done to bridge two languages, then that sounds like a lack of interface library.

Am I missing something here?

I know that code is data as well. What I don't understand is, why generate source code? Why not make it into a function which can accept parameters and act on them?

Utku
  • 14
    A term associated with the generation of code is [metaprogramming](https://en.wikipedia.org/wiki/Metaprogramming) – UselesssCat Dec 01 '17 at 15:23
  • 5
    https://en.wikipedia.org/wiki/Code_as_data , Lisp, FP, scripting, metaprogramming, Von Neumann/modified Harvard architecture etc. It's been covered *ad nauseam*. tl;dr the distinction "source code" vs "output code", "code" vs "data" etc. are meant to *simplify* things. They should never be *dogmatic*. –  Dec 01 '17 at 16:44
  • Can you explain why you think "If it's being done for performance reasons, then that sounds like a shortcoming of the compiler." I don't feel it sounds that way at all, so I'd like to hear more about what you're saying. – Cort Ammon Dec 01 '17 at 21:19
  • @CortAmmon The compiler's duty is to take code written in human-readable form and convert it to machine-readable form. Hence, if the compiler cannot create code that is efficient, then the compiler is not doing its job properly. Is that wrong? – Utku Dec 01 '17 at 21:33
  • 13
    @Utku, the better reasons to do code generation often relate to *wanting to provide a higher-level description than your current language can express*. Whether the compiler can or can't create efficient code doesn't really have anything to do with it. Consider parser generators -- a lexer generated by `flex` or a parser generated by `bison` will almost certainly be more predictable, more correct, and often faster to execute than equivalents hand-written in C; and built from far less code (thus also being less work to maintain). – Charles Duffy Dec 02 '17 at 01:04
  • ...whether your high-level language is transformed into a low-level language and then into an IL in your compiler and then from there into assembler and from there to machine language, or goes via pretty much the same pipeline but without the lower-level language (HLL -> IL -> assembly -> opcodes)... why is this a difference that even *matters to you*? You still have the same end result of high-level-language going in -> machine code coming out. – Charles Duffy Dec 02 '17 at 01:11
  • I think the "opposite" of data is not code, but *process*. Code fed to a compiler is data, and the compilation is the process, even though the compiler itself is also data/code being fed to a process (the computer). – Dave Cousineau Dec 02 '17 at 02:52
  • I've used source code generation to handle the communication interface between multiple systems so that I didn't have to write it in two places: I wrote one set that output the code in both systems' languages. I found it kept the two systems from accidentally getting out of sync. – vextorspace Dec 03 '17 at 01:55
  • 1
    Maybe you come from a language which doesn't have many functional elements, but in many languages functions are first class -- you can pass them around, so in those types of languages code is data, and you can treat it just like that. – Restioson Dec 03 '17 at 09:04
  • 1
    @Restioson in a functional language code isn't data. First-class functions mean exactly that: functions are data. And not necessarily particularly good data: you can't necessarily mutate them just a bit (like mutate all additions within the functions into subtractions, say). Code is data in homoiconic languages. (Most homoiconic languages have first-class functions, but the reverse is not true.) – Frames Catherine White Dec 04 '17 at 07:34
  • In some languages, code generation is built in: C and C++ have the preprocessor, there are various CSS preprocessors, TypeScript can be transpiled to JavaScript. – Robert Apr 23 '20 at 15:04

27 Answers

149

> Is source code generation an anti pattern?

Technically, if we generate code, it is not source even if it is text that is readable by humans. Source Code is original code, generated by a human or other true intelligence, not mechanically translated and not immediately reproducible from (true) source (directly or indirectly).

> If something can be generated, then that thing is data, not code.

I would say everything is data anyway. Even source code. Especially source code! Source code is just data in a language designed to accomplish programming tasks. This data is to be translated, interpreted, compiled, generated as needed into other forms — of data — some of which happen to be executable.

The processor executes instructions out of memory. The same memory that is used for data. Before the processor executes instructions, the program is loaded into memory as data.

So, everything is data, even code.
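
A small illustration of that point, as a sketch in Python (chosen here only for brevity; the names `source`, `square`, and `<generated>` are made up for the example). The same program exists first as a string, then as a code object, and finally as bytes of bytecode, and each form is ordinary data until something executes it.

```python
# Source code held as an ordinary string: at this point it is just data.
source = "def square(x):\n    return x * x\n"

# compile() turns the text into a code object, which is also just data.
code_obj = compile(source, "<generated>", "exec")

# exec() runs that data, defining `square` in the given namespace.
namespace = {}
exec(code_obj, namespace)
square = namespace["square"]

# The function's bytecode is exposed as a plain bytes value.
bytecode = square.__code__.co_code

print(type(source).__name__, type(bytecode).__name__, square(7))  # str bytes 49
```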

> Given that [generated code is data], isn't this whole idea of code generation a misunderstanding?

It is perfectly fine to have multiple steps in compilation, one of which can be intermediate code generation as text.

> That is, if there is a code generator for something, then why not make that something a proper function which can receive the required parameters and do the right action that the "would-be generated" code would have done?

That's one way, but there are others.


> The output of code generation is text, which is something designed to be used by a human.

Not all text forms are intended for human consumption. In particular, generated code (as text) is typically intended for compiler consumption not human consumption.


Source code is considered the original: the master, what we edit and develop, and what we archive using source code control. Generated code, even when human-readable text, is typically regenerated from the original source code. Generated code, generally speaking, doesn't have to be under source control since it is regenerated during build.

Erik Eidt
76

Practical reasoning

> OK, I know that code is data as well. What I don't understand is, why generate source code?

From this edit, I assume you are asking on a rather practical level, not theoretical Computer Science.

The classical reason for generating source code in static languages like Java was that such languages simply did not come with easy-to-use in-language tools for very dynamic work. For example, back in the formative days of Java, it simply was not possible to easily create a class with a dynamic name (matching a table name from a DB) and dynamic methods (matching attributes from that table) with dynamic data types (matching the types of said attributes). This is especially so since Java puts a great deal of importance, nay, guarantees, on being able to catch type errors at compile time.

So, in such a setting, a programmer can only create Java code and write a lot of lines of code manually. Often, the programmer will find that whenever a table changes, he has to go back and change the code to match; and if he forgets that, bad things happen. Hence, the programmer will get to the point where he writes some tools that do it for him. And so begins the road to ever more intelligent code generation.

(Yes, you could generate the bytecode on the fly, but programming such a thing in Java would not be something a random programmer would do just in between writing a few lines of domain code.)
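
To make the table-to-class idea concrete, here is a minimal sketch of such a tool, written in Python rather than Java for brevity. The schema dictionary, the `generate_class_source` helper, and the naming rule are all assumptions invented for this example, not any particular tool's behavior.

```python
# Hypothetical schema: table name -> list of (column name, Python type name).
SCHEMA = {"customer": [("id", "int"), ("name", "str"), ("balance", "float")]}


def generate_class_source(table, columns):
    """Return Python source text for a typed record class matching one table."""
    # Assumed naming rule: "order_item" becomes "OrderItem".
    class_name = table.title().replace("_", "")
    lines = [
        "# GENERATED FILE - do not edit; regenerate from the schema instead.",
        "from dataclasses import dataclass",
        "",
        "",
        "@dataclass",
        f"class {class_name}:",
    ]
    lines += [f"    {name}: {type_name}" for name, type_name in columns]
    return "\n".join(lines) + "\n"


source = generate_class_source("customer", SCHEMA["customer"])
print(source)
```

When the table changes, only the schema entry changes; the class text is regenerated instead of being edited by hand.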

Compare this to languages that are very dynamic, for example Ruby, which I would consider the antithesis to Java in most respects (note that I am saying this without valuing either approach; they are simply different). Here it is 100% normal and standard to dynamically generate classes, methods etc. at runtime, and most importantly, the programmer can do it trivially right in the code, without going on a "meta" level. Yes, things like Ruby on Rails come with code generation, but we found in our work that we basically use that as a kind of advanced "tutorial mode" for new programmers; after a while it becomes superfluous (there is so little code to write in that ecosystem that, once you know what you are doing, writing it manually is faster than cleaning up the generated code).
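
For contrast, here is roughly the same thing done at runtime in a dynamic language. This sketch uses Python instead of Ruby, and `make_record_class` is a hypothetical helper; it relies only on the built-in three-argument form of `type()`, which creates a class from a name, a tuple of bases, and a dictionary of attributes.

```python
def make_record_class(table, columns):
    """Build a class at runtime; no source text is ever written."""

    def __init__(self, **values):
        unknown = set(values) - set(columns)
        if unknown:
            raise TypeError(f"unknown columns: {sorted(unknown)}")
        # Columns that were not supplied default to None.
        for column in columns:
            setattr(self, column, values.get(column))

    def __repr__(self):
        fields = ", ".join(f"{c}={getattr(self, c)!r}" for c in columns)
        return f"{type(self).__name__}({fields})"

    class_name = table.title().replace("_", "")
    return type(class_name, (object,), {"__init__": __init__, "__repr__": __repr__})


Customer = make_record_class("customer", ["id", "name"])
print(Customer(id=1, name="Ann"))  # Customer(id=1, name='Ann')
```

The trade-off is the one described above: nothing here can be checked before the program runs, which is exactly what a statically typed language wants to avoid.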

These are just two practical examples from the "real world". Then you have languages like LISP where the code is data, literally. On the other hand, in compiled languages (without a runtime engine like Java or Ruby), there is (or was, I have not kept up with modern C++ features...) simply no concept of defining class or method names at runtime, so code generation during the build process is the tool of choice for most things (other more C/C++ specific examples would be things like flex, yacc etc.).

AnoE
  • 1
    I think this is better than the more up-voted answers. In particular, the example mentioned with Java and database programming does a much better job of actually addressing why code generation is used and is a valid tool. – Panzercrisis Nov 29 '17 at 13:57
  • These days, is it possible in Java to create dynamic tables from a DB? Or only by using an ORM? – Noumenon Dec 01 '17 at 05:18
  • "(or was, I have not kept up with modern C++ features...)" surely this has been possible in C++ for over two decades thanks to function pointers? I haven't tested it but I'm sure it should be possible to allocate a char array, fill it with machine code and then cast a pointer to the first element to a function pointer and then run it? (Assuming the target platform doesn't have some security measure to stop you doing that, which it might well do.) – Pharap Dec 02 '17 at 19:43
  • As soon as the question includes the word "pattern", you know it's not theoretical computer science! – David Richerby Dec 03 '17 at 00:27
  • @Pharap yes but that's back in the realm of "would not be something a random programmer would do just in between writing a few lines of domain code" – Caleth Aug 01 '18 at 15:35
  • @Caleth I don't understand your comment. Could you clarify what you mean? – Pharap Aug 02 '18 at 11:20
  • 1
    "allocate a char array, fill it with machine code and then cast a pointer to the first element to a function pointer and then run it?" Apart from being undefined behaviour, it's the C++ equivalent of "generate the bytecode on the fly". It falls into the same category of "not considered by ordinary programmers" – Caleth Aug 02 '18 at 11:22
  • 1
    @Pharap, "surely this has been possible in C++ for over two decades" ... I had to chuckle a little bit; it is about 2ish decades since I last coded C++. :) But my sentence about C++ was formulated badly anyways. I have changed it a bit, it should be clearer what I meant, now. – AnoE Aug 02 '18 at 11:28
46

> why generate code?

Because programming with punch cards (or alt codes in notepad) is a pain.

> If it is being done for performance reasons, then that sounds like a shortcoming of the compiler.

True. I don't care about performance unless I'm forced to.

> If it is being done to bridge two languages, then that sounds like a lack of interface library.

Hmm, no idea what you're talking about.

Look, it's like this: generated and retained "source" code is always and forever a pain in the butt. It exists for one reason only. Someone wants to work in one language while someone else insists on working in another, and neither one can be bothered to figure out how to interoperate between them, so one of them figures out how to turn their favorite language into the imposed language so they can do what they want.

Which is fine until I have to maintain it. At which point you can all go die.

Is it an anti pattern? Sigh, no. Many languages wouldn't even exist if we weren't willing to say goodbye to the shortcomings of previous languages and generating the code of the older languages is how many new languages start.

It's a code base that is left in a half converted Frankenstein monster patchwork that I can't stand. Generated code is untouchable code. I hate looking at untouchable code. Yet people keep checking it in. WHY? You might as well be checking in the executable.

Well now I'm ranting. My point is we're all "generating code". It's when you treat generated code like source code that you're making me crazy. Just cause it looks like source code doesn't make it source code.

candied_orange
  • By the way, by _why generate code_, I meant _why generate source code_. I will clarify this in the question. – Utku Nov 29 '17 at 05:00
  • 41
    If you generate it, it's not SOURCE code. It's intermediate code. I'm going to go cry now. – candied_orange Nov 29 '17 at 05:02
  • 1
    I know. I mean _generating textual code_. That is, code that is not meant to be consumed by a machine, but meant to be consumed by a human. – Utku Nov 29 '17 at 05:03
  • 66
    ARG!!! It doesn't matter what it looks like!!! Text, binary, DNA, if it's not the SOURCE it's not what you should touch when making changes. It's no one's business if my compilation process has 42 intermediate languages that it goes through. Stop touching them. Stop checking them in. Make your changes at the source. – candied_orange Nov 29 '17 at 05:07
  • 24
    XML is text and it plainly isn't meant for human consumption. :-) – Nick Keighley Nov 29 '17 at 09:25
  • 38
    @utku: "If something is not meant to be consumed by a human, it shouldn't be text": I completely disagree. Some counter-examples off the top of my head: the HTTP protocol, MIME encodings, PEM files -- pretty much anything that uses base64 anywhere. There are lots of reasons to encode data into a 7-bit safe stream even if no human should ever see it. Not to mention the much larger space of things that *normally* a human should never interact with, but that they may want to occasionally: log files, `/etc/` files on Unix, etc. – Daniel Pryden Nov 29 '17 at 14:03
  • 8
    @CandiedOrange: This answer is spot on. I was going to write an answer which also addressed the debugging aspect (if you need a human-readable form of the intermediate language on a regular basis, it means your debugging tools in your source language aren't good enough), but I realized I was just going into rant mode myself. When writing C code, 99.999% of the time you don't need to see the generated assembly, and exactly 0% of the time should you ever *check in* that generated assembly code into source control! As usual, "modern" toolchains have a lot to learn from toolchains of decades ago. – Daniel Pryden Nov 29 '17 at 14:10
  • 3
    @CandiedOrange You’d probably positively *hate* our code generator that generates boilerplate and repetitive parts of the source code; then real people insert “real” code in between (of course the generator knows how not to lose that manual code). Still, it’s an incredible productivity boost and a joy to work with. My point being: Things aren’t necessarily as black and white as your answer makes them appear. – besc Nov 29 '17 at 17:35
  • 12
    I don't think "programming with punch cards" means what you think it means. I've been there, I've done that, and yeah, it *was* a pain; but it has no connection to "generated code." A deck of punched cards is just another kind of _file_--like a file on disk, or a file on tape, or a file on an SD card. Back in the day, we would write data to decks of cards, and read data from them. So, if the reason we generate code is because programming with punch cards is a pain, then that implies that programming with _any kind of data storage_ is a pain. – Solomon Slow Nov 29 '17 at 17:46
  • 1
    One possible reason for checking in generated code into your source control system is so you can see the differences between different versions. Makes it easier to find out what changed. – Paŭlo Ebermann Nov 29 '17 at 23:07
  • @Utku - If it feels wrong or inefficient or unnecessary that intermediary code is human readable, that is a different problem, but that doesn't make it an anti-pattern. – OCDev Nov 29 '17 at 23:23
  • 1
    @Utku - Is source code generation an anti-pattern? Yes. Generated code should never become the source, or else the generated source will get wiped out every time it is generated. There are limited exceptions. – OCDev Nov 29 '17 at 23:26
  • @Utku - Is human-readable code generation an anti-pattern? No. Transpiling into another language from a superior language does not have this inherent problem. As long as generated code never actually becomes source. – OCDev Nov 29 '17 at 23:26
  • 1
    @Utku - It is only an anti-pattern if the solution interferes with itself. – OCDev Nov 29 '17 at 23:26
  • @Utku - It may "feel" mis-purposed, but human-readable intermediary code is technically not an anti-pattern. There's not an actual technical problem happening just because it's human-readable. How easy it is to read symbols by humans is really a human bias, not a problem in the operational domain of the system itself. – OCDev Nov 29 '17 at 23:36
  • 2
    @DanielPryden: Any time you care about performance, it's not a bad idea to have a look at what the compiler did with your code in hot loops / functions, especially if you do know how to read asm (generally easier than *writing* asm). See Matt Godbolt's CppCon2017 talk: [“What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”](https://youtu.be/bSkpMdDe4g4). It's easy to write C that can't compile as well as you hoped, and small source changes (like changing the type of a loop counter, or using a different but equivalent expression) can let or help the compiler do a better job. – Peter Cordes Nov 30 '17 at 04:07
  • 6
    @Utku: gcc compiles C (and other languages) to actual asm text which it feeds to the assembler. This helped make gcc portable to platforms that already had a native Unix assembler which knew how to compile asm text into binary object files in platform-specific object file formats. Gcc still works this way, even though binutils (containing the GNU assembler) has been ported to pretty much every platform, too. Does anyone care that gcc goes through an ASCII representation of the code on the way to machine code, even though some other compilers emit machine code directly? Pretty much nobody. – Peter Cordes Nov 30 '17 at 04:16
  • 3
    @Utku Generating *textual* code has a big virtue over generating binary code: Textual code can be debugged by just looking at it. If you write code in one language, and cannot inspect the result of the translation into another language, you are reduced to pure behavioral analysis. This is especially a problem if the translation process itself may still contain bugs, which is virtually always the case unless you use such a well-known tool as `gcc`. Apart from that, CandiedOrange is perfectly right: *Never check in generated code, and never dare to modify it. It's not source code!* – cmaster - reinstate monica Nov 30 '17 at 13:06
  • @PeterCordes: I agree, performance and obscure hardware issues (which are really two sides of the same coin, IMO) are pretty much the canonical reasons why you would want to look at (but not modify!) the lower-level language code. That said, I still maintain that reading the assembly code should be something you do rarely: first off, you should measure your code and determine that there *is* a performance problem (too many people skip this step!) and even then you should look at algorithmic improvements (in the higher-level language) before dropping down into compiler-level optimizations. – Daniel Pryden Nov 30 '17 at 13:14
  • 3
    @besc: IMO, generating code "templates" for humans to fill in *is* an anti-pattern. Much better is to invert this, and let humans write "templates" for machines to fill in the blanks: for example, I think [Google's AutoValue project for Java](https://docs.google.com/presentation/d/14u_h-lMn7f1rXE1nDiLX0azS3IkgjGl5uxp5jGJ75RE/edit) is the right way to do boilerplate-reducing code generation. – Daniel Pryden Nov 30 '17 at 13:21
  • 1
    @jameslarge It is. –  Nov 30 '17 at 18:18
  • @DanielPryden: Right, step 1 is finding the hot-spots by profiling. Step 2 is checking if the compiler did anything silly that you can hand-hold it into doing better with minor source changes. But during the edit/compile cycle of step 2 ([if you know what you're doing with static analysis of asm](https://stackoverflow.com/questions/40354978/why-is-this-c-code-faster-than-my-hand-written-assembly-for-testing-the-collat)) you can often just look at the asm to see if you got the compiler to do what you wanted or not. Step 3: Sometimes you need to manually vectorize (with C++ SIMD intrinsics). – Peter Cordes Nov 30 '17 at 18:23
  • 1
    @PeterCordes: In retrospect, I'm sorry I used C and assembly as the example in my comment: I didn't mean that those are the only two languages where this concept applies, and in fact when you're close enough to the metal to care about assembly then you're really in a completely different world than the world that I understand the original question to actually be about. If I'm writing code in TypeScript and it compiles into JavaScript to run on web browsers, then I'm dealing with code generation, but I'm probably roughly never interested in what's happening at the machine language level. – Daniel Pryden Nov 30 '17 at 18:29
  • 2
    @nocomprende, What? Programming is a pain? I guess that means it's your day job, eh? Mine too. They don't pay us to do the easy ****. – Solomon Slow Nov 30 '17 at 18:32
  • 1
    @DanielPryden Mostly agreed. What we do is model the system as far as possible in a higher-level language. The usual problem is a bunch of wild, convoluted special cases baked into the spec. Getting the customer to change the spec only works sometimes. When we went looking for a language to model those specials efficiently we realized we already had one that fit the bill perfectly: C++, our implementation language. In a way it becomes a part of the higher-level model. Integrating it into the generated C++ code was the pragmatic solution to get full IDE support without lots of hassle. – besc Nov 30 '17 at 18:46
  • 2
    "I don't care about performance unless I'm forced to" seems to be the reason why an 8-core-CPU powered 16 GiB DDR4 computer on top of an SSD running Windows 10 feels just as slow as an 800 MHz Intel Celeron powered 64 MiB SDRAM computer on top of an HDD running Windows XP. From just that phrase, it seems like you employ premature pessimization. – phresnel Dec 01 '17 at 13:49
  • @DanielPryden: "for example, I think Google's AutoValue project for Java is the right way to do boilerplate-reducing code generation" - so, on slide 23 ("What you write"), I still have to write `public abstract class Foo` by hand because having something generate that for me (based upon some arbitrary project-specific data source that already has the information that I need a `Foo` class) is an anti-pattern? And I have to stop using project or file templates because they, too, are basically a (slightly parametrized) form of code skeleton generation? – O. R. Mapper Dec 04 '17 at 11:35
  • @ORMapper: Project or file templates are IMO a bit of a code smell, mainly because of how many times I've seen obvious boilerplate from templates (e.g. `Description of this class` comments) checked into production codebases. And yes, I think if you want a type `Foo` in your project, you should write a skeleton of type `Foo`: otherwise, where would you put your API documentation? (You *do* write Javadoc for all your value type objects, right?) – Daniel Pryden Dec 04 '17 at 14:58
45

> why generate source code

The most frequent use case for code generators I had to work with in my career was generators which

  • took some high level meta-description for some kind of data model or database schema as input (maybe a relational schema, or some kind of XML schema)

  • and produced boiler-plate CRUD code for data access classes as output, and maybe additional things like corresponding SQLs or documentation.

The benefit here is that from one line of a short input specification you get 5 to 10 lines of debuggable, type-safe, bug-free (assuming the code generator's output is mature) code that you would otherwise have to implement and maintain manually. You can imagine how much this reduces the effort of maintaining and evolving the code.
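
As a rough illustration of that ratio, here is a minimal Python sketch that expands a one-line specification into four SQL statements. The spec format (`table: col, col, ...`), the `generate_crud_sql` helper, and the assumption that the first column is the primary key are all invented for this example.

```python
def generate_crud_sql(spec):
    """Expand a one-line spec such as 'customer: id, name, email' into CRUD SQL.

    The spec format is invented for this sketch; the first column is
    assumed to be the primary key.
    """
    table, _, column_part = spec.partition(":")
    table = table.strip()
    columns = [c.strip() for c in column_part.split(",")]
    key, others = columns[0], columns[1:]
    placeholders = ", ".join("?" for _ in columns)
    assignments = ", ".join(f"{c} = ?" for c in others)
    return {
        "create": f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})",
        "read": f"SELECT {', '.join(columns)} FROM {table} WHERE {key} = ?",
        "update": f"UPDATE {table} SET {assignments} WHERE {key} = ?",
        "delete": f"DELETE FROM {table} WHERE {key} = ?",
    }


for name, statement in generate_crud_sql("customer: id, name, email").items():
    print(f"{name}: {statement}")
```

A real generator would also emit the typed data access classes around these statements; the point of the sketch is only that adding a column means editing one line of input.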

Let me also respond to your initial question

> Is source code generation an anti pattern

No, not source code generation per se, but there are indeed some pitfalls. As stated in The Pragmatic Programmer, one should avoid using a code generator when it produces code which is hard to understand. Otherwise, the increased effort to use or debug this code may easily outweigh the effort saved by not writing the code manually.

I would also like to add that it is usually a good idea to physically separate the generated parts of the code from the manually written code, so that regeneration does not overwrite any manual changes. However, I have also dealt more than once with the situation where the task was to migrate some code written in old language X to another, more modern language Y, with the intention to do the maintenance afterwards in language Y. This is a valid use case for one-time code generation.
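
One common way to get that physical separation (sometimes called the generation gap pattern) is sketched below in Python. It assumes two hypothetical files, `customer_generated.py` and `customer.py`, shown together here so the example runs as one block: the generator owns a base class, and hand-written code goes into a subclass that regeneration never touches.

```python
# --- customer_generated.py (hypothetical file, overwritten on every build) ---
class CustomerBase:
    """GENERATED: plain field storage only; never edited by hand."""

    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name


# --- customer.py (hypothetical file, hand-written, never regenerated) ---
class Customer(CustomerBase):
    """Hand-written behavior lives in a subclass, outside the generated file."""

    def full_name(self):
        return f"{self.first_name} {self.last_name}"


print(Customer("Ada", "Lovelace").full_name())  # Ada Lovelace
```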

Doc Brown
  • I agree with this answer. Using something like Torque for Java, I can automatically generate Java source files with fields matching the SQL database. This makes CRUD operations much easier. The major benefit is type safety, including only being able to reference fields which exist in the database (thank you, autocomplete). – MTilsted Nov 29 '17 at 12:34
  • Yes, for statically typed languages this is the important part: you can make sure your **hand-written** code actually fits the generated code. – Paŭlo Ebermann Nov 29 '17 at 23:15
  • "migrate some code written in old language" - even then, the one-time code generation may be a big pain. For example, after some manual changes you detect a bug in the generator and need to redo the generation after the fix. Luckily, git or alike can usually ease the pain. – maaartinus Dec 01 '17 at 03:46
13

Pragmatic answer: is the code generation necessary and useful? Does it provide something that is genuinely very useful and needed for the proprietary codebase, or does it seem to just create another way of doing things in a way that contributes more intellectual overhead for sub-optimal results?

> OK, I know that code is data as well. What I don't understand is, why generate code? Why not make it into a function which can accept parameters and act on them?

If you have to ask this question and there's no clear answer, then probably the code generation is superfluous and merely contributing exoticism and a great deal of intellectual overhead to your codebase.

Meanwhile if you take something like OpenShadingLanguage: https://github.com/imageworks/OpenShadingLanguage

... then such questions need not be raised since they are immediately answered by the impressive results.

> OSL uses the LLVM compiler framework to translate shader networks into machine code on the fly (just in time, or "JIT"), and in the process heavily optimizes shaders and networks with full knowledge of the shader parameters and other runtime values that could not have been known when the shaders were compiled from source code. As a result, we are seeing our OSL shading networks execute 25% faster than the equivalent shaders hand-crafted in C! (That's how our old shaders worked in our renderer.)

In such a case you don't need to question the existence of the code generator. If you work in this type of VFX domain, then your immediate response is usually more along the lines of, "shut up and take my money!" or, "wow, we also need to make something like this."
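
The underlying idea, generating code only once the runtime values are known, can be illustrated on a much smaller scale. The following Python sketch is not how OSL or LLVM work; `specialize_polynomial` is a hypothetical helper that builds source text for one fixed set of coefficients and compiles it, dropping zero terms that a general-purpose function would still have to visit.

```python
def specialize_polynomial(coefficients):
    """Generate and compile a function for one fixed polynomial.

    coefficients[i] multiplies x**i. Zero terms are dropped at generation
    time, which a generic loop over the coefficients could not do up front.
    """
    terms = [
        f"{c!r} * x ** {power}" if power else repr(c)
        for power, c in enumerate(coefficients)
        if c != 0
    ]
    body = " + ".join(terms) or "0"
    source = f"def poly(x):\n    return {body}\n"
    namespace = {}
    exec(compile(source, "<specialized>", "exec"), namespace)
    return namespace["poly"], source


poly, source = specialize_polynomial([1, 0, 3])  # represents 1 + 3*x**2
print(source)
print(poly(2))  # 13
```

Whether this kind of specialization pays off depends entirely on the workload; the sketch makes no performance claim of its own.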

marstato
  • _translate shader networks into machine code_. This sounds like a compiler rather than a code generator, no? – Utku Nov 29 '17 at 04:32
  • 2
    It basically takes a nodal network the user connects and generates intermediary code which is compiled JIT by LLVM. The distinction between compiler and code generator is kind of fuzzy. Were you thinking more on the lines of code generation features in languages like templates in C++ or the C preprocessor? –  Nov 29 '17 at 04:34
  • I was thinking of any generator that would output source code. – Utku Nov 29 '17 at 04:37
  • I see, where the output is still for human consumption I assume. OSL also generates intermediary source code but it's low-level code that's close to assembly for LLVM consumption. It's typically not code that's meant to be maintained (instead the programmers maintain the nodes used to generate the code). Most of the time I think those types of code generators are more likely to be abused than useful enough to justify their worth, especially if you have to constantly regenerate the code as part of your build process. Sometimes they still have a genuine place though to address shortcomings... –  Nov 29 '17 at 04:39
  • ... of the language(s) available when used for a particular domain. Qt has one of those controversial ones with its meta-object compiler (MOC). The MOC reduces the boilerplate you would normally need to provide properties and reflection and signals and slots and so forth in C++, but not to such an extent to clearly justify its existence. I often think Qt could have been better without the cumbersome burden of the MOC's code generation. –  Nov 29 '17 at 04:39
  • I've counted 5 "compilers" in Qt, not including the "normal" C++ one: C++ -> C++ (moc); xml -> C++ (uic); C++ -> xml (lupdate); xml -> resource (lrelease); resource -> C++ (qrc). There may be others I've missed – Caleth Aug 01 '18 at 15:54
13

> why generate source code?

I've encountered two use cases for generated (at build time, and never checked in) code:

  1. Automatically generate boilerplate code such as getters/setters, toString, equals, and hashCode from a language built to specify such things (e.g. Project Lombok for Java).
  2. Automatically generate DTO type classes from some interface spec (REST, SOAP, whatever) to then be used in the main code. This is similar to your language bridge issue, but ends up being cleaner and simpler, with better type handling than trying to implement the same thing without generated classes.
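
As a loose analogy to the first case, Python's standard `dataclasses` module derives the same kind of boilerplate from a short declaration, although it does so when the class is defined rather than as files at build time. In this sketch, `Point` is just an example type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    x: int
    y: int


# __init__, __repr__, __eq__ and __hash__ were all generated by the decorator.
a, b = Point(1, 2), Point(1, 2)
print(a)            # Point(x=1, y=2)
print(a == b)       # True
print(len({a, b}))  # 1
```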

Maybe_Factor
  • 16
    Highly repetitive code in inexpressive languages. For instance I had to write code that essentially did the same thing on many similar but not identical data structures. It probably could have been done with something like a C++ template (hey isn't *that* code generation?). But I was using C. Code generation saved me writing lots of near identical code. – Nick Keighley Nov 29 '17 at 09:29
  • 1
    @NickKeighley Perhaps your toolchain was not permitting you to use another more suitable language? – Lorraine Nov 29 '17 at 10:23
  • 7
    You don't usually get to pick and choose your implementation language. The project was in C, that wasn't an option. – Nick Keighley Nov 29 '17 at 10:49
  • 1
    @Wilson the more expressive languages often use code generation (e.g. lisp macros, ruby on rails), they just don't require it to be saved as text in the meantime. – Pete Kirkham Nov 29 '17 at 10:53
  • 4
    Yeah, code-generation is essentially meta-programming. Languages like Ruby allow you to do meta-programming in the language itself, but C does not so you have to use code-generation instead. – Sean Burton Nov 29 '17 at 11:45
  • 1
    @SeanBurton C has macros, which is sort of like code generation. – Captain Man Nov 29 '17 at 13:19
  • @PeteKirkham Lisp macros don't really generate _code_, they generate S-expressions. Sure, you could dump these in text form and thus get Lisp code again, but that's not usually done because it would incur parsing overhead. (This is part of the reason why Lisps tend to be faster than other dynamic languages: the light list-based syntax can be held very efficiently in memory. Actually, I think e.g. Python also pre-compiles source file to a more efficient, non-text representation.) – leftaroundabout Nov 29 '17 at 20:48
  • @leftaroundabout that is what I said - they generate code without it being text. Plenty of code isn't text - machine code, byte code, cons cells (IME "s-expressions" normally refers to the text representation, but CLHS has both object and text for expression http://www.lispworks.com/documentation/HyperSpec/Body/26_glo_e.htm#expression ). – Pete Kirkham Nov 30 '17 at 09:09
13

Sussman had a lot of interesting things to say about this in his classic "Structure and Interpretation of Computer Programs", mainly about the code-data duality.

For me the major use of ad hoc code generation is making use of an available compiler to convert some little domain-specific language to something I can link into my programs. Think BNF, think ASN.1 (actually, don't, it is ugly), think data dictionary spreadsheets.

Trivial domain-specific languages can be a huge time saver, and outputting something that can be compiled by standard language tools is the way to go when creating such things. Which would you rather edit: a non-trivial hand-hacked parser in whatever native language you are writing, or the BNF for an auto-generated one?

By outputting text that is then fed to some system compiler, I get all of that compiler's optimisation and system-specific config without having to think about it.

I am effectively using the compiler input language as just another intermediate representation. What is the problem? Text files are not inherently source code; they can be an IR for a compiler, and if they happen to look like C or C++ or Java or whatever, who cares?

Now if you are hard of thinking you might edit the OUTPUT of the toy language parser, which will clearly disappoint the next time someone edits the input language files and rebuilds. The answer is to not commit the auto-generated IR to the repo and to have it generated by your toolchain (and avoid having such people in your dev group; they are usually happier working in marketing).

This is not so much a failure of expressiveness in our languages as an expression of the fact that sometimes you can get (or massage) parts of the specification into a form that can be automatically converted into code, and that will usually beget far fewer bugs and be far easier to maintain. If I can give our test and configuration guys a spreadsheet they can tweak, and a tool they then run that takes that data and spits out a complete hex file for the flash on my ECU, then that is a huge time saving over having someone manually translate the latest setup into a set of constants in the language of the day (complete with typos).
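As a rough illustration of the spreadsheet idea, the sketch below reads a CSV export and emits a C header of constants. This is a simpler target than the hex file mentioned above, and the column names (`name`, `value`, `comment`), the sample rows and the include guard are all assumptions made up for the example.

```python
import csv
import io

# Hypothetical spreadsheet export; a real tool would read the file from disk.
SAMPLE = (
    "name,value,comment\n"
    "idle_rpm,800,Target idle speed\n"
    "max_boost_kpa,0x96,Boost limit\n"
)


def generate_header(csv_text, guard="CONFIG_GENERATED_H"):
    """Turn rows of (name, value, comment) into a C header of unsigned constants."""
    lines = [
        "/* Generated file: do not edit by hand. */",
        f"#ifndef {guard}",
        f"#define {guard}",
        "",
    ]
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["name"].strip().upper()
        value = int(row["value"].strip(), 0)  # base 0: decimal or 0x-prefixed hex
        if value < 0:
            raise ValueError(f"{name}: negative values are not supported")
        lines.append(f"#define {name} {value}u  /* {row['comment'].strip()} */")
    lines += ["", f"#endif /* {guard} */"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(generate_header(SAMPLE))
```

A malformed cell makes `int()` raise at generation time, so the typo is caught before anything is compiled or flashed.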

Same thing with building models in Simulink and then generating C with RTW, then compiling to target with whatever tool makes sense. The intermediate C is unreadable, so what? The high-level Matlab RTW stuff only needs to know a subset of C, and the C compiler takes care of the platform details. The only time a human has to grovel through the generated C is when the RTW scripts have a bug, and that sort of thing is far easier to debug with a nominally human-readable IR than with just a binary parse tree.

You can of course write such things to output bytecode or even executable code, but why would you do that? We have tools for converting an IR to those things.

Dan Mills
  • This is good, but I'd add that there is a tradeoff when determining which IR to use: using C as an IR makes some things easier and other things harder, when compared to, say, x86 assembly language. The choice is even more significant when choosing between, say, Java language code and Java bytecode, as there are many more operations that only exist in one or the other language. – Daniel Pryden Nov 29 '17 at 14:21
  • 2
    But X86 assembly language makes a poor IR when targeting an ARM or PPC core! All things are a tradeoff in engineering, that's why they call it Engineering. One would hope that the possibilities of the Java bytecode were a strict superset of the possibilities of the Java language, and that this is generally true as you get closer to the metal irrespective of toolchain and where you inject the IR. – Dan Mills Nov 29 '17 at 14:29
  • Oh, I totally agree: my comment was in response to your final paragraph questioning why you'd ever output bytecode or some lower-level thing -- sometimes you do need the lower level. (In Java specifically, there are a lot of useful things you can do with bytecode that you can't do in the Java language itself.) – Daniel Pryden Nov 29 '17 at 16:38
  • 2
    I don't disagree, but there is a cost to using an IR closer to the metal, not only in reduced generality, but in the fact that you usually end up responsible for more of the really annoying low level optimisation. The fact that we generally these days think in terms of optimising algorithm choice rather than implementation is a reflection on just how far compilers have come. Sometimes you have to go really close to the metal in these things, but think twice before throwing away the compiler's ability to optimise by using too low level an IR. – Dan Mills Nov 29 '17 at 16:58
  • 1
    *"they are usually happier working in marketing"* Catty, but funny. – dmckee --- ex-moderator kitten Dec 01 '17 at 02:40
  • "Structure and organisation of computer programs": Did you mean "Structure and Interpretation of Computer Programs" (https://mitpress.mit.edu/sicp/full-text/book/book.html)? – Giorgio Dec 02 '17 at 10:17
  • Yea, "Structure and interpretation" indeed. My memory is clearly going. – Dan Mills Dec 02 '17 at 11:58
  • What is "IR"?.. – Nakilon Dec 04 '17 at 20:19
  • IR = Intermediate Representation, compiler geek speak for anything that lies between the stuff the human is supposed to edit and machine code. – Dan Mills Dec 04 '17 at 22:04
8

No, generating intermediate code is not an anti-pattern. The answer to the other part of your question, "Why do it?", is a very broad (and separate) question, though I will give some reasons anyway.

Historical ramifications of never having intermediate human-readable code

Let's take C and C++ as examples since they are among the most famous languages.

Note that, logically, compiling C code first produces human-readable assembly code rather than machine code. Likewise, old C++ compilers used to physically compile C++ code into C code. In that chain of events, you could compile from human-readable code 1 to human-readable code 2 to human-readable code 3 to machine code. "Why?" Why not?

If intermediate, human-readable code was never generated, we might not even have C or C++ at all. That is certainly a possibility; people take the path of least resistance to their goals, and if some other language gained steam first because of C development stagnation, C might have died while it was still young. Of course, you could argue "But then maybe we would be using some other language, and maybe it would be better." Maybe, or maybe it would be worse. Or maybe we would all still be writing in assembly.

Why use intermediate human-readable code?

  1. Sometimes intermediate code is desired so that you can modify it before the next step in building. I will admit this point is the weakest.
  2. Sometimes it's because the original work was not done in any human-readable language at all but in a GUI modeling tool instead.
  3. Sometimes you need to do something very repetitive, and the language should not cater to what you are doing because it is such a niche thing or such a complicated thing that it has no business increasing the complexity or the grammar of the programming language just to accommodate you.
  4. Sometimes you need to do something very repetitive, and there is no possible way to get what you want into the language in a generic way; either it cannot be represented by or conflicts with the language's grammar.
  5. One of the goals of computers is to reduce human effort, and sometimes code that is unlikely to ever be touched again (low likelihood of maintenance) can have meta-code written to generate your longer code in a tenth the time; if I can do it in 1 day instead of 2 weeks and it's not likely to be maintained ever, then I better generate it - and on the off chance that someone 5 years from now is annoyed because they actually do need to maintain it, then they can spend the 2 weeks writing it out fully if they want to, or be annoyed by 1 week of maintaining the awkward code (but we are still 1 week ahead at that point), and that's if that maintenance needs to be done at all.
  6. I am sure there are more reasons I am overlooking.

Example

I have worked on projects before where code needs to be generated based on data or information in some other document. For example, one project had all of its network messages and constant data defined in a spreadsheet and a tool that would go through the spreadsheet and generate a lot of C++ and Java code that let us work with those messages.

I am not saying that was the best way to set up that project (I wasn't part of its startup), but that was what we had, and it was hundreds (maybe even thousands, not sure) of structures and objects and constants that were being generated; at that point it's probably too late to try to redo it in something like Rhapsody. But even if it were redone in something like Rhapsody, then we still have code generated from Rhapsody anyway.

Also, having all that data in a spreadsheet was good in one way: it allowed us to represent the data in ways we could not have if it were all just in source code files.

Example 2

When I did some work in compiler construction, I used the tool ANTLR to do my lexing and parsing. I specified a language grammar, then I used the tool to spit out a ton of code in either C++ or Java, then I used that generated code alongside my own code and included it in the build.

How else should that have been done? Perhaps you could come up with another way; there probably are other ways. But for that work, the other ways would have been no better than the generated lex/parse code I had.

Aaron
  • I've used intermediate code as a sort of file format and debugging trace when the two systems were incompatible but had a stable API of some kind, in a very esoteric scripting language. It wasn't meant to be read manually, but could have been, in the same way XML could have been. But this is more common than you'd think; after all, web pages work this way, as somebody pointed out. – joojaa Nov 30 '17 at 21:38
7

A bit of a more pragmatic answer, focusing on why and not on what is and isn't source code. Note that generating source code is a part of the build process in all of these cases - so the generated files shouldn't find their way into source control.

Interoperability/simplicity

Take Google's Protocol Buffers, a prime example: you write a single high level protocol description which can be then used to generate the implementation in multiple languages - often different parts of the system are written in different languages.

Implementation/technical reasons

Take TypeScript - browsers can't interpret it, so the build process uses a transpiler (code-to-code translator) to generate JavaScript. In fact many new or esoteric compiled languages start with transpiling to C before they get a proper compiler.

Ease of use

For embedded projects (think IoT) written in C and using only a single binary (RTOS or no OS) it is quite easy to generate a C array with the data, to be compiled as if it were normal source code, as opposed to linking the data in directly as resources.
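A sketch of what such a build step could look like, similar in spirit to `xxd -i`. The function name `bytes_to_c_array` and the symbol name `logo_png` are made up for the example.

```python
def bytes_to_c_array(data, name, per_line=12):
    """Return C source that defines `data` as a const byte array plus its length."""
    if not data:
        # An empty initializer list is not valid in older C standards.
        raise ValueError("refusing to generate an empty array")
    lines = [f"const unsigned char {name}[] = {{"]
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        lines.append("    " + ", ".join(f"0x{byte:02x}" for byte in chunk) + ",")
    lines.append("};")
    lines.append(f"const unsigned int {name}_len = {len(data)};")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # In a real build the bytes would come from a file, e.g. open(path, "rb").read().
    print(bytes_to_c_array(b"\x89PNG\r\n\x1a\n", "logo_png"))
```

The result is compiled and linked like any other C file, so no resource loader or file system is needed on the target.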

Edit

Expanding on protobuf: code generation allows the generated objects to be first-class classes in any language. In a compiled language a generic parser would by necessity return a key-value structure, which means you need a lot of boilerplate code, you miss out on some compile-time checks (on keys and types of values in particular), and you get worse performance and no code completion. Imagine all those void* in C or that huge std::variant in C++ (if you have C++17); some languages may have no such feature at all.
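The difference between a key-value structure and a generated class can be sketched even in Python, with the caveat that Python only checks at run time, so this approximates the compile-time argument rather than demonstrating it. The `SCHEMA` dictionary is a made-up stand-in for a message description; real protobuf uses `.proto` files and `protoc`, which this does not imitate.

```python
# Hypothetical message description (not the protobuf format).
SCHEMA = {"Person": [("name", "str"), ("id", "int")]}


def generate_source(schema):
    """Return Python source defining one dataclass per message."""
    out = ["from dataclasses import dataclass"]
    for message, members in schema.items():
        out += ["", "", "@dataclass", f"class {message}:"]
        out += [f"    {field}: {type_name}" for field, type_name in members]
    return "\n".join(out) + "\n"


def load_generated(schema):
    """Execute the generated source and return its namespace.

    A real build would write the source to a file and import it instead.
    """
    namespace = {"__name__": "generated_messages"}
    exec(generate_source(schema), namespace)
    return namespace


if __name__ == "__main__":
    Person = load_generated(SCHEMA)["Person"]

    generic = {"name": "Ada", "idd": 1}  # key typo goes unnoticed
    print(generic.get("id"))             # silently yields None

    try:
        Person(name="Ada", idd=1)        # same typo against the generated class
    except TypeError as exc:
        print("rejected:", exc)
```

In a statically typed language the second case would fail at compile time, and the editor could offer `name` and `id` as completions.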

jaskij
  • For the first reason, I think the OP's idea would be to have a generic implementation in each language (which takes the protocol buffers description and then parses/consumes the on-the-wire format). Why would this be worse than generating code? – Paŭlo Ebermann Nov 29 '17 at 23:12
  • @PaŭloEbermann apart from the usual performance argument, such a generic interpretation would make it impossible to use those messages as first-class objects in compiled (and possibly interpreted) languages - in C++ for example such an interpreter would by necessity return a key-value structure. Of course you can then get that kv into your classes but it can turn into a lot of boilerplate code. And there is code completion too. And compile-time checking - your compiler won't check whether your literals have typos. – jaskij Nov 29 '17 at 23:21
  • I agree ... could you add this into the answer? – Paŭlo Ebermann Nov 29 '17 at 23:25
  • @PaŭloEbermann done – jaskij Nov 29 '17 at 23:47
7

Is source code generation an anti pattern?

It's a work-around for an insufficiently expressive programming language. There is no need to generate code in a language that contains adequate built-in meta-programming.

kevin cline
  • 3
    It's also a workaround for having to write a full, down-to-native-object-code compiler for a more-expressive language. Generate C, let a compiler with a good optimizer take care of the rest. – Blrfl Nov 29 '17 at 18:01
  • Not always. Sometimes you have one or more databases containing some definitions for e.g. signals on a bus. Then you want to pull this information together, maybe do some consistency checks and then write code that interfaces between the signals coming from the bus and the variables you expect to have in your code. If you can show me a language that has meta-programming that makes it easy to use some client provided Excel sheets, a database and other data-sources and creates the code I need, with some necessary checks on data validity and consistency, then by all means show me. – CodeMonkey Nov 30 '17 at 06:25
  • @CodeMonkey: something like Ruby on Rails' ActiveRecord implementation comes to mind. There's no need to duplicate the database table schema in the code. Just map a class to a table and write business logic using the column names as properties. I can't imagine any sort of pattern that could be produced by a code generator that couldn't also be managed by Ruby meta-programming. C++ templates are also extremely powerful, albeit a bit arcane. Lisp macros are another powerful in-language meta-programming system. – kevin cline Dec 01 '17 at 08:11
  • @kevincline what I meant was code that was based on some data from the database (could be constructed from it), but not the database itself. I.e. I have information about which signals I receive in Excel Table A. I have a Database B with information on these signals, etc. Now I want to have a class that accesses these signals. There's no connection to the database or the Excel sheet on the machine that runs the code. Using really complicated C++ Templating to generate this code at compile time, instead of a simple code generator. I'll pick codegen. – CodeMonkey Dec 04 '17 at 11:43
7

What you're missing is reuse.

We have an amazing tool to turn source code text into binary, called a compiler. Its inputs are well-defined (usually!), and it has been through plenty of work to refine how it does optimisation. If you actually want to use the compiler to carry out some operations, you want to use an existing compiler and not write your own.

Plenty of people do invent new programming languages and write their own compilers. Pretty much without exception, they are all doing this because they enjoy the challenge, not because they need the features which that language provides. Everything which they do could be done in another language; they are simply creating a new language because they like those features. What that won't get them though is a well-tuned, fast, efficient, optimising compiler. It'll get them something which can turn text into binary, sure, but it will not be as good as all existing compilers.

Text is not just something which humans read and write. Computers are perfectly at home with text too. In fact formats like XML (and other related formats) are successful because they use plain text. Binary file formats are often obscure and poorly-documented, and a reader cannot easily find out how they work. XML is relatively self-documenting, making it easier for people to write code which uses XML-formatted files. And all programming languages are set up to read and write text files.

So, suppose you want to add some new facility to make your life easier. Perhaps it's a GUI layout tool. Perhaps it's the signals-and-slots interfaces which Qt provides. Perhaps it's the way that TI's Code Composer Studio lets you configure the device you're working with and pull the right libraries into the build. Perhaps it's taking a data dictionary and auto-generating typedefs and global variable definitions (yes, this is still very much a thing in embedded software). Whatever it is, the most efficient way to leverage your existing compiler is to create a tool which will take your configuration of whatever-it-is and automatically produce code in your language of choice.

It's easy to develop and easy to test, because you know what's going in and you can read the source code that it spits out. You don't need to spend man-years on building a compiler to rival GCC. You don't need to learn a complete new language, or require other people to. All you need to do is automate this one little area, and everything else stays the same. Job done.

Graham
  • Still the advantage of XML's text-basedness is just that _if necessary_, it can be read&written by humans (they don't normally bother once it works, but certainly do during development). In terms of performance and space-efficiency, binary formats are generally much better (which very often does not matter though, because the bottleneck is somewhere else). – leftaroundabout Nov 30 '17 at 09:49
  • @leftaroundabout If you need that performance and space-efficiency, sure. The reason many applications have gone to XML-based formats these days is that performance and space-efficiency are not the top criteria that they once were, and history has shown how poorly binary file formats are maintained. (Old MS Word documents for a classic example!) The point remains though - text is just as suited for computers to read as humans. – Graham Nov 30 '17 at 11:06
  • Sure, a badly-designed binary format may in effect actually perform worse than a properly thought through text format, and even a decent binary format is often not much more compact than XML packed with some general-purpose compression algorithm. IMO the best of both worlds is to use a human-readable specification through algebraic data types, and automatically generate an efficient binary representation from the AST of these types. See e.g. the [flat library](https://github.com/Quid2/flat). – leftaroundabout Nov 30 '17 at 11:23
6

Source code generation is not always an anti-pattern. For example, I am currently writing a framework which, from a given specification, generates code in two different languages (JavaScript and Java). The framework uses the generated JavaScript to record browser actions of the user, and uses the Java code in Selenium to actually execute the action when the framework is in replay mode. If I did not use code generation, I would have to manually make sure that both are always in sync, which is cumbersome and is also a logical duplication in some way.

If, however, one is using source code generation to replace features like generics, then it is an anti-pattern.
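A toy version of the "one specification, two languages" idea, to show why the two sides cannot drift apart. This is not the framework described above; the action list and the generated `Action` type are invented for the example.

```python
# Hypothetical shared specification: the single source of truth.
ACTIONS = ["click", "type_text", "navigate"]


def to_javascript(actions):
    """Emit a frozen JavaScript object mapping constant names to wire names."""
    body = ",\n".join(f'  {a.upper()}: "{a}"' for a in actions)
    return "const Action = Object.freeze({\n" + body + "\n});\n"


def to_java(actions):
    """Emit a Java enum carrying the same wire names."""
    body = ",\n".join(f'    {a.upper()}("{a}")' for a in actions)
    return (
        "public enum Action {\n" + body + ";\n\n"
        "    public final String wireName;\n\n"
        "    Action(String wireName) { this.wireName = wireName; }\n"
        "}\n"
    )


if __name__ == "__main__":
    print(to_javascript(ACTIONS))
    print(to_java(ACTIONS))
```

Adding an action means editing one list and rebuilding; both outputs change together.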

  • You could, of course, write your code once in ECMAScript and run it in Nashorn or Rhino on the JVM. Or, you could write a JVM in ECMAScript (or try to compile Avian to WebAssembly using Emscripten) and run your Java code in the browser. I'm not saying those are great ideas (well, they are probably terrible ideas :-D ), but at least they are possible if not feasible. – Jörg W Mittag Nov 29 '17 at 09:15
  • In theory, it is possible, but it is not a general solution. What happens if I cannot run one of the languages inside another? For example, one additional thing: I just created a simple Netlogo model using the code generation and have an interactive documentation of the system, which is always in sync with the recorder and the replayer. And in general, creating a requirement and then generating code keeps things which run semantically together in sync. – Hristo Vrigazov Nov 29 '17 at 09:36
6

Am I missing something here?

Maybe a good example where the intermediary code turned out to be the reason of success? I can offer you HTML.

I believe it was important for HTML to be simple and static - it made it easy to make browsers, it allowed to start mobile browsers early etc. As further experiments (Java applets, Flash) showed - more complex and powerful languages lead to more problems. It turns out that users actually are endangered by Java applets and visiting such websites was as safe as trying game cracks downloaded via DC++. Plain HTML, on the other hand, is harmless enough to allow us to check out any site with reasonable belief in security of our device.

However, HTML would be nowhere near where it is now if it wasn't computer generated. My answer wouldn't even show up on this page until someone manually rewrote it from the database into HTML file. Luckily you can make usable HTML in almost any programming language :)
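For instance, a few lines of Python are enough to turn stored records into markup. This is only a sketch; the record layout (`author`, `body`) is made up and has nothing to do with how this site actually renders pages.

```python
from html import escape


def render_answers(answers):
    """Render a list of {'author', 'body'} records as an HTML list."""
    items = [
        f"  <li><strong>{escape(a['author'])}</strong>: {escape(a['body'])}</li>"
        for a in answers
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>\n"


if __name__ == "__main__":
    # Hypothetical rows, as they might come out of a database query.
    print(render_answers([
        {"author": "A & B", "body": "Is x < y?"},
        {"author": "C", "body": "HTML is generated all the time."},
    ]))
```

Escaping the text is what keeps the generated HTML as harmless as hand-written static HTML.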

That is, if there is a code generator for something, then why not make that something a proper function which can receive the required parameters and do the right action that the "would generated" code would have done?

Can you imagine a better way to display the question and all of the answers and comments to the user than by using HTML as a generated in-between code?

Džuris
  • Yes, I can imagine a better way. HTML is a legacy of a decision by Tim Berners-Lee to allow the quick creation of a text-only web browser. That was perfectly fine at the time, but we wouldn't do the same with the benefit of hindsight. CSS has made all the various presentation element types (DIV,SPAN,TABLE,UL,etc.) unnecessary. – kevin cline Dec 01 '17 at 08:25
  • @kevincline I am not saying that HTML as such is without flaws, I was pointing out that introducing markup language (that can be generated by a program) worked out very well in this case. – Džuris Dec 01 '17 at 23:31
  • So HTML+CSS is better than just HTML. I've even written internal documentation for some projects I've worked on directly in HTML+CSS+MathJax. But most web pages I visit seem to have been produced by code generators. – David K Dec 02 '17 at 16:04
4

why generate source code?

Because it's faster and easier (and less error-prone) than writing the code manually, especially for tedious, repetitive tasks. You can also use the high-level tool to verify and validate your design before writing a single line of code.

Common use cases:

  • Modeling tools like Rose or Visual Paradigm;
  • Higher-level languages like Embedded SQL or an interface definition language that must be preprocessed into something compilable;
  • Lexer and parser generators like flex/bison;

As for your "why not just make it a function and pass parameters to it directly", note that none of the above are execution environments in and of themselves. There's no way to link your code against them.

John Bode
3

Generation of "source" code is an indication of a shortcoming of the language that is generated. Is using tools to overcome this an anti-pattern? Absolutely not - let me explain.

Typically code generation is used because there exists a higher-level definition that can describe the resulting code much less verbosely than the lower-level language. So code generation facilitates efficiency and terseness.

When I write C++, I do so because it allows me to write code more efficiently than using assembler or machine code. Still, machine code is generated by the compiler. In the beginning, C++ was simply a preprocessor that generated C code. General-purpose languages are great for generating general-purpose behavior.

In the same way, by using a DSL (domain-specific language) it is possible to write terse code, though perhaps code constrained to a specific task. This makes it less complicated to generate the correct behavior. Remember that code is a means to an end. What a developer is looking for is an efficient way to generate behavior.

Ideally the generator can create fast code from an input that is simpler to manipulate and understand. If this is fulfilled, not using a generator is an anti-pattern. This anti-pattern typically comes from the notion that "pure" code is "cleaner", much in the same way a woodworker or other artisan might look at the use of power tools, or the use of CNC to "generate" workpieces (think golden hammer).

On the other hand, if the source of the generated code is harder to maintain, or the generated code is not efficient enough, the user is falling into the trap of using the wrong tools (sometimes because of the same golden hammer).

daramarak
2

Sometimes, your programming language just doesn't have the facilities you want, making it actually impossible to write functions or macros to do what you want. Or maybe you could do what you want, but the code to write it would be ugly. A simple Python script (or similar) can then generate the required code as part of your build process, which you then #include into the actual source file.

How do I know this? Because it's a solution I've reached for multiple times when working with various different systems, most recently SourcePawn. A simple Python script that parses a simple line of source code and produces two or three lines of generated code is far better than manually crafting the generated code, when you end up with two dozen such lines (creating all my cvars).
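To give a flavour of that kind of script (this is not the script mentioned above, just a minimal sketch under assumed names): each input line holds a name, a default and a quoted description, and the output is C-like code. `Setting`, `register_setting` and `register_all_settings` are hypothetical names, not part of SourcePawn or any real API.

```python
import shlex


def generate(spec_lines):
    """Expand one-line setting specs into declarations plus a registration function."""
    declarations, registrations = [], []
    for line in spec_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments in the spec
        # shlex.split keeps the quoted description together as one token.
        name, default, description = shlex.split(line)
        declarations.append(f"static Setting *g_{name};")
        registrations.append(
            f'    g_{name} = register_setting("{name}", "{default}", "{description}");'
        )
    init = ["void register_all_settings(void)", "{", *registrations, "}"]
    return "\n".join(declarations + [""] + init) + "\n"


if __name__ == "__main__":
    spec = [
        "# name default description",
        'max_players 16 "Maximum number of players"',
        'round_time 300 "Round length in seconds"',
    ]
    print(generate(spec))
```

A line with the wrong number of fields fails the unpacking with a `ValueError`, which is a cheap sanity check on the spec. Descriptions containing double quotes would need escaping, which this sketch does not do.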

Demonstrative/example source code available if people want it.

rosuav
1

Text form is required for easy consumption by humans. Computers also process code in text form quite easily. Therefore generated code should be generated in the form that is easiest to generate and easiest to consume by computers, and that is very often readable text.

And when you generate code, the code generation process itself often needs to be debugged - by humans. It's very, very useful if the generated code is human readable so humans can detect problems in the code generation process. Someone has to write the code to generate code, after all. It doesn't happen out of thin air.

gnasher729
1

Generating Code, just once

Not all source code generation is a case of generating some code, and then never touching it; then regenerating it from the original source when it needs updating.

Sometimes you generate code just once, and then discard the original source, and moving forward maintain the new source.

This sometimes happens when porting code from one language to another. Particularly if one doesn't expect to want to later port over new changes in the original (e.g. old language code is not going to be maintained, or it is actually complete (e.g. in the case of some math functionality)).

One common case is that a code generator written to do this might only translate 90% of the code correctly, and then the last 10% needs to be fixed up by hand. That is still a lot faster than translating 100% by hand.

Such code generators are often very different from full language translators (like Cython or f2c), since the goal is to produce maintainable code, once. They are often made as a one-off, to do exactly what they have to. In many ways it is the next-level version of using a regex/find-replace to port code. "Tool-assisted porting", you could say.
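A small sketch of what such a one-off porting tool might look like, assuming the job is moving simple C `#define` constants into Python. It handles the easy lines and flags everything else for a human instead of guessing. The sample input is invented.

```python
import re

# Matches "#define NAME <decimal or hex integer>" and nothing fancier.
DEFINE = re.compile(
    r"^#define\s+([A-Z_][A-Z0-9_]*)\s+(-?(?:0|[1-9]\d*)|0[xX][0-9a-fA-F]+)\s*$"
)


def port_defines(c_source):
    """Translate simple integer #defines to Python; mark the rest as TODO."""
    out = []
    for line in c_source.splitlines():
        match = DEFINE.match(line)
        if match:
            out.append(f"{match.group(1)} = {match.group(2)}")
        elif line.strip():
            out.append(f"# TODO port by hand: {line.strip()}")
        else:
            out.append("")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(port_defines(
        "#define MAX 10\n"
        "#define MASK 0xFF\n"
        "#define SQ(x) ((x)*(x))\n"
    ))
```

Octal literals such as `010` deliberately fall through to the TODO branch, because they mean something different (or nothing) in Python. The generated file is then checked in and maintained by hand, and the tool can be thrown away.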

Generating Code, just once, from e.g. a website scrape.

Closely related is generating the code from some source you don't want to access again, e.g. if the actions needed to generate the code are not repeatable or consistent, or performing them is expensive. I am working on a pair of projects right now: DataDeps.jl and DataDepsGenerators.jl.

DataDeps.jl helps users download data (like standard ML datasets). To do this it needs what we call a RegistrationBlock. That is some code specifying some metadata, like where to download the files from, a checksum, and a message explaining to the user any terms/conditions/what the licensing status of the data is.

Writing those blocks can be annoying. And that information is often available in (structured or unstructured) forms on the websites where the data is hosted. So DataDepsGenerators.jl uses a web scraper to generate the RegistrationBlock code for some sites that host a lot of data.

It might not generate them correctly. So the dev using the generated code can and should check and correct it. Odds are they want to make sure it hasn't mis-scraped the licensing information, for example.

Importantly, users/devs working with DataDeps.jl do not need to install or use the web scraper to use the RegistrationBlock code that was generated. (And not needing to download and install a web scraper saves a fair bit of time, particularly for the CI runs.)

Generating source code once is not an antipattern, and it normally cannot be replaced with metaprogramming.

  • "report" is an English word that means something other than "port again". Try "re-port" to make that sentence clearer. (Commenting because too small for a suggested edit.) – Peter Cordes Nov 30 '17 at 04:38
  • Good catch @PeterCordes I have rephrased. – Frames Catherine White Nov 30 '17 at 04:46
  • Faster but potentially *much* less maintainable, depending on how horrible the generated code is. Fortran to C was a thing back in the day (C compilers were more widely available, so people would use `f2c` + `cc`), but the resulting code was not really a good starting point for a C version of the program, AFAIK. – Peter Cordes Nov 30 '17 at 04:49
  • 1
    Potentially, potentially not. It is not the fault in the concept of code generators that some code generators make non-maintainable code. In particular, a hand crafted tool, that doesn't have to catch every case can often make perfectly nice code. If 90% of the code is just list of array constants for example then generating those arrays constructors as an one off can trivially be done very nicely, and low effort. (On the other hand the C code output by Cython can't be maintained by humans. Because it is not intended to be. Just like you say for `f2c` back in the day) – Frames Catherine White Nov 30 '17 at 04:56
  • Ok, you're talking about tools that try to make readable but possibly wrong code. (Maybe make that point more clearly in your answer? I imagine a lot of people would think of translation tools like the ones I thought of, which don't produce maintainable code). But anyway, that seems risky, too; very easy to have a subtle bug for some corner case that's hard to catch and not obvious from reading through the code because it probably "looks right". – Peter Cordes Nov 30 '17 at 05:01
  • Edited to highlight that, thanks. Again, it depends on the circumstances whether it is too risky. A human porting the code by hand can also make non-obvious mistakes. "Oops, accidentally deleted an extra digit in this table. Now anyone calculating `sind(x)` where x>10000 and its modulo 360 is between 90 and 180 degrees is wrong." When tests are important, then tests are important. – Frames Catherine White Nov 30 '17 at 05:12
  • Using a tool to re-format a big table into array-initializer syntax for another language doesn't really count as porting code, IMO. Of course you should use a tool (or use `sed` or a regex search/replace in your editor to do the bulk of it and clean up the head/tail). Anyway, the edit is an improvement, but I'm not totally convinced. (To be honest, I don't have time to give this answer a careful read to see if the last part ends up saying something more about logic, not just data mixed with program logic, because I'm not familiar with the examples you give.) – Peter Cordes Nov 30 '17 at 05:26
  • 1
    The big table was just the simplest most reduced argument. Similar can be said for say converting for-loops or conditions. Indeed `sed` goes a long way, but sometimes one needs a bit more expressive power. The line between program logic and data is often a fine one. Sometimes the distinction isn't useful. JSON is (/was) just javascript object constructor code. In my example I am also generating object constructor code (is it data? maybe (maybe not since sometimes it has function calls). Is it better treated as code? yes.) – Frames Catherine White Nov 30 '17 at 05:42
1

There are a few different ways of using code generation. They could be divided in three major groups:

  • Generating code in a different language as output from a step in the compilation process. For the typical compiler this would be a lower-level language, but it could be to another high-level language as in the case of the languages which compile to JavaScript.
  • Generating or transforming code in the source code language as a step in the compilation process. This is what macros do.
  • Generating code with a tool separately from the regular compilation process. The output from this is code which lives as files together with the regular source code and is compiled along with it. For example entity classes for an ORM might be auto-generated from a database schema, or data transfer objects and service interfaces might be generated from an interface specification like a WSDL file for SOAP.

I would guess you are talking about the third kind of generated code, since this is the most controversial form. In the first two forms the generated code is an intermediate step which is very cleanly separated from the source code. But in the third form there is no formal separation between source code and generated code, except that the generated code probably has a comment which says "don't edit this code". That still leaves the risk of developers editing the generated code, which would be really ugly. From the viewpoint of the compiler, the generated code is source code.

Nevertheless, such forms of generated code can be really useful in a statically typed language. For example, when integrating with ORM entities, it is really useful to have strongly typed wrappers for the database tables. Sure, you could handle the integration dynamically at runtime, but you would lose type safety and tool support (code completion). A major benefit of a statically typed language is the support of the type system at the time of writing rather than just at runtime. (Conversely, this type of code generation is not very prevalent in dynamically typed languages, since in such a language it provides no benefit compared to runtime conversions.)
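To make the ORM case concrete, here is a minimal sketch of the third kind of generator: a standalone script that turns a table description into source text for a typed entity class. The `SCHEMA` dictionary, the `generate_entity_source` helper, and the `Customer` table are all made up for illustration; a real tool would read the schema from the database and write the result to a file that is compiled or type-checked along with the rest of the project.

```python
# Hypothetical schema, standing in for what a tool would read from a database.
SCHEMA = {
    "table": "customer",
    "columns": [("id", "int"), ("name", "str"), ("balance", "float")],
}


def generate_entity_source(schema):
    """Return Python source text for a typed entity class."""
    class_name = schema["table"].capitalize()
    lines = [
        "# GENERATED FILE - do not edit; regenerate from the schema instead.",
        "from dataclasses import dataclass",
        "",
        "",
        "@dataclass",
        f"class {class_name}:",
    ]
    for column, type_name in schema["columns"]:
        lines.append(f"    {column}: {type_name}")
    return "\n".join(lines) + "\n"


source = generate_entity_source(SCHEMA)
print(source)

# Normally the text would be written to a .py file and imported.
# exec() is used here only to show that the generated text is valid code.
namespace = {"__name__": "generated_entities"}
exec(source, namespace)
customer = namespace["Customer"](id=1, name="Ada", balance=12.5)
print(customer)
```

Because the output is ordinary source text, an editor or type checker can offer completion and checking on the fields of `Customer`, which is the benefit described above.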

That is, if there is a code generator for something, then why not make that something a proper function which can receive the required parameters and do the right action that the "would generated" code would have done?

Because type safety and code completion are features you want at compile time (and while writing code in an IDE), but regular functions are only executed at runtime.

There might be a middle ground though: F# supports the concept of type providers, which are basically strongly typed interfaces generated programmatically at compile time. This concept could probably replace many uses of code generation, and provide a cleaner separation of concerns.

JacquesB
0

Source code generation absolutely does mean that the generated code is data. But it is first-class data, data that the rest of the program can manipulate.

The two most common types of data that I am aware of being integrated into source code are graphical information about windows (number and placement of various controls) and ORMs. In both cases the integration via code generation makes manipulating the data easier, because you don't have to go through extra "special" steps to use them.

When working with the original (1984) Macs, dialog and window definitions were created using a resource editor that kept the data in a binary format. Using these resources in your application was harder than it would have been if the "binary format" had been Pascal.

So, no, source code generation is not an anti-pattern. It allows making the data part of the application, which makes it easier to use.

jmoreno
0

Code and data both are: Information.

Data is the information exactly in the form you need (and value). Code is also information, but in an indirect or intermediate form. In essence, code is also a form of data.

More specifically, code is information for machines to offload humans from processing information all by themselves.

Offloading humans from information processing is the most important motive. Intermediate steps are acceptable as long as they make life easy. That's why intermediate information mapping tools exist. Like code generators, compilers, transpilers, etc.

why generate source code? Why not make it into a function which can accept parameters and act on them?

Let's say someone offers you such a mapping function, whose implementation is obscure to you. As long as the function works as promised, would you care if internally it's generating source code or not?

Peter Mortensen
S.D.
0

Code generation is an anti-pattern when it costs more than it accomplishes. This situation occurs when generation takes place from A to B where A is almost the same language as B, but with some minor extensions that could be done just by coding in A with less effort than all the custom tooling and build staging for A to B.

The trade off is more prohibitive against code generation in languages that don't have meta-programming facilities (structural macros) because of the complications and inadequacies of achieving metaprogramming through the staging of external text processing.

The poor trade off could also have to do with the quantity of use. Language A could be substantially different from B, but the whole project with its custom code generator only uses A in one or two small places, so that the total amount of complexity (small bits of A, plus the A -> B code generator, plus the surrounding build staging) exceeds the complexity of a solution just done in B.

Basically, if we commit to code generation, we should probably "go big or go home": make it have substantial semantics, and use it a lot, or don't bother.

Kaz
  • Why did you remove the "When Bjarne Stroustrup first implemented C++ ..." paragraph? I think it was interesting. – Utku Nov 30 '17 at 18:22
  • @Utku Other answers cover this from the point of view of compiling an entire, sophisticated language, in which the rest of a project is entirely written. I don't think it's representative of the majority of what is called "code generation". – Kaz Dec 01 '17 at 16:36
0

I didn't see this stated clearly (one or two answers touched upon it, but it didn't seem very clear).

Generating code (as you said, as though it was data) is not a problem--it's a way to reuse a compiler for a secondary purpose.

Editing generated code is one of the most insidious, evil, horrific anti-patterns you will ever come across. Do not do this.

At best, editing generated code pulls a bunch of poor code into your project (the ENTIRE set of code is now truly SOURCE CODE--no longer data). At worst the code pulled into your program is highly redundant, poorly named garbage that is nearly completely unmaintainable.

I suppose a third category is code you generate once (GUI generator?) and then edit to help you get started/learn. This is a little of each--it CAN be a good way to start, but your GUI generator will be targeted at producing "generatable" code that won't be a great start for you as a programmer. In addition, you might be tempted to use it again for a second GUI, which means pulling redundant SOURCE code into your system.

If your tooling is smart enough to disallow any edits whatsoever of generated code, go for it. If not, I'd call it one of the worst anti-patterns out there.
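One way tooling could at least detect such edits is sketched below; it is an illustration, not a description of any particular tool. The generator stamps each file with a hash of its body, and a build step refuses files whose body no longer matches. The header format and the `stamp` / `is_unmodified` helpers are assumptions invented for this example.

```python
import hashlib

# Hypothetical header format; any marker the build can recognize would do.
HEADER_PREFIX = "# GENERATED, DO NOT EDIT. sha256="


def stamp(body: str) -> str:
    """Prefix generated code with a header recording the hash of its body."""
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return f"{HEADER_PREFIX}{digest}\n{body}"


def is_unmodified(file_text: str) -> bool:
    """Return True if the body still matches the hash stored in the header."""
    header, _, body = file_text.partition("\n")
    if not header.startswith(HEADER_PREFIX):
        return False
    recorded = header[len(HEADER_PREFIX):]
    return hashlib.sha256(body.encode("utf-8")).hexdigest() == recorded


generated = stamp("TABLE = [1, 2, 3]\n")
edited = generated + "TABLE.append(4)\n"  # simulates a hand edit

print(is_unmodified(generated))
print(is_unmodified(edited))
```

A check like this only detects edits after the fact; it cannot stop someone from also updating the hash, so it guards against accidents rather than determined editing.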

Bill K
0

If something can be generated, then that thing is data, not code.

Inasmuch as you stipulate later on that code is data, your proposition reduces to "If something can be generated, then that thing is not code." Would you say, then, that assembly code generated by a C compiler is not code? What if it happens to coincide exactly with assembly code that I write by hand? You're welcome to go there if you wish, but I won't be coming with you.

Let's start instead with a definition of "code". Without getting too technical, a pretty good definition for the purposes of this discussion would be "machine-actionable instructions for performing a computation."

Given that, isn't this whole idea of source code generation a misunderstanding?

Well yes, your starting proposition is that code cannot be generated, but I reject that proposition. If you accept my definition of "code" then there should be no conceptual problem with code generation in general.

That is, if there is a code generator for something, then why not make that something a proper function which can receive the required parameters and do the right action that the "would generated" code would have done?

Well that's an entirely different question, about the reason for employing code generation, rather than about its nature. You are proposing the alternative that instead of writing or using a code generator, one writes a function that computes the result directly. But in what language? Gone are the days when anyone wrote directly in machine code, and if you write your code in any other language then you depend on a code generator in the form of a compiler and / or assembler to produce a program that actually runs.

Why, then, do you prefer to write in Java or C or Lisp or whatever? Even assembler? I assert that it's at least in part because those languages provide abstractions for data and operations that make it easier to express the details of the computation you want to perform.

The same is true of most higher-level code generators, too. The prototypical cases are probably scanner and parser generators such as lex and yacc. Yes, you could write a scanner and a parser directly in C or in some other programming language of your choice (even raw machine code), and sometimes one does. But for a problem of any significant complexity, using a higher-level, special-purpose language such as lex's or yacc's makes the hand-written code easier to write, read, and maintain. Usually much smaller, too.

You should also consider what exactly you mean by "code generator". I would consider C preprocessing and the instantiation of C++ templates to be exercises in code generation; do you object to these? If not, then I think you'll need to perform some mental gymnastics to rationalize accepting those but rejecting other flavors of code generation.

If it is being done for performance reasons, then that sounds like a shortcoming of the compiler.

Why? You are basically positing that one should have a universal program to which the user feeds data, some classified as "instructions" and others as "input", and which proceeds to perform the computation and emit more data that we call "output". (From a certain point of view, one might call such a universal program an "operating system".) But why do you suppose that a compiler should be as effective at optimizing such a general-purpose program as it is at optimizing a more specialized program? The two programs have different characteristics and different capabilities.

If it is being done to bridge two languages, then that sounds like a lack of interface library.

You say that as if having a universal-to-some-degree interface library would necessarily be a good thing. Perhaps it would, but in many cases such a library would be big and difficult to write and maintain, and maybe even slow. And if such a beast in fact does not exist to serve the particular problem at hand, then who are you to insist that one be created, when a code generation approach can solve the problem much more quickly and easily?

Am I missing something here?

Several things, I think.

I know that code is data as well. What I don't understand is, why generate source code? Why not make it into a function which can accept parameters and act on them?

Code generators transform code written in one language to code in a different, usually lower-level language. You're asking, then, why people would want to write programs using multiple languages, and especially why they might want to mix languages of subjectively different levels.

But I touched on that already. One chooses a language for a particular task based in part on its clarity and expressiveness for that task. Inasmuch as smaller code has fewer bugs on average and is easier to maintain, there is also a bias toward higher-level languages, at least for large-scale work. But a complex program involves many tasks, and often some of them can be more effectively addressed in one language, whereas others are more effectively or more concisely addressed in another. Using the right tool for the job sometimes means employing code generation.

John Bollinger
0

Answering the question within the context of your comment:

The compiler's duty is to take a code written in human-readable form and convert it to machine-readable form. Hence, if the compiler cannot create a code that is efficient, then the compiler is not doing its job properly. Is that wrong?

A compiler will never be optimized for your task. The reason for that is simple: it's optimized to do many tasks. It's a general purpose tool used by many people for many different tasks. Once you know what your task is, you can approach the code in a domain-specific manner, making tradeoffs that the compilers could not.

As an example, I've worked on software where an analyst may need to write some code. They could write their algorithm in C++, and add in all the bounds checks and memoization tricks that they depend on, but that requires knowing a lot about the inner workings of the code. They would rather write something simple, and let me throw an algorithm at it to generate the final C++ code. Then I can do exotic tricks to maximize performance like static analysis that I would never expect my analysts to endure. Code generation allows them to write in a domain-specific manner which lets them get the product out the door easier than any general purpose tool could.
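A toy version of that workflow might look like the sketch below. It is not the system described above: it emits Python rather than C++ to stay short, and the `SPEC` format, the `discount` rule, and `generate_function_source` are invented for illustration. The point is only that the analyst writes the formula, while the generator adds the range check and memoization around it.

```python
# Hypothetical analyst-facing description: just a name, a valid range, a formula.
SPEC = {
    "name": "discount",
    "arg": "quantity",
    "min": 0,
    "max": 1000,
    "formula": "0.05 * min(quantity // 100, 5)",
}


def generate_function_source(spec):
    """Emit source for a memoized, range-checked function from a simple spec."""
    return (
        "import functools\n"
        "\n"
        "\n"
        "@functools.lru_cache(maxsize=None)\n"
        f"def {spec['name']}({spec['arg']}):\n"
        f"    if not {spec['min']} <= {spec['arg']} <= {spec['max']}:\n"
        f"        raise ValueError('{spec['arg']} out of range')\n"
        f"    return {spec['formula']}\n"
    )


source = generate_function_source(SPEC)
print(source)

# exec() stands in for writing the file and compiling it with the project.
namespace = {}
exec(source, namespace)
discount = namespace["discount"]
print(discount(250))
```

Since the generator sees the whole spec before emitting anything, it is also the natural place to report mistakes in domain terms, which is the error-handling advantage discussed further down.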

I have also done the exact opposite. I have another piece of work that I've done which had a mandate "no code generation." We still wanted to make life easy on those using the software, so we used massive amounts of template metaprogramming to make the compiler generate the code on the fly. Thus, I only needed the general purpose C++ language to do my job.

However, there's a catch. It was tremendously difficult to guarantee that the errors were readable. If you've ever used template metaprogrammed code before, you know that a single innocent mistake can generate an error that takes 100 lines of incomprehensible class names and template arguments to understand what went wrong. This effect was so pronounced that the recommended debugging process for syntax errors was "Scroll through the error log until you see the first time one of your own files has an error. Go to that line, and just squint at it until you realize what you did wrong."

Had we used code generation, we could have had much more powerful error handling capabilities, with human readable errors. C'est la vie.

Cort Ammon
0

Processor instruction sets are fundamentally imperative, but programming languages can be declarative. Running a program written in a declarative language inevitably requires some type of code generation. As mentioned in this answer and others, a major reason for generating source code in a human-readable language is to take advantage of the sophisticated optimizations performed by compilers.

Kevin Krumwiede
0

Not at All

One of the most widely used applications of code generation is Google's Protocol Buffers. Protobufs are a serialization tool that microservices (or any other application of serialization) can use to communicate with each other. What Google engineers found over time was that writing serialization protocols by hand was costly, very error-prone, and hard to extend. To combat this, they created a library that allows you to describe your messages, and then simply generate the code that reads and writes them.

Given that, isn't this whole idea of source code generation a misunderstanding? That is, if there is a code generator for something, then why not make that something a proper function which can receive the required parameters and do the right action that the "would generated" code would have done?

The Python implementation of the Protobuf spec actually does this sort of thing, but the only reason this is at all feasible is that Python has a lot of runtime metaprogramming facilities that just aren't available in a statically compiled language. For instance, Python allows you to create types at runtime, which are easily usable due to duck typing.
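As a rough illustration of what "creating types at runtime" means, the sketch below builds a message class from a list of field names using the built-in three-argument `type()`. It is not how the Protobuf library is implemented; `make_message_class` and the `Person` message are hypothetical names used only for this example.

```python
def make_message_class(name, field_names):
    """Build a simple message class at runtime from a list of field names."""
    field_names = tuple(field_names)

    def __init__(self, **values):
        unknown = set(values) - set(field_names)
        if unknown:
            raise TypeError(f"unknown fields: {sorted(unknown)}")
        for field in field_names:
            setattr(self, field, values.get(field))

    def to_dict(self):
        return {field: getattr(self, field) for field in field_names}

    # type(name, bases, namespace) creates a new class object on the fly.
    return type(name, (), {
        "__init__": __init__,
        "to_dict": to_dict,
        "FIELDS": field_names,
    })


Person = make_message_class("Person", ["name", "email"])
p = Person(name="Ada", email="ada@example.com")
print(p.to_dict())
```

The trade-off is that `Person` does not exist until the program runs, so a static type checker or IDE cannot know its fields ahead of time.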

For a statically compiled language, you don't have the ability to create types at runtime, so it's often much easier to work with and more flexible if the types are generated ahead of time instead of forcing you to work with a more generic interface (e.g. JSON).

If it is being done for performance reasons, then that sounds like a shortcoming of the compiler.

It is not possible to write a compiler smart enough to turn an interpreted template into fine-tuned machine code (in the general case, which is what you're essentially asking for), thanks to Rice's Theorem. You need to write a specialized code generator to accomplish this.

If it is being done to bridge two languages, then that sounds like a lack of interface library.

In the case of Protobuf, the serialization it provides is the interface library.

Can you argue that code generation is a workaround for limitations in a programming language? Absolutely. Serialization is much easier when your language supports reflection. But even when it does, it can be useful to generate code that conforms to a specific protocol used across your organization, as is the use case for Protobufs.


Outside of that, I've found code generation to be extremely useful when working with C and C++. Both have very poor support for metaprogramming, so I've found it to be much simpler to generate C/C++ code with a Python script. Though many things are possible with C++ templates, using them comes at the cost of greatly inflated compile times and cryptic error messages. Writing a code generator gives code that is both faster to compile and easier to understand.
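A small example of the kind of script meant here, as a sketch only: it emits a C source fragment containing a precomputed sine lookup table. The table, its size, and the names `sine_table` and `SINE_TABLE_SIZE` are chosen purely for illustration.

```python
import math


def generate_sine_table_c(size):
    """Return C source text defining a precomputed sine lookup table."""
    values = [math.sin(2 * math.pi * i / size) for i in range(size)]
    lines = [
        "/* GENERATED by a script; do not edit by hand. */",
        f"#define SINE_TABLE_SIZE {size}",
        "static const double sine_table[SINE_TABLE_SIZE] = {",
    ]
    for value in values:
        # 17 significant digits are enough to round-trip a double exactly.
        lines.append(f"    {value:.17g},")
    lines.append("};")
    return "\n".join(lines) + "\n"


c_source = generate_sine_table_c(8)
print(c_source)
```

In practice the output would be written to a header or `.c` file as a build step, and the C compiler would see nothing but a plain constant array.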

Beefster