40

So, in my efforts to write a program to conjugate verbs (algorithmically, not through a dataset) for French, I've come across a slight problem.

The algorithm to conjugate the verbs is actually fairly simple for the 17-or-so cases of verbs, and runs on a particular pattern for each case; thus, the conjugation suffixes for these 17 classes are static and will (very likely) not change any time soon. For example:

// Verbs #1 : (model: "chanter")
    terminations = {
        ind_imp: ["ais", "ais", "ait", "ions", "iez", "aient"],
        ind_pre: ["e", "es", "e", "ons", "ez", "ent"],
        ind_fut: ["erai", "eras", "era", "erons", "erez", "eront"],
        participle: ["é", "ant"]
    };

These are inflectional suffixes for the most common class of verb in French.
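
To make the pattern concrete, here is a minimal sketch of how such a termination table might be applied. It assumes the stem is simply the infinitive minus its final "er" and that the six slots are ordered je, tu, il/elle, nous, vous, ils/elles; the names `CHANTER_TERMINATIONS` and `conjugateRegular` are invented for illustration.

```javascript
// Hypothetical sketch: terminations for class #1 (model: "chanter"), copied from above.
const CHANTER_TERMINATIONS = Object.freeze({
    ind_imp: ["ais", "ais", "ait", "ions", "iez", "aient"],
    ind_pre: ["e", "es", "e", "ons", "ez", "ent"],
    ind_fut: ["erai", "eras", "era", "erons", "erez", "eront"],
    participle: ["é", "ant"]
});

// Assumption: the stem of a class #1 verb is the infinitive without "er".
function conjugateRegular(infinitive, tense) {
    if (!infinitive.endsWith("er")) {
        throw new Error(infinitive + " does not end in -er");
    }
    const terminations = CHANTER_TERMINATIONS[tense];
    if (terminations === undefined) {
        throw new Error("unknown tense: " + tense);
    }
    const stem = infinitive.slice(0, -2);
    return terminations.map((t) => stem + t);
}

console.log(conjugateRegular("chanter", "ind_pre").join(", "));
// chante, chantes, chante, chantons, chantez, chantent
```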

There are other classes of verbs (irregulars), whose conjugations will also very likely remain static for the next century or two. Since they're irregular, their complete conjugations must be statically included, because they can't reliably be conjugated from a pattern (there are also only [by my count] 32 irregulars). For example:

// "être":
    forms = {
        ind_imp: ["étais", "étais", "était", "étions", "étiez", "étaient"],
        ind_pre: ["suis", "es", "est", "sommes", "êtes", "sont"],
        ind_fut: ["serai", "seras", "sera", "serons", "serez", "seront"],
        participle: ["été", "étant"]
    };
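
A rough sketch of how the two kinds of data could work together: look the verb up in the hard-coded irregular table first, and only fall back to a pattern when it is not there. The names `IRREGULAR_FORMS` and `conjugate`, and the idea of passing the pattern function in as a parameter, are assumptions made for the example.

```javascript
// Hypothetical sketch: full table for "être", copied from above.
const IRREGULAR_FORMS = new Map([
    ["être", {
        ind_imp: ["étais", "étais", "était", "étions", "étiez", "étaient"],
        ind_pre: ["suis", "es", "est", "sommes", "êtes", "sont"],
        ind_fut: ["serai", "seras", "sera", "serons", "serez", "seront"],
        participle: ["été", "étant"]
    }]
]);

// Assumption: conjugateByPattern(infinitive, tense) handles every regular class.
function conjugate(infinitive, tense, conjugateByPattern) {
    const forms = IRREGULAR_FORMS.get(infinitive);
    if (forms !== undefined) {
        // Return a copy so callers cannot modify the static table.
        return forms[tense].slice();
    }
    return conjugateByPattern(infinitive, tense);
}

console.log(conjugate("être", "ind_pre", () => []).join(", "));
// suis, es, est, sommes, êtes, sont
```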

I could put all this into XML or even JSON and deserialize it when it needs to be used, but is there a point? These strings are part of natural language, which does change, but at a slow rate.

My concern is that by doing things the "right" way and deserializing some data source, I've not only complicated the problem which doesn't need to be complicated, but I've also completely back-tracked on the whole goal of the algorithmic approach: to not use a data source! In C#, I could just create a class under namespace Verb.Conjugation (e.g. class Irregular) to house these strings in an enumerated type or something, instead of stuffing them into XML and creating a class IrregularVerbDeserializer.

So the question: is it appropriate to hard-code strings that are very unlikely to change during the lifetime of an application? Of course I can't guarantee 100% that they won't change, but the risk vs cost is almost trivial to weigh in my eyes - hardcoding is the better idea here.

Edit: The proposed duplicate asks how to store a large number of static strings, while my question is when should I hard-code these static strings.

Chris Cirefice
  • 26
    Might you want to use this software for a different language other than French in the future? – Snowman Jun 01 '15 at 19:48
  • @Snowman Of course, I imagine that I would also implement it for Japanese as I learn more about the language, probably Russian as well. Of course these languages don't necessarily have the same grammar/conjugation patterns. That said, each language would have its own namespace in the library to handle such things. I don't know enough about Japanese or Russian to know all of the types of conjugation patterns/exceptions; this is just for French so far, and the fact that there are (relatively) few strings to hard-code for natural language makes me think that it's not a negative thing at all. – Chris Cirefice Jun 01 '15 at 19:52
  • 10
    Algorithmic approach or not, it's clear that you simply have to hardcode these 32*20 strings (and more when you add more languages), and the only real question is where to put them. I would pick wherever feels most convenient to you, which sounds like it'd be in code for the time being. You can always shuffle them around later. – Ixrec Jun 01 '15 at 20:08
  • @Ixrec True; I was thinking of just putting them into a data class under `French.Verb.Irregular` and `French.Verb.Conjugation.TerminationPatterns`. Perhaps I need to learn more about the language, but as this is not even close to enterprise software and would be maintained only by me, I don't think it's that big of an issue (yet). – Chris Cirefice Jun 01 '15 at 20:11
  • 1
    @ChrisCirefice That sounds pretty much optimal to me. Go for it. – Ixrec Jun 01 '15 at 20:20
  • How do you plan to handle things like "commencer" or "manger"? If I recall, "vous commenciez" and "vous mangiez", but "je commençais" and "je mangeais". Do you consider those "irregular" or do you have special rules for stems ending in "c" or "g"? – supercat Jun 01 '15 at 21:05
  • @supercat *commencer* falls under the same category of verbs as *placer*, which have the `ç` that you're talking about (these are class #9 in my list). *manger* falls under class #10. These are 2 of the 17 patterned types whose terminations I'm talking about hard-coding. Basically, the stem is found through taking the substring from some ending: *commencer* --> remove the "cer" ending and replace with the terminations from whatever verb mode/tense is necessary. *manger* --> remove the "er" ending and do the same thing :) – Chris Cirefice Jun 01 '15 at 21:10
  • possible duplicate of [Practical way to store a "reasonably large" amount of data that hardly ever changes?](http://programmers.stackexchange.com/questions/112817/practical-way-to-store-a-reasonably-large-amount-of-data-that-hardly-ever-chan) – gnat Jun 01 '15 at 21:56
  • see also: [Is it ever a good idea to hardcode values into our applications?](http://programmers.stackexchange.com/questions/67982/is-it-ever-a-good-idea-to-hardcode-values-into-our-applications) – gnat Jun 01 '15 at 21:56
  • @dyesdyes, the OP explicitly mentions irregular verbs as a separate issue in the question. You are completely missing the point of the question! – Vicky Jun 02 '15 at 09:19
  • _"...will never change."_ This amuses me greatly. Code is transient. Always assume it will change. – Gusdor Jun 03 '15 at 09:19
  • 2
    @Gusdor I don't think you read clearly - I said the *conjugation patterns* will likely never change, or change so infrequently that a recompile every 100 years or so would be fine. Of course the code will change, but once the strings are in there how I want them, unless I'm refactoring they'll be static for the next 100 years. – Chris Cirefice Jun 03 '15 at 14:19
  • 1
    +1. Not to mention that in 60-100 years the code will either not exist, or have been replaced by a better version altogether. – HarryCBurn Jun 03 '15 at 14:28
  • Just a fact about French: even with the irregular verbs, an analysis with multiple stems per verb + regular suffixes is possible for most cases (like the imperfect, for example). I don't know if it's simpler to just put in all forms, but there is in fact a way to conjugate them by putting together a smaller number of parts: for example, "etre" has the imperfect stem "ét" + the regular imperfect endings. https://en.wikipedia.org/wiki/French_conjugation – sumelic Jun 03 '15 at 15:17
  • @sumelic That may be true, and I've whittled it down as far as I could to 17 regular patterns and 32 irregular verbs. Some of the irregular verbs share certain stems, and certain suffixes, but it would over-complicate things to try to write a pattern rule for those few exceptional cases in the irregular verbs instead of just explicitly typing out their full table once and only once ;) – Chris Cirefice Jun 03 '15 at 15:23
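
The stem rule described in the comments for *commencer* and *manger* could be sketched roughly as follows. The class numbers come from the comment above; the exact termination lists, the `match`/`strip` fields and the function name `imperfect` are assumptions about how such a table might look, not the asker's actual data.

```javascript
// Hypothetical sketch: each class says which ending it matches, how many
// characters to strip from the infinitive, and which terminations to append.
const IMPERFECT_CLASSES = [
    // class #9 (model: "placer"): strip "cer"; terminations carry "ç" or "c"
    { match: "cer", strip: 3, ind_imp: ["çais", "çais", "çait", "cions", "ciez", "çaient"] },
    // class #10 (model: "manger"): strip "er"; terminations carry the extra "e"
    { match: "ger", strip: 2, ind_imp: ["eais", "eais", "eait", "ions", "iez", "eaient"] },
    // class #1 (model: "chanter"): strip "er"
    { match: "er", strip: 2, ind_imp: ["ais", "ais", "ait", "ions", "iez", "aient"] }
];

function imperfect(infinitive) {
    // The first matching class wins, so the more specific endings are listed first.
    const cls = IMPERFECT_CLASSES.find((c) => infinitive.endsWith(c.match));
    if (cls === undefined) {
        throw new Error("no class for " + infinitive);
    }
    const stem = infinitive.slice(0, infinitive.length - cls.strip);
    return cls.ind_imp.map((t) => stem + t);
}

console.log(imperfect("commencer")[0], imperfect("commencer")[4]); // commençais commenciez
console.log(imperfect("manger")[0], imperfect("manger")[4]);       // mangeais mangiez
```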

6 Answers

57

is it appropriate to hard-code strings that are very unlikely to change during the lifetime of an application? Of course I can't guarantee 100% that they won't change, but the risk vs cost is almost trivial to weigh in my eyes - hardcoding is the better idea here

It looks to me that you answered your own question.

One of the biggest challenges we face is to separate out the stuff that's likely to change from the stuff that won't change. Some people go nuts and dump absolutely everything they can into a config file. Others go to the other extreme and require a recompile for even the most obvious changes.

I'd go with the easiest approach to implement until I found a compelling reason to make it more complicated.

Dan Pichelman
  • Thanks Dan, that's kind of what I figured. Writing an XML schema for this, having another file to keep track of, and having to write an interface to deserialize the data just seemed like overkill considering there just aren't that many strings, and because it's natural language, it's unlikely to change drastically in the next 100 years. Fortunately, nowadays in programming languages we have fancy ways to abstract this raw data behind a nice-looking interface, for example `French.Verb.Irregular.Etre` which would contain the data from my question. I think it works out alright ;) – Chris Cirefice Jun 01 '15 at 20:05
  • 3
    +1 Here from the Ruby camp, I would start out hardcoding stuff and moving it to config as necessary. Don't prematurely over-engineer your project by making things configurable. It just slows you down. – Overbryd Jun 02 '15 at 11:32
  • 2
    Note: some groups have a different definition of "hardcoding," so be aware that that term means multiple things. There is a well recognized anti-pattern where you hard code values into the statements of a function, rather than creating data structures as you have (`if (num == 0xFFD8)`). That example should become something like `if (num == JPEG_MAGIC_NUMBER)` in almost all cases for readability reasons. I just point it out because the word "hardcoding" often raises hairs on people's necks (like mine) because of this alternate meaning of the word. – Cort Ammon Jun 02 '15 at 15:43
  • @CortAmmon JPEG has lots of magic numbers. Surely `JPEG_START_OF_IMAGE_MARKER`? – user253751 Jun 03 '15 at 06:22
  • @immibis Your choice of constant naming is probably better than mine. – Cort Ammon Jun 03 '15 at 14:58
  • Perhaps you could put the strings in an XML file, and *generate* the code from that. – Ross Presser Jun 05 '15 at 19:00
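
The distinction drawn in the comments, between a literal buried in a statement and a named constant, might look like this in practice. The helper name `startsWithJpegMarker` and its byte-array input are assumptions for the sketch; `0xFFD8` and the constant name are the ones mentioned above.

```javascript
// Named constant instead of a bare literal inside the condition.
const JPEG_START_OF_IMAGE_MARKER = 0xFFD8;

// Hypothetical helper: bytes is an array of byte values (0-255).
function startsWithJpegMarker(bytes) {
    if (bytes.length < 2) {
        return false;
    }
    // Combine the first two bytes into one 16-bit big-endian number.
    const firstTwo = (bytes[0] << 8) | bytes[1];
    return firstTwo === JPEG_START_OF_IMAGE_MARKER;
}

console.log(startsWithJpegMarker([0xFF, 0xD8, 0xFF, 0xE0])); // true
console.log(startsWithJpegMarker([0x89, 0x50, 0x4E, 0x47])); // false
```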
25

You are reasoning at the wrong scope.

You haven't hardcoded only individual verbs. You have hardcoded the language and its rules. This, in turn, means that your application cannot be used for any other language, and cannot be extended with other rules.

If this is your intent (i.e. using it for French only), this is the right approach, because of YAGNI. But you admit yourself that you want to use it later for other languages as well, which will mean that very soon, you'll have to move all the hardcoded part to the configuration files anyway. The remaining question is:

  • Will you, with a certainty close to 100%, extend the app to other languages in the near future? If so, you should export things to JSON or XML files (for words, parts of words, etc.) and to dynamic languages (for rules) right now, instead of forcing yourself to rewrite the major part of your app later.

  • Or is there only a minor probability that the app will be extended somewhere in the future, in which case YAGNI dictates that the simplest approach (the one you're using right now) is the better one?

As an illustration, take Microsoft Word's spelling checker. How many things do you think are hardcoded?

If you are developing a text processor, you could start with a simple spelling engine with hardcoded rules and even hardcoded words: `if word == "musik": suggestSpelling("music");`. Rapidly, you'll start moving words, then rules themselves outside your code. Otherwise:

  • Every time you have to add a word, you have to recompile.
  • If you learn a new rule, you have to change the source code once again.
  • And more importantly, there is no way you can adapt the engine to German or Japanese without writing tremendous amounts of code.
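
The move from hardcoded words to words-as-data described above could be sketched like this. The function names and the second correction pair are made up for illustration, and the JSON string stands in for whatever external file the words would really live in.

```javascript
// Hardcoded in the statement itself: every new word means editing this function.
function suggestHardcoded(word) {
    if (word === "musik") {
        return "music";
    }
    return null;
}

// Words moved into data: the function no longer changes when words are added.
function makeSuggester(corrections) {
    const table = new Map(Object.entries(corrections));
    return function suggest(word) {
        return table.has(word) ? table.get(word) : null;
    };
}

// Assumption: in a real application this JSON would be read from a file.
const suggest = makeSuggester(JSON.parse('{"musik": "music", "teh": "the"}'));

console.log(suggestHardcoded("musik")); // music
console.log(suggest("teh"));            // the
```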

As you highlighted yourself:

Very few rules from French could be applied to Japanese.

As soon as you hardcode the rules for a language, every other one will require more and more code, especially given the complexity of natural languages.

Another subject is how you express those different rules, if not through code. Ultimately, you may find that a programming language is the best tool for that. In that case, if you need to extend the engine without recompiling it, dynamic languages may be a good alternative.

Nathan Tuggy
Arseni Mourzenko
  • I'm not so sure that this is a problem as you describe it; I plan to namespace each language separately. Unfortunately natural languages are things that vary widely in their structure, grammar, etc., so very few rules from French could be applied to Japanese. Maybe there's something about C# that I'm missing - Abstract classes and inheriting languages from some interface about conjugation patterns, I don't know. The thing is, whether or not I add more languages is almost a non-issue, because the implementation of conjugations can't really be shared anyway. – Chris Cirefice Jun 01 '15 at 20:19
  • @ChrisCirefice: do you think Microsoft Word's spelling engine has everything hardcoded in it? Your case is essentially the same. Yes, grammar and structure vary; this makes it even more important to move this part outside your code. – Arseni Mourzenko Jun 01 '15 at 20:33
  • 1
    Well of course not everything is hard-coded there :P so I guess it really comes down to determining what I want the interface to look like so that I can apply it to multiple languages. The problem is that I don't know all the languages well enough yet, so that's effectively impossible. I think the one thing that maybe *you're* missing is that conjugation patterns (this is all I'm talking about) are very static in a language, and it truly is something that is case-by-case. There are around 17 conjugation patterns in French for verbs. It's not going to extend any time soon... – Chris Cirefice Jun 01 '15 at 20:40
  • @ChrisCirefice: as I said in my answer, if there is only a "minor probability that the app will be extended somewhere in the future" [to other languages], then hardcoding those 17 conjugation patterns is the right thing to do. – Arseni Mourzenko Jun 01 '15 at 20:44
  • 4
    I disagree - I don't think it makes sense to move anything outside of the code before it comes naturally through refactoring. Start with one language, add others - at some point implementation of ILanguageRule will share enough code that it's just more efficient to have a single implementation parameterized with an XML (or other file). But even then you might end up with Japanese which has a completely different structure. Starting by pushing your interface to be XML (or similar) is only begging for having changes in the interface rather than the implementation. – ptyx Jun 01 '15 at 20:46
  • As @ptyx suggested in his answer, writing these rules in some grammar (that I would have to later parse from JSON or some other source) would cause a significant increase in complexity. It's kind of like writing a domain-specific language for each language's conjugation patterns, however the conjugation patterns/rules will likely need to only be written once (revisions to algorithms aside). French isn't going to randomly add a future-imperfectly-past-present tense, so hard-coding these rules (namespacing appropriately) seems to be the right thing to do... – Chris Cirefice Jun 01 '15 at 20:57
  • @ptyx: you highlight an important point. I hadn't specified how the rules are stored, which led readers to think that they should be in XML or JSON (I don't know why, since I never mentioned either of the two). The answer is now edited and should be clearer. Where I disagree with you is that your approach has the drawback of forcing a recompile every time a rule is added or changed. – Arseni Mourzenko Jun 01 '15 at 21:03
  • @MainMa I understand your point now about being forced to recompile on a small change, and yes that could be a significant issue if this were just about any other use-case. The good thing about this, and why I asked the question, is that this particular use case is hard-coding a very small set of rules and strings to enforce those rules that are very unlikely to ever *need* updating. Unless I royally screw something up and mess up a conjugation pattern on release (unlikely thanks to unit tests), then I would simply patch and move on. I do thank you for your answer though, as it is great advice! – Chris Cirefice Jun 01 '15 at 21:13
  • 1
    @MainMa: the spell checker example does not fit well, because for a spell checker the word lists change often. In MS Word, even the user can extend the word lists during proof reading. The OP's case seems to be pretty different from that. – Doc Brown Jun 01 '15 at 21:16
  • @MainMa agreed about the inconvenience of recompile if in Java. It needs to be weighed. The dynamic language is a good idea - but probably overkill as well to start with unless requirements explicitly spell out 'you can change rules without a new release'. – ptyx Jun 01 '15 at 22:07
  • 2
    Note: if you want to add more languages, that doesn't imply moving the languages to a config file! You could equally well have a `LanguageProcessor` class with multiple subclasses. (Effectively, the "config file" is actually a class) – user253751 Jun 01 '15 at 22:35
  • @immibis That's more or less what I was thinking. For any given language, the sets of strings for conjugation terminations and full conjugation tables for irregular verbs is actually very small, and very unlikely to change in the near future, thus why I was thinking of saving time and effort and just storing the strings directly in code (in some sort of data class per language), nested under the appropriate namespace. – Chris Cirefice Jun 02 '15 at 12:34
  • 2
    @MainMa: Why do you consider it a problem to recompile when adding a word? You have to recompile when doing any other change to the code anyway, and the list of words is probably the part of the code which is *least* likely to change over time. – JacquesB Jun 02 '15 at 13:48
  • 3
    I suspect that the flexibility of being able to code the very specific grammatical rules of each language in a subclass would be more convenient in the end than the ability to load those same rules somehow from a config file (because you'd basically be writing your own programming language to interpret the configurations). – David K Jun 02 '15 at 18:18
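
The "one class per language" idea raised in the comments could look roughly like the sketch below. `LanguageProcessor` is the name used above; the method name `presentTense`, the two subclasses and their deliberately tiny rule sets are assumptions for illustration only.

```javascript
// Hypothetical base class: one subclass per language, each holding its own hard-coded data.
class LanguageProcessor {
    // Returns the six present-tense forms; subclasses must override.
    presentTense(infinitive) {
        throw new Error("presentTense is not implemented");
    }
}

class FrenchProcessor extends LanguageProcessor {
    presentTense(infinitive) {
        // Only class #1 (model: "chanter") is covered in this sketch.
        const terminations = ["e", "es", "e", "ons", "ez", "ent"];
        const stem = infinitive.slice(0, -2);
        return terminations.map((t) => stem + t);
    }
}

class EnglishProcessor extends LanguageProcessor {
    presentTense(infinitive) {
        // Deliberately naive: only the third person singular differs, by adding "s".
        return ["", "", "s", "", "", ""].map((t) => infinitive + t);
    }
}

// Effectively, the "config file" is a class per language.
const processors = { fr: new FrenchProcessor(), en: new EnglishProcessor() };
console.log(processors.fr.presentTense("chanter")[3]); // chantons
console.log(processors.en.presentTense("walk")[2]);    // walks
```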
15

Strings should be extracted to a configuration file or database when the values could change independently from the program logic.

For example:

  • Extracting UI texts to resource files. This allows a non-programmer to edit and proof-read the texts, and it allows adding new languages by adding new localized resource files.

  • Extracting connection strings, urls to external services etc. to configuration files. This allows you to use different configurations in different environments, and to change the configurations on the fly because they may need to change for reasons external to your application.

  • A spell checker which has a dictionary of words to check against. You can add new words and languages without modifying the program logic.

But there is also a complexity overhead with extracting to configuration, and it doesn't always make sense.

Strings may be hardcoded when the actual string cannot change without changing the program logic.

Examples:

  • A compiler for a programming language. The keywords are not extracted to a configuration, since each keyword has specific semantics which have to be supported by code in the compiler. Adding a new keyword will always require code changes, so there is no value in extracting the strings to a configuration file.
  • Implementing a protocol: e.g. an HTTP client will have hardcoded strings like "GET", "content-type" etc. Here the strings are part of the specification of the protocol, so they are the parts of the code least likely to change.
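
As a small illustration of the protocol case, here is a sketch in which the strings fixed by the HTTP specification sit directly in code as constants. The function names and the simplified request format are assumptions; a real client would handle far more than this.

```javascript
// Protocol tokens defined by the HTTP specification: hard-coded on purpose.
const METHOD_GET = "GET";
const HEADER_CONTENT_TYPE = "content-type";

// Hypothetical helper: builds a minimal HTTP/1.1 GET request.
function buildGetRequest(host, path) {
    return METHOD_GET + " " + path + " HTTP/1.1\r\n" +
        "Host: " + host + "\r\n\r\n";
}

// Hypothetical helper: finds the Content-Type value in raw header lines.
// Header names are compared case-insensitively.
function findContentType(headerLines) {
    for (const line of headerLines) {
        const colon = line.indexOf(":");
        if (colon === -1) {
            continue;
        }
        const name = line.slice(0, colon).trim().toLowerCase();
        if (name === HEADER_CONTENT_TYPE) {
            return line.slice(colon + 1).trim();
        }
    }
    return null;
}

console.log(findContentType(["Server: demo", "Content-Type: text/html"])); // text/html
```

Changing either constant would only make sense together with a change to the protocol-handling logic itself, which is the point made above.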

In your case I think it is clear that the words are an integrated part of the program logic (since you are building a conjugator with specific rules for specific words), and extracting these words to an external file has no value.

If you add a new language you will need to add new code anyway, since each language has specific conjugation logic.


Some have suggested that you could add some kind of rule engine which allows you to specify conjugation rules for arbitrary languages, so new languages could be added purely by configuration. Think very carefully before you go down that road, because human languages are wonderfully weird, so you need a very expressive rule engine. You would basically be inventing a new programming language (a conjugation DSL) for dubious benefit. But you already have a programming language at your disposal which can do anything you need. In any case, YAGNI.

JacquesB
  • 1
    Actually, in a comment to MainMa, I mentioned that writing a DSL for this would be pointless because very few natural languages are similar enough to make it worth the effort. Maybe French/Spanish/Italian would be *close enough*, but not really worth the extra effort considering that the set of rules is highly static in any given language. The other points you mention about complexity were my exact worries, and I think you wonderfully understood what I was asking in my question and gave a great answer with examples, so +1! – Chris Cirefice Jun 02 '15 at 12:38
5

I agree 100% with Dan Pichelman's answer, but I would like to add one thing. The question you should ask yourself here is "who is going to maintain/extend/correct the word list?". If it is always the person who also maintains the rules of a specific language (the particular developer, I guess you), then there is no point in using an external configuration file if this makes things more complex - you will not get any benefits from this. From this point of view, it will make sense to hardcode such word lists even if you have to change them from time to time, as long as it is sufficient to deliver a new list as part of a new version.

(On the other hand, if there is a slight chance someone else must be able to maintain the list in the future, or if you need to change the word lists without deploying a new version of your application, then use a separate file.)

Doc Brown
  • This is a good point - however, it is very likely that I will be the only person actually maintaining the code, at least for the next few years. The good part about this is that even though the strings will be hard-coded, it is a very small set of strings/rules that are unlikely to change anytime soon (as it's natural language, which doesn't evolve too much year-to-year). That said, the conjugation rules, verb termination strings, etc. will in all likelihood be the same for our lifetime :) – Chris Cirefice Jun 01 '15 at 21:34
  • 1
    @ChrisCirefice: exactly my point. – Doc Brown Jun 01 '15 at 21:35
2

Even while hardcoding seems fine here, and better than dynamically loading config files, I still would recommend that you do strictly separate your data (the dictionary of verbs) from the algorithm. You can compile them right into your application in the build process.

This will save you a lot of hassle with maintenance of the list. In your VCS you can easily identify whether a commit changed the algorithm or just fixed a conjugation bug. Also, the list might need to be extended in the future for cases you didn't consider. In particular, the count of 32 irregular verbs doesn't seem to be exact. While those seem to cover the commonly used ones, I found references to 133 or even 350 of them.

Bergi
  • Bergi, I did plan on separating the data from the algorithm. What you note about the French irregulars - the definition of *irregular* is misunderstood at best. What I mean when I say *irregular* are the verbs that cannot be 'calculated', or conjugated from their infinitive form alone. The irregular verbs have no particular pattern at all, and thus need to have their conjugations explicitly listed (under `French.Verb.Conjugation.Irregular` for instance). Technically, -ir verbs are 'irregular', but they actually have a fixed conjugation pattern :) – Chris Cirefice Jun 01 '15 at 21:18
0

The important part is separation of concerns. How you achieve that is less relevant, i.e. Java is fine.

Independently of how the rules are expressed, should you need to add a language or change a rule: how much code and how many files do you have to edit?

Ideally, adding a new language should be possible by adding either an 'english.xml' file or a new 'EnglishRules implements ILanguageRules' object. A text file (JSON/XML) gives you an advantage if you want to change it outside of your build lifecycle, but requires a complex grammar and parsing, and will be harder to debug. A code file (Java) lets you express complex rules in a simpler way, but requires a rebuild.

I would start with a simple Java API behind a clean language agnostic interface - as you need that in both cases. You can always add an implementation of that interface backed by an XML file later if you wish, but I don't see the need to tackle that problem immediately (or ever).
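
To illustrate, here is a minimal sketch of one interface with a code-backed implementation and a data-backed one added later. It is written in JavaScript to match the snippets in the question rather than in Java, and the "interface" is just a convention: any object with a `conjugate(infinitive, tense)` method. The JSON layout (`ending`, `terminations`) and all names are assumptions.

```javascript
// Code-backed implementation: rules written directly in the host language.
const frenchRules = {
    conjugate(infinitive, tense) {
        // Only class #1 (model: "chanter"), present tense, in this sketch.
        if (tense !== "ind_pre" || !infinitive.endsWith("er")) {
            throw new Error("not covered by this sketch");
        }
        const stem = infinitive.slice(0, -2);
        return ["e", "es", "e", "ons", "ez", "ent"].map((t) => stem + t);
    }
};

// Data-backed implementation: same interface, rules read from a JSON description.
function rulesFromJson(json) {
    const spec = JSON.parse(json);
    return {
        conjugate(infinitive, tense) {
            if (!infinitive.endsWith(spec.ending) || !(tense in spec.terminations)) {
                throw new Error("not covered by this rule file");
            }
            const stem = infinitive.slice(0, infinitive.length - spec.ending.length);
            return spec.terminations[tense].map((t) => stem + t);
        }
    };
}

// Callers depend only on the interface, not on where the rules come from.
function presentTense(rules, infinitive) {
    return rules.conjugate(infinitive, "ind_pre");
}

const jsonRules = rulesFromJson(
    '{"ending": "er", "terminations": {"ind_pre": ["e", "es", "e", "ons", "ez", "ent"]}}'
);
console.log(presentTense(frenchRules, "chanter")[4]); // chantez
console.log(presentTense(jsonRules, "chanter")[4]);   // chantez
```

Both objects can be swapped freely behind `presentTense`, so the file-backed version can wait until there is a real need for it.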

ptyx