1

Backstory (You can skip)

I am writing a pronunciation library for irregular words. Take something like the following:

T1E1s    // tee one E one es | tee one E ones
1994-1995// 1994 (minus|dash|to|) 1995 
19-mm    // 19 dash mmmmmmmmmmmmmm | 19 dash millimeter | 19 dash em em
4x4      // 4 ex 4 | 4 times 4 | 4 by 4

What you see for every word, are the different possible interpretations. Tackling the issue is pretty taxing, but is honestly pretty straight forward. Basically, I parse the word into various types denoted by this enumerator, as such:

    enum StringType
    { // Heirarchical
        Acronymn                  , // "mm","мм"
        Measurement               , // pound

        Number                    , // 0-9

        Character_Times           , // x × X
        Character_Dash            , // -
        Character_ForwardSlash    , // /
        Character_Latin_Lower     , // a
        Character_Latin_Upper     , // A
        Character_Latin_Plural    , // s

        Consanants_Latin_Proper   , // Fff
        Consanants_Latin_Lower    , // fff
        Consanants_Latin_Mixed    , // fFF
        Consanants_Latin_Upper    , // FFF

        Word_Latin_Proper         , // Foo
        Word_Latin_Lower          , // foo
        Word_Latin_Mixed          , // fOO
        Word_Latin_Upper          , // FOO

        Consanants_Cyrillic_Proper, // Fff
        Consanants_Cyrillic_Lower , // fff
        Consanants_Cyrillic_Mixed , // fFF
        Consanants_Cyrillic_Upper , // FFF

        Word_Cyrillic_Proper      , // Фоо
        Word_Cyrillic_Lower       , // фоо
        Word_Cyrillic_Mixed       , // фОО
        Word_Cyrillic_Upper       , // ФОО
    };

thus, a word like 19-mm is parsed like so:

Halt: void mapper(QString) /home/akiva/Programming/Blanket/main.cpp:206
[QStringList] sp.parsedStrings()
QStringList:: QStringList
0   : 19
1   : -
2   : mm
QList<QCD::StringType>:: QList<QCD::StringType>
0   : Number
1   : Character_Dash
2   : Acronymn
QString:: "19-mm"

The taxing part is where I have to tackle each case with its own implementation, and by the end of this, I imagine I will have something like 500 different combinations I will need to program functions for. This is where things get messy, because I do not want this:

     if (string.types() == QList<StringType>() << StringType::Number << StringType::Dash << StringType::Acronymn) { Number_Dash_Acronymn(); }
else if (string.types() == QList<StringType>() << StringType::Number << StringType::Dash << StringType::Number) { Number_Dash_Number(); }
else if (string.types() == QList<StringType>() << StringType::Number << StringType::Times << StringType::Number) { Number_Times_Number(); }
// + 500 additional if else statements

500 if else statements is not acceptable. An enumerator with 500 different values is also disgusting for all the reasons that you can imagine. I had floated the idea of using bitflags, but that ended up being far too limited (32 bits = only 32 parameters). Thus I think I have the best possible approach detailed below:

Actual Issue:

I want something like this:

switch (stringTypes)    
case Number + Dash + Word_Latin_Lower: {
        /* code */
        break;
} case Word_Latin_Lower + Dash + Number: { 
        /* code */
        break;
} default:
        ct_Error("Failed to account for the combination: ", stringTypes);
        break;
}

The obvious issue being that the first two cases have the same value, despite being in a different order. Regardless, if the code was functional, it would be foldable, readable, efficient, and easy to sort. Not to mention, I won't have to touch my header file to add new enumerators or functions, thus drastically helping my compile times.

Thus, how should I give combined enumerators guaranteed unique values, so much so that the order it is given also guarantee a unique value, while still maintaining readability?

Thanks.

Anon
  • 3,565
  • 3
  • 27
  • 45
  • You shouldn't be adding. You should be concatenating. – candied_orange May 13 '19 at 02:28
  • @candied_orange Yeah I know I shouldn't be adding, I just put it there as obvious broken code. As for concatenating, what are you going to concatenate that will work in a switch? `1` `2` `3` will give `123`, but so will `12` `3`. – Anon May 13 '19 at 02:31
  • You haven't made clear why you need 1 2 3 to be different from 12 3 – candied_orange May 13 '19 at 02:50
  • How many cases are there? Does every combination have a specific case? – user253751 May 13 '19 at 03:11
  • @candied_orange Well, take the following two strings: `s40` and `40s`. The first is parsed as `Plural, Number`, the second, `Number, Plural`. If the case is `Plural, Number`, I pronounce it as `es fourty`. If the case is `Number, Plural`, I pronounce it as `fourties`. Now, if the enumerator for `Number` is `2`, and the enumerator for `Plural` is `22`, both cases are going to fire, which is not good. – Anon May 13 '19 at 03:22
  • @immibis well right now, I have about 25 enumerators, meaning that I potentially have up to 25×24×23×22×...×2×1 cases, but I reckon that I will probably end up with something around 500, maybe 1000 cases as I add to it over the years. edit: No wait Im stupid. This isn't a deck of cards. But to the point, I think 500 is a safe bet because an irregular string can parsed up to 6 or 7 components sometimes, each component having 10 or 20 different potential states. – Anon May 13 '19 at 03:25
  • 3
    7 components with 20 states: you could make a bitfield with 7 groups of 5 bits. 4 if the component only has 16 states. If you had 4 components with 20 states and 3 with 16 states it would be exactly 32 bits. But also you'd have 655,360,000 combinations which is why I don't think you will be handling them all individually. – user253751 May 13 '19 at 03:35
  • @immibis I am intrigued, but it seems somewhat convoluted, which would be an issue because the enumerator has to be flexible enough that down the road, I can add different states to the enumerator. For example, I will be adding enumerators for arabic script words, Hebrew script words, Cyrillic, etc. You are right, I will not handle 655,360,000 individually, but it will be set up that whenever a combination hits that has not been accounted for, I will be prompted to program that in. That is how it goes right now for other parts of my work, and usually it ends up being about 500 different cases. – Anon May 13 '19 at 03:49
  • 1
    How about hashing the string concatenation using a 64-bit hash with good distribution (e.g. FNV-1a)? That should make any combination unique with a very, very high probability (and you can still check by stuffing them into a dictionary). – Martin Maat May 13 '19 at 05:19
  • @MartinMaat Just to clarify, is it a hash with function pointer values, or a hash with integral values to use in switch cases? I am not sure if the latter is possible as I am a bit vague on whether it would consider the hash value to be variable or const. As to function pointers if you are suggesting it, the reason I do not like function pointers is because it adds a lot of cruft code. – Anon May 13 '19 at 18:10
  • 1
    @Akiva I elaborated on the thought in an answer. – Martin Maat May 13 '19 at 21:37
  • 1
    Have you considered using separate [bit fields](https://docs.microsoft.com/en-us/cpp/cpp/cpp-bit-fields?view=vs-2019) (perhaps in a struct) instead of a single value? – John Wu May 14 '19 at 02:13
  • @JohnWu I have to read up more on bitfields, but I am definitely open to it. If I understand, it kind of will give me a unique signature of a struct that can be translated into an integral, right? Could that be used in a switch? – Anon May 14 '19 at 03:22
  • 2
    You will have to compute the size based on your actual needs, but I doubt it. If you have `n` bits of information, it requires `n` bits of storage; no getting around that. Personally I find `switch` awkward. A series of `if` blocks (with [early return](https://softwareengineering.stackexchange.com/questions/18454/should-i-return-from-a-function-early-or-use-an-if-statement) of course) is a much clearer way to express flow of control, and it saves you a level of indent. – John Wu May 14 '19 at 07:55

4 Answers4

1

It is not that straightforward but doable nonetheless. First you can create a new enum for your 500 or so string type sequences. To make this readable, eliminate all underscores from your current StringType enum members and use these underscores to separate the parts in your new StringTypeSequence enum. So you could get something like this:

StringTypeSequence
{
    Number_Dash_WordLatinLower,
    WordLatinLower_Dash_Number,
    ...
    et cetera, all the way up to your 500 or so
}

Then you write a parser that picks up the member names from your code file and turns them into hash values. You use these hash values to update your enum:

StringTypeSequence
{
    Number_Dash_WordLatinLower = 0x7C34012B,
    WordLatinLower_Dash_Number = 0x058AAE67,
    ...
    et cetera, all the way up to your 500 or so
}

I just made up some values, the real ones will be different. Anyway...

You can now create your switch statement, casing on the StringTypeSequence members.

You then parse your text and you find a "Number + Dash + WordLatinLower" sequence. You concatenate the enum member names into the string "Number_Dash_WordLatinLower", calculate the hash for it and switch on it. You will end up at the case for enum member Number_Dash_WordLatinLower. If you end up in the default clause you found a new sequence you had not anticipated yet and you throw an exception, presenting the hash for the enum value and case to be added.

I am not sure if your version of C++ allows you to obtain the member name string of an enum value, you may need a CLR type language with reflection capability for that. If you cannot do that, just add an array of strings to your StringType enum that matches the enum members so you can use the enum values as indexes into the array to get to the names.

Martin Maat
  • 18,218
  • 3
  • 30
  • 57
  • `I am not sure if your version of C++ allows you to obtain the member name string of an enum value,` I use Qt, so yes. QMetaEnum and Q_ENUM. `First you can create a new enum for your 500 or so string type sequences.` If I have 500 cases to account for, and I manually make 500 enumerators for these, which is a bad idea because my compile times will shoot through the roof every time I have to add a new instance (given they can't be forward declared), why would I then need to go through the trouble of giving each a unique value? They are all already enumerated. – Anon May 13 '19 at 23:07
  • A big enum is not likely to result in a noticeible compile time increase. You need the fixed values because you need to map string type sequences you find, that are unknown at compile time, to enum members at runtime. – Martin Maat May 13 '19 at 23:17
  • 1
    @MartinMaat a huge number of values in enum won't increase compile times but doing it step by step: add several values, implement them, add several more, repeat is a royal PITA. Also if you use better enums (a library) and your hashing function is constexpr it would be trivial to generate two maps (one each way) at compile time in several lines of code. Although better enums require a wrapper in Qt since they (by design) don't have a default constructor. – jaskij May 13 '19 at 23:52
  • Also your answer would be improved by adding some information on combining hashes. – jaskij May 13 '19 at 23:53
  • @Jan Dorniak It do not think it would be a real PITA, more like a gentle glowy feeling every time you find a new sequence. Besides, this is not different from any solution, it is independent of the science behind it (the possible combinations you choose to handle are a given, what you do with unanticipated sequences is a different problem, how often you want to update is a free choice). Uniqueness of hashes can be easily guaranteed by having the generator stuff them in a dictionary that will bark on the first collision, in which extremely unlikely event you would change a member name slightly. – Martin Maat May 14 '19 at 05:40
  • @Jan Dorniak I do not understand what you mean by combining hashes, as a part to improve the answer on. I admit the whole approach is not very elegant and a bit of a hassle, though solving a problem and yielding efficient and readable code. – Martin Maat May 14 '19 at 05:48
  • @MartinMaat you are right about combining hashes, your approach does not need it since you hash whole strings. And thinking more about this and reading OP's comments: the whole point of the question was to avoid explicitly stating those 500 cases outside that one switch. – jaskij May 14 '19 at 08:26
1

Summary

  1. Use a class to store the fields of StringType
  2. Refactor your 500 methods into command classes and use the chain of responsibility pattern.

Eliminate those 500 ad-hoc checks

If you have 500 functions that could be called, I'd say a bigger problem is the massive amount of code to check for each condition and invoke each function. Even with a switch statement, it's likely to be a bear to read and maintain.

Instead, I'd like to suggest you use the chain of responsibility design pattern so that the list of functions is abstract and you can iterate over it. The next sections show you how.

Define a class to contain the StringType information

First, define a class that can contain all the information you currently require in StringType. Yes, a class, not an integer (which does not have enough storage capacity to handle the possible combinations). For example:

class StringType
{
public:
    StringType() : Acronymn(), Dash(), Number() {}
    bool Acronymn;
    bool Dash;
    bool Number;
};

To use this class, you'd set individual fields instead of setting bits in an enum.

If you are concerned this will take up too much space, you can use bit fields to compact the data structure. I doubt this is really necessary though.

Define an abstraction for checking and handling different string types

Second, write a base class that exposes functions to (1) check if the class can handle a particular StringType, and (2) to handle it.

class StringHandler
{
public:
    virtual bool CanHandle(StringType* stringType) = 0;
    virtual void Execute() = 0;
};

Implement the abstraction

You then implement this class for each possible method call that results from a combination of flags. Here are two examples:

class NumberAcronymnHandler : public StringHandler
{
public:
    bool CanHandle(StringType* stringType) override
    {
        return (stringType->Number && stringType->Acronymn);
    };

    void Execute() override
    {
        Number_Acronymn();  //Here's where you do the work
    }
};

class NumberDashAndAcronymnHandler : public StringHandler
{
public:
    bool CanHandle(StringType* stringType) override
    {
        return (stringType->Number && stringType->Acronymn && stringType->Dash);
    };

    void Execute() override
    {
        Number_Dash_Acronymn();  //Here's where you do the work
    }
};

Eliminate the switch statement by iterating over the command classes

To figure out which class can handle a StringType, iterate over an array of these classes and check each one. This example supports 2 types, but if there were 500, the code would be more or less the same; it's just a loop over an array.

void Example(StringType* stringType)
{
    StringHandler* map[2] = 
    {
        new NumberAcronymnHandler(), 
        new NumberDashAndAcronymnHandler() 
    };

    for (int i=0; i<2; i++)
    {
        StringHandler* handler = map[i];
        if (handler->CanHandle(stringType))
        {
            handler->Execute();
        }
    }
};

This solves the issue with your enumerated type, and also adds maintainability and structure to your 500+ methods.

John Wu
  • 26,032
  • 10
  • 63
  • 84
0

You are right in thinking that 500 if-else conditions is unreadable. Same thing for 500 switch cases.

The first thing to consider is whether having 500 different functions actually makes sense. Are those functions all very different or will they contain a lot of duplication? If so, how can you factor out this duplication? You should aim to find a canonical representation of these functions, i.e. only write the data/code that differentiates it from all other functions.

To structure actually picking the right function to call, you simply build a lookup table (hash table, if possible) linking the sequence to a callback function.

D Drmmr
  • 290
  • 2
  • 6
  • `The first thing to consider is whether having 500 different functions actually makes sense.` Considered, and yes. They will all be very different, and trying to deduplicate them will require more assumptions upon my part, and has led to many embarrassing errors. `world war eye`, `1993 minus 94`, `37 dash millimeter`. Its language, and its so fickle, that even a human reader is often tripped up by how people write things. – Anon May 13 '19 at 22:56
0

Inspired by Martin Maat's answer but using number hashing instead of string hashing.

Disclaimer: There is one downside to all of this solutions: they use hashes. And as is usual with hashes: while improbable collisions are a possibility. So it might end up that you will and up with a collision where two combinations resolve to the same hash and you have to treat it as a special case. This is a nasty bug waiting to happen so document this behaviour well if you do end up using this.

Solution 1:

// you need qHash() overload for QCD::StringType
uint qHash(QCD::StringType st)
{
    return qHash(static_cast<std::underlying_type<QCD::StringType>::type>(st));
}

// use like this
switch(qHash(stringTypes): {
case qHash(QList<StringType>{ Number, Dash, Word_Latin_Lower }): {
    /* code */
    break;
}
case qHash(QList<StringType>{ Word_Latin_Lower, Dash, Number }): {
    /* code */
    break;
}
default:
    ct_Error("Failed to account for the combination: ", stringTypes);
    break;
}

Solution 2:

This one has a bit more syntactic sugar

// you need qHash() overload for QCD::StringType
uint qHash(QCD::StringType st)
{
    return qHash(static_cast<std::underlying_type<QCD::StringType>::type>(st));
}

// this is pureliy syntactic sugar
uint StringTypeListHash(const QList<QCD::StringType>& list)
{
    return qHash(list);
}

// for this usage
switch(StringTypeListHash(stringTypes): {
case StringTypeListHash({ Number, Dash, Word_Latin_Lower }): {
    /* code */
    break;
}
case StringTypeListHash({ Word_Latin_Lower, Dash, Number }): {
    /* code */
    break;
}
default:
    ct_Error("Failed to account for the combination: ", stringTypes);
    break;
}

Solution 3:

Operator overloading. Very pretty code, should work. I chose the xor operator because it is most commonly associated with combining hashes.

I'm not 100% sure this won't have unintended sideffects if QCD::StringType is an enum instead of enum class. As usual - take caution with operator overloading and don't blindly use code someone on the internet whipped up without even trying to compile it.

// you still need qHash() overload for QCD::StringType
uint qHash(QCD::StringType st)
{
    return qHash(static_cast<std::underlying_type<QCD::StringType>::type>(st));
}

// here comes the magic
namespace QCD {

    // this is basically the implementation of boost::hash_combine
    // taken from: https://stackoverflow.com/a/27952689/6178613
    uint hash_combine( uint lhs, uint rhs ) {
        lhs^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2);
        return lhs;
    }

    size_t operator^(StringType lhs, StringType rhs) {
        using enum_int_type = std::underlying_type<StringType>::type
        std::size_t hash = hash_combine(qHash(static_cast<enum_int_type>(lhs)), qHash(static_cast<enum_int_type>(rhs)));
    }

    size_t operator^(StringType lhs, size_t rhs) {
        using enum_int_type = std::underlying_type<StringType>::type
        std::size_t hash = hash_combine(qHash(static_cast<enum_int_type>(lhs)), rhs);
    }

    size_t operator^(size_t lhs, StringType rhs) {
        using enum_int_type = std::underlying_type<StringType>::type
        std::size_t hash = hash_combine(lhs, qHash(static_cast<enum_int_type>(rhs)));
    }
}

// usage
switch(qHash(stringTypes): {
case Number ^ Dash ^ Word_Latin_Lower: {
    /* code */
    break;
}
case Word_Latin_Lower ^ Number ^ Dash: {
    /* code */
    break;
}
default:
    ct_Error("Failed to account for the combination: ", stringTypes);
    break;
}
jaskij
  • 575
  • 2
  • 8