94

I've seen a number of questions, like this, asking for advice on how to store enums in DB. But I wonder why would you do that. So let's say that I have an entity Person with a gender field, and a Gender enum. Then, my person table has a column gender.

Besides the obvious reason of enforcing correctness, I don't see why I would create an extra table gender to map what I already have in my application. And I don't really like having that duplication.

Thomas Owens
  • 79,623
  • 18
  • 192
  • 283
user3748908
  • 1,607
  • 2
  • 14
  • 15
  • 10
    Possible duplicate of [Is it wasteful to create a new database table instead of using enum data type?](http://programmers.stackexchange.com/questions/298472/is-it-wasteful-to-create-a-new-database-table-instead-of-using-enum-data-type) – Doc Brown Dec 15 '15 at 16:01
  • 1
    Where else would you store data that may change regularly? While you may have thought of all the options what if someone comes along and wants to add a new option. Are you ready to tweak that hard-coded list? Someone may want to give their gender as something other than male or female, e.g. intersex for example. – JB King Dec 15 '15 at 16:08
  • 5
    @JBKing ... just look at Facebook's gender list. –  Dec 15 '15 at 16:35
  • Somehow related: http://programmers.stackexchange.com/q/287398/57759 + http://programmers.stackexchange.com/q/287433/57759 ? – trejder Dec 15 '15 at 20:24
  • @DavorŽdralo I *will* point out that for gender, the system that I am currently working with has a more complex definition than the simple binary designation... and I'm about as far as the tumblerites as one can get. While its not in the scores that Facebook has, it is a *very* important issue that needs to be incorporated in the system. –  Dec 15 '15 at 20:47
  • @MichaelT - it may be an issue in, say, a financial system. It sure as hell is not an issue when it comes to gender. – Davor Ždralo Dec 15 '15 at 20:56
  • 17
    Can't help but link the [classical text on problems of gender representation in a database](http://qntm.org/gay). – 9000 Dec 15 '15 at 21:17
  • 3
    If your customers are "deluded Tumblrites", then you damn well create a database schema that lets you create something that serves their needs, at least, if you intend to remain in business. – Gort the Robot Dec 16 '15 at 00:20
  • I have summarized the answers and provided my perspective on this here https://stackoverflow.com/a/76439681/12589591 – Hem Bhagat Jun 09 '23 at 11:15

8 Answers8

89

Let's take another example that is less fraught with conceptions and expectations. I've got an enum here, and it is the set of priorities for a bug.

What value are you storing in the database?

So, I could be storing 'C', 'H', 'M', and 'L' in the database. Or 'HIGH' and so on. This has the problem of stringly-typed data. There's a known set of valid values, and if you aren't storing that set in the database, it can be difficult to work with.

Why are you storing the data in the code?

You've got List<String> priorities = {'CRITICAL', 'HIGH', 'MEDIUM', 'LOW'}; or something to that effect in the code. It means that you've got various mappings of this data to the proper format (you're inserting all caps into the database, but you're displaying it as Critical). Your code is now also difficult to localize. You have bound the database representation of the idea to a string that is stored in the code.

Anywhere you need to access this list, you either need to have code duplication or a class with a bunch of constants. Neither of which are good options. One should also not forget that there are other applications that may use this data (which may be written in other languages - the Java web application has a Crystal Reports reporting system used and a Perl batch job feeding data into it). The reporting engine would need to know the valid list of data (what happens if there's nothing marked in 'LOW' priority and you need to know that that is a valid priority for the report?), and the batch job would have the information about what the valid values are.

Hypothetically, you might say "we're a single-language shop - everything is written in Java" and have a single .jar that contains this information - but now it means that your applications are tightly coupled to each other and that .jar containing the data. You'll need to release the reporting part and the batch update part along with the web application each time there is a change - and hope that that release goes smoothly for all parts.

What happens when your boss wants another priority?

Your boss came by today. There's a new priority - CEO. Now you have to go and change all the code and do a recompile and redeploy.

With an 'enum-in-the-table' approach, you update the enum list to have a new priority. All the code that gets the list pulls it from the database.

Data rarely stands alone

With priorities, the data keys into other tables that might contain information about workflows, or who can set this priority or whatnot.

Going back to the gender as mentioned in the question for a bit: Gender has a link to the pronouns in use: he/his/him and she/hers/her... and you want to avoid hard coding that into the code itself. And then your boss comes by and you need to add you've got the 'OTHER' gender (to keep it simple) and you need to relate this gender to they/their/them... and your boss sees what Facebook has and... well, yeah.

By restricting yourself to a stringly-typed bit of data rather than an enum table, you've now needed to replicate that string in a bunch of other tables to maintain this relationship between the data and its other bits.

What about other datastores?

No matter where you store this, the same principle exists.

  • You could have a file, priorities.prop, that has the list of priorities. You read this list in from a property file.

  • You could have a document store database (like CouchDB) that has an entry for enums (and then write a validation function in JavaScript):

      {
         "_id": "c18b0756c3c08d8fceb5bcddd60006f4",
         "_rev": "1-c89f76e36b740e9b899a4bffab44e1c2",
         "priorities": [ "critical", "high", "medium", "low" ],
         "severities": [ "blocker", "bad", "annoying", "cosmetic" ]
      }
    
  • You could have an XML file with a bit of a schema:

      <xs:element name="priority" type="priorityType"/>
    
      <xs:simpleType name="priorityType">
        <xs:restriction base="xs:string">
          <xs:enumeration value="critical"/>
          <xs:enumeration value="high"/>
          <xs:enumeration value="medium"/>
          <xs:enumeration value="low"/>
        </xs:restriction>
      </xs:simpleType>
    

The core idea is the same. The data store itself is where the list of valid values needs to be stored and enforced. By placing it here, it is easier to reason about the code and the data. You don't have to worry about defensively checking what you have each time (is it upper case? or lower? Why is there a critical type in this column? etc...) because you know what you are getting back from the datastore is exactly what the datastore is expecting you to send otherwise - and you can query the datastore for a list of valid values.

The takeaway

The set of valid values is data, not code. You do need to strive for DRY code - but the issue of duplication is that you are duplicating the data in the code, rather than respecting its place as data and storing it in a database.

It makes writing multiple applications against the datastore easier and avoids having instances where you will need to deploy everything that is tightly coupled to the data itself - because you haven't coupled your code to the data.

It makes testing applications easier because you don't have to retest the entire application when the CEO priority is added - because you don't have any code that cares about the actual value of the priority.

Being able to reason about the code and the data independently from each other makes it easier to find and fix bugs when doing maintenance.

Glorfindel
  • 3,137
  • 6
  • 25
  • 33
  • One additional facet to code duplication is that it has been known for one database to serve multiple programs written in different languages :-) Then no kind of defining the enum in the code will prevent code duplication, although I suppose you could auto-generate code for each language from a common file. – Steve Jessop Dec 15 '15 at 18:38
  • 10
    If you can add an enum value to your code without having to change *any* logic (and lest it be the localized display of it), I doubt the necessity for the additional enum value in the first place. And while I'm old enough to value the ability to easily query database backups with simple SQL queries to analyse a problem, with ORMs these days you can do very well without having to look at the underlying database at all. I don't understand the point about localisation (pronouns) here though - that stuff certainly shouldn't be in a database, but resource files of some kind I'd say. – Voo Dec 15 '15 at 20:30
  • 1
    @Voo the pronouns is an example of *other* data related to this enumesque value. Without the data being in a table, the stringly typed values would need to be there with no proper FK constraints. If you have pronouns (like this) in a resource file, you've got coupling between the database and the file (update the database and redeploy the file). Consider the enums of [redmine](http://www.redmine.org/projects/redmine/wiki/RedmineEnumerations) which are modifiable via admin interface on the fly without having to do a redeploy. –  Dec 15 '15 at 20:41
  • 2
    ... remember also that databases are a polyglot data store. If you are requiring the validation to be done as part of the ORM in one language, you have made it necessary to duplicate that validation in any other language you use (I recently worked with a Java front end that had Python pushing data into the database - the Java ORM and the Python systems have to agree on things - and that agreement (the valid types) was most easily implemented by having the database enforce it with a 'enum' table.). –  Dec 15 '15 at 20:44
  • @Michael Cross-platform access to the DB sounds like a good reason for such an implementation, even if you have only a small fixed set of values, yep. The redmine enums on first glance (not familiar with it) do not look like what I'd usually consider an enum in the programming language sense. If you want to extend them during runtime or allow users to specify different types, it makes perfect sense to have it in the database. – Voo Dec 15 '15 at 21:17
  • 3
    @Voo the Redmine usage of enum is the same as [bugzilla](https://wiki.mozilla.org/Bugzilla:Modifying_Your_Installation) "the most important table contains all of the bugs of the system. It is made up of various bug properties including all of the enum values like severity and priority." - Its not a free form text field, it is a value that is one of this known and enumerable set. It isn't a *compile* time enum, but its still enumish. See also [Mantis](https://www.mantisbt.org/manual/admin.customize.enums.html). –  Dec 15 '15 at 21:24
  • @MichaelT what about non-relational database store? Unix just uses plain text files, OS X uses Xml files, etc. – imel96 Dec 16 '15 at 00:22
  • @imel96 thank you for bringing up that point I overlooked. Did the edit properly address what you were thinking of? –  Dec 16 '15 at 00:46
  • @MichaelT yes, it gives more emphasis on separation of data and programs. – imel96 Dec 16 '15 at 03:33
  • 2
    So to confirm - your point is that people should never use Enums? Wasn't clear. – niico Jul 11 '16 at 19:29
  • 1
    I've read this answer twice and I still don't get it. "There's a new priority - CEO. Now you have to go and change all the code and do a recompile and redeploy. With an 'enum-in-the-table' approach, you update the enum list to have a new priority." So, the argument is in favor of adding a type, without changing the code at all. Then "Data rarely stands alone". Well, exactly. "By restricting yourself to a stringly-typed bit of data ..., you've now needed to replicate that string". No, you update your localization tables. And you update your code to do something useful with 'CEO'. – Elise van Looij May 30 '17 at 16:06
  • @user40980 If enum is added as a table we have to perform one additional join right? – Jilson May 14 '20 at 19:53
  • With such long answer, it is way difficult to get the gist of it even though most of them indeed make great sense – Rui Mar 09 '21 at 13:42
  • 1
    _"What happens when your boss wants another priority?"_ I can agree with this from a historical perspective, but modern day development strongly incentivizes (a) no manual meddling in the production database and (b) easily redeployable software. In a modern situation, it should be easier and more desirable to update an enum and redeploy the software than it is to give a developer write access to a production database so they can manually insert some structural data. – Flater Mar 10 '21 at 11:58
  • It's been standard practice for a long time to store such data in what are usually known as _code tables_ (_e.g._, US State names keyed by two-letter codes). The answer makes it clear why this has become standard practice. – holdenweb Nov 04 '21 at 14:16
25

Which of these do you think is more likely to produce mistakes when reading the query?

select * 
from Person 
where Gender = 1

Or

select * 
from Person join Gender on Person.Gender = Gender.GenderId
where Gender.Label = "Female" 

People make enum tables in SQL because they find the latter to be more readable - leading to fewer errors writing and maintaining SQL.

You could make gender a string directly in Person, but then you would have to try and enforce case. You also may increase the storage hit for the table and the query time due to the difference between strings and integers depending on how awesome your DB is at optimizing things.

Telastyn
  • 108,850
  • 29
  • 239
  • 365
  • 11
    But then we're joining tables. If my entity has two enums, I will be joining three tables just for a simple query. – user3748908 Dec 15 '15 at 16:14
  • 14
    @user3748908 - so? Joins are what DBs are good at, and the alternatives are worse - at least in the eyes of people who chose this route. – Telastyn Dec 15 '15 at 16:16
  • 9
    @user3748908: Not only are databases really good at doing joins, they're also really good at enforcing consistency. Enforcing consistency works really, really well when you can point a column in one table at another's identifying row and say "the value for this column has to be one of the identifiers in that table." – Blrfl Dec 15 '15 at 17:43
  • 2
    This is all true but there are many cases where you need to sacrifice the joins for performance reasons. Don't get me wrong I'm all about this type of design and joining but I'm throwing that the world is not going to end if you find you sometimes don't need the joins due to performance. – JonH Dec 15 '15 at 18:16
  • 3
    If you're having to drop joining to reference tables for performance reasons @JonH you need to buy a bigger server or stop trying to push predicates through large numbers of sub-queries (I'm assuming you know what you're doing). References tables are the stuff that should be in your cache within a few seconds of starting the DB. – Ben Dec 15 '15 at 19:13
  • @Ben - true most of this stuff should be cached...I'm just throwing it out there. – JonH Dec 15 '15 at 19:14
  • 1
    @Blrfl, The database can enforce consistency between its database tables, but it can't enforce consistency between tblGender and the Gender enum in your code. The database doesn't know if I spelled "Male" wrong, it just returns zero results, AKA fails silently. – user2023861 Dec 15 '15 at 20:27
  • 1
    @user3748908 it bluntly comes down to this: do you want a fancy spreadsheet or a relational database? If it's the latter then become comfortable with joins. – Pieter B Dec 16 '15 at 12:39
  • @user2023861 And that's why we have ISO 5218 : https://en.wikipedia.org/wiki/ISO_5218 : Male = 1, Female = 2, Unknown = 0, Not Applicable = 9 – Pieter B Dec 16 '15 at 12:42
  • @JonH: Under what circumstances would a join to a table with < 10 entries be performance issue? – JacquesB Dec 16 '15 at 13:03
  • @PieterB TIL ISO 5218. Of course that's not a sustainable solution. It is an interesting bit of trivia however. – user2023861 Dec 16 '15 at 15:13
  • @Telastyn One use case for this would be where a report or similar artifact is being generated from the DB (or via e.g. Crystal Reports going directly against the DB). The application code knows that 'M' corresponds to 'Male' etc. because of the enum ... but there are scenarios where you need the data without going through the application code. – David Dec 16 '15 at 16:05
  • This answer makes the classic mistake of assuming that primary keys must be integers. Primary keys (and by extension, foreign keys) can be composed of any type of column. If the database designer chose the gender.label column to be the primary key, there would be no difficulty reasoning about the data and joining would be unnecessary (though a join might still be necessary for more complex cases, such as for localization purposes or to get additional configuration information). – cimmanon Dec 16 '15 at 17:30
  • @user2023861: There are plenty of ways to handle it cleanly, like defining a list and generating the database and type definitions from it. I've done that scads of times and it works just fine as long as there's a single definition that you can point to as the authority for what's what. – Blrfl Dec 16 '15 at 18:08
  • @cimmanon - if my Person.Gender is a string (or other non-integer data type), why the hell would I have a separate Gender table? – Telastyn Dec 16 '15 at 18:08
  • @Blrfl, that works at design-time, but what happens two years down the road when you're in a different role and a junior developer goes and adds an item to your list? I hope the developer read the documentation (you need current documentation for this). The developer minimally will have to insert a matching row in the database table, and update and add at least a couple of unit tests just to keep the redundant definitions consistent. – user2023861 Dec 16 '15 at 18:58
  • @user2023861 Junior adds the item to the only list there is and the build process generates everything else, including a nice little piece of SQL to bring the database up to date. If the way to change it isn't documented, that's my failure and I get dinged for it. If Junior doesn't read it, he broke it and owns both pieces. The point is that this can be done safely with some thought and careful automation. – Blrfl Dec 16 '15 at 19:13
  • 1
    @Telastyn To constrain the values of the column to be limited to a specific set of data. That is the entire *point* of a foreign key in the first place. Unless you'd rather use check constraints instead? – cimmanon Dec 16 '15 at 19:38
  • @cimmanon - ah, I see what you mean now. – Telastyn Dec 16 '15 at 19:43
  • I don't think discussing the decision between performance and good design will bring any value to this answer. Simply because there are other factors and realities coming to play when such decisions are made. When realities are different than ideals then other concerns take place. Without truly understand the situational settings, talking about clear cut solutions and suggestions don't make much sense. – Tarik Dec 22 '15 at 06:34
  • If you can assign the value 1 to a variable then you are looking at the same or even greater readability in the first query. In fact in your first example, what the problem is not readability but comprehensiveness of the query. Using a variable would solve that issue. – Tarik Dec 22 '15 at 06:39
  • In addition to my previous comment, with a simple variable, you not only increase the comprehension but also performance. – Tarik Dec 22 '15 at 06:40
  • @tarik - I don't see how a tsql variable gains you anything but a whole lot of code duplication. – Telastyn Dec 22 '15 at 14:15
  • It makes the value more meaningful if you use set @genderMale = 1. You can think of it as a constant in this context that eliminates magic strings. – Tarik Dec 22 '15 at 14:19
  • @tarik - and you get to do that on _every single query_ that uses the table. And you get to hope you're right. – Telastyn Dec 22 '15 at 18:44
  • Yes but readability and duplication are two different things. They may help each other but in this context of this question what I am saying is that we cannot jusr approach the problem with sole readability concerns. A proper answer should contain different aspects of programming. – Tarik Dec 22 '15 at 19:18
  • @Telastyn: Databases may be optimized for joins, but that's not a reason to add joins willy nilly. No joins is still more performant than doing an optimized join. The only reason to have the name in the joined query is human readability _before_ the codebase mapped the int value to the enum value, which implies people manually running queries against the db directly. Even if this happens rarely for debug reasons, it should never be a driving factor as to why your queries always depend on an additional join purely for that human readability. – Flater Mar 10 '21 at 12:01
17

I can't believe people didn't mention this yet.

Foreign Keys

By keeping the enum in your database, and adding a foreign key on the table that contains an enum value you ensure that no code ever enters incorrect values for that column. This helps your data integrity and is the most obvious reason IMO you should have tables for enums.

Benjamin Gruenbaum
  • 5,157
  • 4
  • 24
  • 34
  • The question is just 5 lines long and clearly states "Besides the obvious reason of enforcing correctness". So no one has mentioned it because the OP is stating that it's obvious and he is looking for other justifications - PS: I agree with you, that's a good enough reason. – user1007074 Nov 13 '18 at 13:25
8

I'm in the camp that agrees with you. If you keep a Gender enum in your code and a tblGender in your database, you may run into trouble come maintenance-time. You'll need to document that these two entities should have the same values and thus any changes you make to one you must also make to the other.

You'll then need to pass the enum values to your stored procedures like so:

create stored procedure InsertPerson @name varchar, @gender int
    insert into tblPeople (name, gender)
    values (@name, @gender)

But think how you'd do this if you kept these values in a database table:

create stored procedure InsertPerson @name varchar, @genderName varchar
    insert into tblPeople (name, gender)
    select @name, fkGender
    from tblGender
    where genderName = @genderName --I hope these are the same

Sure relational databases are built with joins in mind, but which query is easier to read?


Here's another example query:

create stored procedure SpGetGenderCounts
    select count(*) as count, gender
    from tblPeople
    group by gender

Compare that to this:

create stored procedure SpGetGenderCounts
    select count(*) as count, genderName
    from tblPeople
    inner join tblGender on pkGender = fkGender
    group by genderName --assuming no two genders have the same name

Here's yet another example query:

create stored procedure GetAllPeople
    select name, gender
    from tblPeople

Note that in this example, you'd have to convert the gender cell in your results from an int to an enum. These conversions are easy however. Compare that to this:

create stored procedure GetAllPeople
    select name, genderName
    from tblPeople
    inner join tblGender on pkGender = fkGender

All of these queries are smaller and more maintainable when going with your idea of keeping the enum definitions out of the database.

user2023861
  • 770
  • 4
  • 10
  • 1
    What if it wasn't gender though. I think we're getting too hung up on *gender* being the field. What if the OP had said "So let's say that I have an entity Bug with a Priority field" - would your answer change? –  Dec 15 '15 at 22:03
  • 4
    @MichaelT The list of possible values of "priority" is a part of the code at least to the same extent that it is a part of the data. You do see graphical icons for various priorities? You don't expect they are pulled off the database? And stuff like that could be themed and styled and still represent the same range of values stored in DB. You cannot just change it in database anyway; you have presentation code to sync. – Eugene Ryabtsev Dec 16 '15 at 08:42
2

First you need to decide if the database will only ever be used by one application or if there is a potential for multiple applications to use it. In some cases a database is nothing more than a file format for an application (SQLite databases can often be used in this regard). In this case bit duplicating the enum definition as a table can often be fine and may make more sense.

However as soon as you want to consider the possibility of having multiple applications accessing the database, then a table for the enum makes a lot of sense (the other answers go into why in more detail). The other thing to consider will you or another developer want to look at the raw database data. If so, this can be considered another application use (just one where the lab gauge is raw SQL).

If you have the enum defined in code (for cleaner code and compile time checking) as well as a table in the database, I would recommend adding unit tests to verify that the two are in sync.

Eric Johnson
  • 444
  • 2
  • 5
1

I would create a Genders table for the reason that it can be used in data analysis. I could look up all the Male or Female Persons in the database to generate a report. The more ways you can view your data, the easier it will be to discover trending information. Obviously, this is very simple enumeration, but for complex enumerations (like the countries of the world, or states), it makes it easier to generate specialized reports.

zackery.fix
  • 119
  • 3
1

When you have a code enumeration that is used to drive business logic in code you should still create a table to represent the data in the DB for the many reasons detailed above/below. Here are a few tips to insure that your DB values stay in sync with the code values:

  1. Do not make the ID field on the table an Identity column. Include ID and Description as fields.

  2. Do something different in the table that helps developers know that the values are semi-static/tied to a code enumeration. In all other look-up tables (usually where values can be added by users) I typically have a LastChangedDateTime and LastChangedBy, but not having them on enum related tables helps me remember that they are only changeable by developers. Document this.

  3. Create verification code that checks to see that each value in the enumeration is in the corresponding table, and that only those values are in the corresponding table. If you have automated application "health tests" that run post-build, at it there. If not, make the code run automatically on application startup whenever the application is running in the IDE.

  4. Create production deliver SQL scripts which do the same, but from inside the DB. If created correctly they will help with environment migrations as well.

0

Depends also on who access the data. If you just have one application that might be fine. If you add in a data warehouse or a reporting system. They will need to know what that code means, what is the human redable version of the code.

Usually, the type table wouldn't be duplicated as an enum in the code. You could load the type table in a list that is cached.

Class GenderList

   Public Shared Property UnfilteredList
   Public Shared Property Male = GetItem("M")
   Public Shared Property Female = GetItem("F")

End Class

Often, type come and goes. You would need a date for when the new type was added. Know when a specific type was removed. Display it only when needed. What if a client want "transgender" as a gender but other clients don't? All of this information is best stored in the database.

the_lotus
  • 153
  • 7