62

I am approaching a project where I'll have to implement a database with my boss; we're a very small startup, so the work environment is deeply personal.

He had given me one of the company databases before, and it completely went against what I was taught (and read about) in school about RDBMS design. For example, there are entire databases here that consist of one table (per independent database). One of those tables is 20+ columns wide; for context, here are some of the column names from one such table:

lngStoreID | vrStoreName | lngCompanyID | vrCompanyName | lngProductID | vrProductName

The point is that where he should have individual tables that hold the entity data (name, size, date purchased, etc.), he shoves it all into one large table per database.

I want to improve this design, but I am not sure why a properly normalized and segmented data model would actually improve this product. I am familiar with database design from college and I understand how to do it, but I am unsure why it actually makes a database better.

Why does a good relational schema improve a database?

8protons
  • 1,359
  • 1
  • 10
  • 27
  • 33
    One word: **Normalization.** – Robert Harvey Apr 26 '16 at 15:12
  • 9
    Close voter - justify yourself! :-) – Robbie Dee Apr 26 '16 at 15:52
  • 1
    @RobbieDee: I didn't vote to close, but Normalization generally takes up more than a few pages in a textbook, and the close vote was "Too Broad." – Robert Harvey Apr 26 '16 at 15:54
  • 3
    @RobertHarvey I think that this is one of those cases where the interpretation of _broadness_ and how deeply someone answers is best left up to the individual. For example, one of the top Q&As on here, at 2189 votes, is "What technical details should a programmer of a web application consider before making the site public?" A couple of pages on that topic could easily be written, and yet concise answers were still authored. – 8protons Apr 26 '16 at 15:57
  • 1
    Yeah, that one is not a very good example. It's the vanishingly rare post that does really well on a really broad topic. It's the pearl in a vast sea of sand. The vast majority of such questions never go anywhere, or produce so much flotsam that they are essentially noise. – Robert Harvey Apr 26 '16 at 15:59
  • 2
    @RobertHarvey 3NF can be covered in about 6 lines - see bullet points [here](https://support.microsoft.com/en-gb/kb/283878)... – Robbie Dee Apr 26 '16 at 16:07
  • 1
    @RobbieDee: I haven't read it all, but I suspect it just discusses the mechanics, not the advantages... and the question wasn't "How does 3NF work?" – Robert Harvey Apr 26 '16 at 16:08
  • @RobertHarvey Sorry - should have explained. 3NF is usually the limit to which normalization is taken. But I take the general point that you could write a tome on it if you were so inclined. – Robbie Dee Apr 26 '16 at 16:11
  • 3
    @8protons I edited your question to remove the "[How do I explain ${something} to ${someone}?](http://meta.programmers.stackexchange.com/q/6629)" aspect and to be a little bit more focused and clear. –  Apr 26 '16 at 17:04
  • @Snowman Thank you very much, I appreciate that. Now this more specifically aims at getting the results/answers I'm looking for. – 8protons Apr 26 '16 at 17:06
  • 1
    Without understanding more about WHY he de-normalized this database, no one here can even say if this design is incorrect. It PROBABLY would be improved via the normalization techniques you learned, but don't blindly assume as much. If those names never change (and I mean NEVER) then denormalizing it this way can improve query performance by skipping the need for a lot of joins, for example. – Graham Apr 26 '16 at 19:20
  • 3
    +1 for looking for reasons instead of just blindly drinking the Kool-Aid. – user253751 Apr 26 '16 at 20:51
  • 6
    It is common for new employees to criticize established procedures without understanding the reasons behind them, even if those reasons are not technically sound. First find out *why* your boss built it that way. He/she may very well know that it's not a good design, but not have the knowledge (or more likely, time) to do it better. Any changes you propose will probably be received more positively if you respectfully acknowledge the reasons for the current design. – Pedro Apr 26 '16 at 21:20
  • 1
    @8protons No need to be defensive. You can't expect me to consider in my answer/comment something that I can't see in your question. That said, "question" would have been a better word than "criticize," so I apologize if it sounded too harsh. I've been one of those critical/questioning new employees myself. – Pedro Apr 26 '16 at 22:06
  • 5
    `He [the boss] had given me one of his databases before and it completely went against what I was taught (and read about) in school for RDBMS` <-- Welcome to the real world! – Möoz Apr 27 '16 at 03:23
  • 4
    I'm not really a database guy, so take this with a grain of salt, but it looks to me that following the 3NF rules is the database equivalent of writing clean code: It doesn't improve your product _for you_. What it does is, it makes it vastly easier for _you_ to improve your product. – Solomon Slow Apr 27 '16 at 17:49
  • 1
    I'd recommend Database Design for Mere Mortals by Michael J. Hernandez as a good layman's introduction to relational database design, and why it's important. – StockB Apr 27 '16 at 19:11
  • 5
    I am reminded of my favorite relational database quote: "Normalize til it hurts, denormalize til it works" – Jake Apr 27 '16 at 22:32
  • 1
    I know your pain. You poor bastard. –  Apr 28 '16 at 06:08
  • 2
    Some friendly advice: if you are just 2 months into your career, please be careful before jumping to conclusions about design choices that were made by your seniors. I do not mean that you could not be right while your seniors are wrong, but there is probably a history behind those technical decisions and probably reasons why changing the design is impractical. Even a startup can have legacy code, and the theory you learned at university is often idealized and academic and will not match or scale to real-world scenarios – senseiwu Apr 28 '16 at 08:44
  • 2
    @zencv Thanks! I had asked politely if there was a reason why the DB wasn't normalized and the simple response was, "Didn't think about it/no time". I made sure to probe for reasons that I maybe didn't see before coming here to ask :] I know enough to know that I don't know much haha – 8protons Apr 28 '16 at 14:22
  • 3
    in that case, you should try to assess the efforts to normalize the database including any migration costs, risks associated with the migration and above all the opportunity costs of paying this technical debt (assuming there are NO valid technical reasons for the current design). In any commercial setup, s/w development is not just about design and programming, rather it has to be considered as an engineering discipline – senseiwu Apr 28 '16 at 15:10
  • 1
    @zencv That's a good point to think about, thank you. I wasn't going to attempt to change the old databases. Basically I opened up this thread with, "I am approaching a project where I'll have to implement a database" because it's the _new_ database that I want to start off on a good foot. In the case that there ended up being a point that the boss and I disagree on how to implement the DB, I wanted to make sure I was armed with convincing reasons for taking the time to normalize it as opposed to just shoving every attribute for every object all in one DB like the other DBs. – 8protons Apr 28 '16 at 15:21
  • 1
  • @8protons Normalization is like creating a form. It ensures that all the necessary data is recorded accurately so the DB reflects reality. *Critical* -- Create VIEWS that present the data in the denormalized form modern programmers want. Allow reading only through those views. Design transaction tables (or even flat files) for all writes (insert/update/delete). Use triggers and stored procedures to copy that data to the various base tables. The T-tables give you some queuing and rollback capabilities as well. Build in queuing for the bottlenecks that *will* happen. – DocSalvager Apr 29 '16 at 02:57
  • "But I take the general point that you could write a tome on it if you were so inclined." It's already been written. [*An Introduction to Database Systems*](http://www.amazon.com/Introduction-Database-Systems-8th/dp/0321197844?ie=UTF8&keywords=an%20introduction%20to%20database%20systems&qid=1462159836&ref_=sr_1_1&s=tradein-aps&sr=8-1&srs=9187220011) by [C.J. Date.](https://en.wikipedia.org/wiki/Christopher_J._Date) He invented SQL and [Edgar Codd](https://en.wikipedia.org/wiki/Edgar_F._Codd) invented the relational model. – radarbob May 02 '16 at 03:32
  • *Why does a good relational schema improve a database?* Data integrity. How one gets there is through normalization, in part. And performance is a result of 40 years of database engine improvements. sorry R.H. and Phillip but pedanticosity has gotten hold. – radarbob May 02 '16 at 03:34

7 Answers

70

The performance argument is usually the most intuitive one. You especially want to point out how difficult it will be to add good indexes to an incorrectly normalized database (note: there are edge cases where denormalization can in fact improve performance, but when you are both inexperienced with relational databases, you will likely not spot those cases easily).

Another is the storage size argument. A denormalized table with lots of redundancies will require far more storage. This also plays into the performance aspect: the more data you have, the slower your queries will be.

There is also an argument which is a bit harder to understand, but is in fact more important because you can't solve it by throwing more hardware at it: the data consistency problem. A properly normalized database ensures by itself that a product with a specific ID always has the same name. But in a denormalized database such inconsistencies are possible, so special care needs to be taken to avoid them, which takes up programming time to get right and will still produce bugs that cost you in customer satisfaction.
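A minimal sketch, using made-up table names modeled on the columns in the question, of how the same change plays out in each design:

-- Normalized: the product name lives in exactly one row.
UPDATE Products
SET    ProductName = 'Widget v2'
WHERE  ProductID = 42;

-- Denormalized: the name is repeated on every row that mentions the product.
-- Every code path that writes it must agree, and any row a bulk update misses
-- silently keeps the old name, so rows can end up disagreeing about product 42.
UPDATE StoreCompanyProduct
SET    vrProductName = 'Widget v2'
WHERE  lngProductID = 42;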

Philipp
  • 23,166
  • 6
  • 61
  • 67
  • 19
    One major edge case for denormalization is _data warehousing_, specifically, if you have a large amount of data that is guaranteed never to change and you want to query it more quickly and efficiently at the expense of storage space. Good answer, this is just an FYI for any SQL newbies who are not sure why anything other than 3NF would be desirable. –  Apr 26 '16 at 17:07
  • 1
    @Snowman "just an FYI for any SQL newbies who are not sure why anything other than 3NF would be desirable" aka like myself haha. This was a great addendum, thank you. – 8protons Apr 26 '16 at 17:12
  • Could you define "normalization" in this context? – user1717828 Apr 26 '16 at 17:38
  • 3
    @user1717828 https://en.wikipedia.org/wiki/Database_normalization – Philipp Apr 26 '16 at 17:51
  • 11
    I'm not sure why the consistency argument is "harder to understand." It seems much simpler to me: if a value changes, then all copies of that value must be updated. Updating a single copy is much less error prone than updating hundreds or thousands of copies of the same data. This applies equally well to *relationships between* data. (If I have the relationship stored two ways, I have to update both copies of the relationship.) This is an *extremely* common problem in denormalized DBs; it's *very* difficult to prevent this corruption in practice (an exception is materialized view type usage). – jpmc26 Apr 27 '16 at 01:03
  • Lack of normalization is also why one runs into enormous problems very quickly in JSON caches like MongoDB (a tale about that [here](http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/)). – David Tonhofer Apr 27 '16 at 06:09
  • 4
    That last paragraph should be highlighted in bold. :-) Without normalization it is impossible to guarantee data integrity. Controlling input solely at the Business Logic layer is a fool's errand as *every* non-normalized database eventually exhibits some sort of data anomaly. –  Apr 27 '16 at 20:21
  • @Snowman Data Warehousing denormalization can - and in general should - be carefully managed by the proper separation of facts from dimensions. Sales in the DW would have dimensions for Store, Company, Product, Sale Date, etc. and facts for the amounts involved in the sale. This single-table thing is just... wrong. – Corey Apr 28 '16 at 03:13
  • @Corey yes, the tables described in the question are awful. Nobody is debating that fact. I was referring to denormalization in general: there are tradeoffs, and sometimes it makes sense. Especially when data is set in stone and not being updated, reducing the risk of errors while increasing performance. –  Apr 28 '16 at 03:16
  • One argument to de-normalize is to store the values as they are. Imagine an online store. You have the item "Pink Socks" with ID 3. Then, someone orders those socks. Instead of you having a relationship, you store a copy of the name. Imagine that, 10 months from now, your "Pink Socks" are unavailable. And you delete the item. Imagine the chaos when you try to make an annual report. – Ismael Miguel Apr 28 '16 at 10:24
  • 2
    @IsmaelMiguel Usual practice is that master-data like this is never hard-deleted from the database. You only soft-delete it by setting a flag on it which says it is no longer available. In this particular case it would be a good idea to have a foreign key relationship between products and orders which means that the database will throw an error when you try to delete a product which is referenced by any orders. – Philipp Apr 28 '16 at 10:39
  • @Philipp I totally agree with the "soft-delete" part. Sometimes, it just isn't practical. Keeping with the store, instead of having just an ID to the details of the client, one can copy some of the details. Delivery details and billing details for **that** very specific order should be copied over. One may want to buy something and have it delivered to this place for someone else. Without copying over the details, you will need a table of entities. But when you expect this to be a once-every-blue-moon thing, the new table is added complexity. – Ismael Miguel Apr 28 '16 at 12:50
24

I'll have to implement a database with my boss ...

Using dedicated Database Management software might be considerably easier (sorry; couldn't resist).

lngStoreID | vrStoreName | lngCompanyID | vrCompanyName | lngProductID | vrProductName

If this database only cares about "logging" which product was sold where, when and by whom, then you might be able to stretch the definition of "OK database" far enough to cover it. If this data is being used for anything else, then it's really pretty poor.

But ...

Do the application and queries using this data respond poorly or slowly? If not, then there's no real problem to solve. Sure, it looks and feels ugly, but if it works then you're not going to get any "points" for suggesting it "could" be better.

If you can find definite symptoms (i.e. problems) that look like they're caused by poor data modelling, then prototype a better solution. Take a copy of one of these "databases", normalise the data and see if your solution runs any better. If it's considerably better (and I would fully expect that any update operations on this data would be massively improved) then go back to your boss and show them the improvement.

It's perfectly possible to recreate his "single-table view" of the data with .. well .. Views.
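For instance, a rough sketch of such a view, assuming (hypothetically) that the data were split into Stores, Companies, Products and a Sales table:

CREATE VIEW vwStoreCompanyProduct AS
SELECT st.StoreID     AS lngStoreID,
       st.StoreName   AS vrStoreName,
       co.CompanyID   AS lngCompanyID,
       co.CompanyName AS vrCompanyName,
       pr.ProductID   AS lngProductID,
       pr.ProductName AS vrProductName
FROM   Sales sa
JOIN   Stores    st ON st.StoreID   = sa.StoreID
JOIN   Companies co ON co.CompanyID = sa.CompanyID
JOIN   Products  pr ON pr.ProductID = sa.ProductID;

Anything that expects the old flat shape can keep reading from the view while the underlying tables stay normalized.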

Phill W.
  • 11,891
  • 4
  • 21
  • 36
  • 11
    Resistance to the single table weltanschauung often comes from those inexperienced with SQL who don't understand joins - especially with regards to missing data i.e. outer joins. – Robbie Dee Apr 26 '16 at 15:56
  • 6
    @RobbieDee More commonly, it's from people who have seen denormalized data get corrupted by becoming inconsistent. I am one such person. I would only consider this kind of structure in the situation Phill suggests: this is some kind of logging/reporting table where data will *never* be updated or only updated by being wiped clean and completely re-derived from other sources. – jpmc26 Apr 27 '16 at 01:05
  • 2
    Even if the application performs acceptably with a database like this, it's still not flexible like a properly normalized database. If the store name or company name change, it will have to be updated everywhere, instead of in just a store or company table. In some cases, that may actually be what you want (such as if the data is mainly collected for archival purposes), but we'd need to know more about the specific application. – Zach Lipton Apr 27 '16 at 05:36
  • 1
    @Zach: agreed, that's why the sales log is potentially an acceptable case for this. Supposing you want each sale to be associated with whatever the store was named at the time the sale was made, *not* "the current name of the store", then attempting to "normalise" this introduces some considerable complexity (because the table recording store names would need to be a series over time, not just one value per storeid) – Steve Jessop Apr 27 '16 at 18:02
  • Perhaps a rule of thumb would be that if the only complexity introduced by a proposed normalisation is that a few queries now need joins in them to pick up all the columns they need to report, then you should run not walk to make that change :-) – Steve Jessop Apr 27 '16 at 18:05
  • @SteveJessop Strictly speaking, your data *is* normalized in that situation. To split things up when normalizing, you have to have a functional dependency that doesn't include a candidate key, and in the case you describe, the dependency doesn't exist. It happens that a lot of your data has the same value, but that's different from being able to determine the value based on others for all data and all time. – jpmc26 Apr 29 '16 at 04:26
  • @jpmc26: Well, the "store name" depends only on "store" and "time of sale", but not on "cash register used", and so it is a functional dependency on a proper subset of a candidate key, so the table is not in 2NF? Or maybe I'm wrong, I never thoroughly understood this. If each store only has one cash register, so that "store" + "time" is a candidate key and the only such, then we might be OK :-) – Steve Jessop Apr 29 '16 at 08:56
14

Why does a good relational schema improve a database?

The answer is: it doesn't always improve a database. You should be aware that what you were likely taught is called Third Normal Form.

Other forms are valid in some situations, which is key to answering your question. Your example looks like First Normal Form, if that helps you feel better about its current state.

3NF rules establish relationships among data that "improve" a database:

  1. Prevent invalid data from entering your system (if a relationship is declared as, say, 1-to-1, the database forces an error regardless of the code written on top of it; see the sketch after this list). If your data is consistent in the database, it is less likely to result in inconsistencies outside your database.

  2. It provides a way to validate code (e.g. a many-to-one relationship is a signal to restrict an object's properties/behaviors). When writing code against the database, programmers sometimes treat the data structure as an indicator of how their code should work. Or they can provide useful feedback if the database doesn't match their code. (This is more like wishful thinking, unfortunately.)

  3. Provide rules that can significantly help you reduce mistakes when building a database, so that you are not building it based on arbitrary requirements that might come at any time during the life of a database. Instead, you are systematically evaluating the information to achieve specific goals.

  4. Proper database structures lead to improved performance by connecting data in ways that minimize data storage, minimize storage calls to retrieve data, maximize in-memory resources and/or minimize data sorting/manipulation for the particular dataset you have, relative to the queries you execute against it. But the "proper" structure depends on the amount of data, the nature of the data, the type of query, system resources, etc. Normalizing may even make performance worse (e.g. if you would otherwise load all of the data as one table, the added joins can slow a query down). Transaction processing (OLTP) and business intelligence (data warehousing) are very different.
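A minimal sketch of point 1, with invented table names, showing the database itself rejecting bad data:

CREATE TABLE Company (
    CompanyID   INT PRIMARY KEY,
    CompanyName VARCHAR(100) NOT NULL
);

CREATE TABLE Store (
    StoreID   INT PRIMARY KEY,
    StoreName VARCHAR(100) NOT NULL,
    CompanyID INT NOT NULL REFERENCES Company (CompanyID)
);

-- Fails with a foreign-key violation if company 999 does not exist,
-- no matter what the application code sitting on top of it does.
INSERT INTO Store (StoreID, StoreName, CompanyID)
VALUES (1, 'Main Street', 999);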

In a small company with small data sets, you might find that there is nothing wrong with the way it is now. Except, if you grow, it will be a pain to "fix" later, because as the table gets large, the systems that use it will likely go slower.

Usually you will want to emphasize fast transactions as a company grows. However, if you spend time on this project now instead of other things that the company may need more urgently, you may never have that problem because your company never really grows. That is the "pre-optimization challenge" - where to spend your precious time right now.

Good luck!

Toby Speight
  • 550
  • 3
  • 14
Jim
  • 387
  • 1
  • 6
  • 4
    Unmentioned but I think an important point to programmers is that editing one "thing" requires editing of only a single row rather than having to loop the entire database to find and replace that single thing. – slebetman Apr 26 '16 at 23:28
  • @slebetman You should never have a code side loop to update multiple rows in a single table, regardless of whether it's normalized. Use a `WHERE` clause. Of course, these can still go wrong, but it's less likely in a normalized situation since you only have to match one row via primary key. – jpmc26 Apr 28 '16 at 05:14
  • @jpmc26: By looping the database I mean constructing a query to update all affected rows. Sometimes a single WHERE suffices. But I've seen unholy structures that require subselects into the same table to get all affected rows without affecting rows that should not change. I've even seen structures where a single query can't do the job (the entity that needs to change resides in different columns depending on the row) – slebetman Apr 29 '16 at 02:02
  • Many excellent answers to this question, and this was no exception. – Mike Chamberlain Apr 29 '16 at 08:13
11

There are multiple reasons why using one large "god table" is bad. I'll try to illustrate the problems with a made-up example database. Let's assume you are trying to model sporting events. We will say you want to model games and the teams playing in those games. A design with multiple tables might look like this (this is very simplistic on purpose, so don't get caught up in places where more normalization could be applied):

Teams
Id | Name | HomeCity

Games
Id | StartsAt | HomeTeamId | AwayTeamId | Location

and a single table database would look like this

TeamsAndGames
Id | TeamName | TeamHomeCity | GameStartsAt | GameHomeTeamId | GameAwayTeamId | Location

First, let's look at making indices on those tables. If I needed an index on the home city for a team, I could add it to the Teams table or the TeamsAndGames table pretty easily. Remember that whenever you create an index, it needs to be stored on disk somewhere and updated as rows are added to the table. In the case of the Teams table this is pretty straightforward: I put in a new team, the database updates the index. But what about TeamsAndGames? The same applies as in the Teams example: I add a team, the index gets updated. But it also happens when I add a game! Even though that field will be null for a game, the index still has to be updated and stored on disk for that game anyway. For one index, this doesn't sound too bad. But when you need many indices for the multiple entities crammed into this table, you waste a lot of space storing them and a lot of processor time updating them for things they don't apply to.

Second, data consistency. In the case of using two separate tables, I can use foreign keys from the Games table to the Teams table to define which teams are playing in a game. And assuming I make the HomeTeamId and AwayTeamId columns not nullable, the database will ensure that every game I put in has two teams and that those teams exist in my database. But what about the single-table scenario? Well, since there are multiple entities in this table, those columns have to be nullable (you could make them not nullable and shove garbage data in there, but that is just a horrible idea). If those columns are nullable, the database can no longer guarantee that when you insert a game it has two teams.

But what if you decide to just go for it anyway? You set up the foreign keys such that those fields point back to another entity in the same table. But now the database will just make sure that those entities exist in the table, not that they are the correct type. You could very easily set GameHomeTeamId to the ID of another game and the database won't complain at all. If you tried that in the multiple table scenario, the database would throw a fit.

You could try to mitigate these issues by saying "well, we will just make sure we never do that in code". If you are confident in your ability to write bug free code the first time and in your ability to take into account every strange combination of things a user might try, go right ahead. I personally am not confident in my ability to do either of those things, so I'll let the database give me an extra safety net.
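For reference, a minimal DDL sketch of the two-table version of those constraints (column types are guesses; the syntax is generic SQL):

CREATE TABLE Teams (
    Id       INT PRIMARY KEY,
    Name     VARCHAR(100) NOT NULL,
    HomeCity VARCHAR(100) NOT NULL
);

CREATE TABLE Games (
    Id         INT PRIMARY KEY,
    StartsAt   TIMESTAMP NOT NULL,
    HomeTeamId INT NOT NULL REFERENCES Teams (Id),
    AwayTeamId INT NOT NULL REFERENCES Teams (Id),
    Location   VARCHAR(100)
);

-- NOT NULL plus the foreign keys mean a game row cannot exist without
-- two teams that are really present in Teams, no application code required.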

(This gets even worse if your design is one where you copy all relevant data between rows instead of using foreign keys. Any spelling / other data inconsistencies will be hard to resolve. How can you tell if "Jon" is a misspelling of "John" or if it was intentional (because they are two separate people)?)

Third, almost every column needs to be nullable or must be filled with either copied or garbage data. A game doesn't need a TeamName or TeamHomeCity. So either every game needs some kind of placeholder in there or it needs to be nullable. And if it is nullable, the database will happily take a game with no TeamName. It will also take a team with no name, even if your business logic says that should never happen.

There are a handful of other reasons why you would want separate tables (including preserving developer sanity). There are even a few reasons why a larger table might be better (denormalization sometimes improves performance). Those scenarios are few and far between (and usually best handled when you have performance metrics to show that that is really the problem, not a missing index or something else).

Finally, develop something that will be easy to maintain. Just because it "works" doesn't mean it's OK. Trying to maintain god tables (like god classes) is a nightmare. You are just setting yourself up for pain later.

Trang Oul
  • 105
  • 6
Becuzz
  • 4,815
  • 1
  • 21
  • 27
  • 1
    "Teams: Id | Name | HomeCity". Just make sure your data schema doesn't make your application incorrectly claim that Super Bowl XXXIV was won by the LA Rams. Whereas SB XXXIV *should* appear in a query for all championships won by the team currently known as the LA Rams. There are better and worse "god tables", and you've certainly presented a bad one. A better one would be "game ID | home team name | home team city | away team name | away team city | game starts at | etc...". Which comes about as a first attempt to model information like "New Orleans Saints @ Chicago Bears 1p Eastern". – Steve Jessop Apr 27 '16 at 18:22
6

Quote of the day: "Theory and practice should be the same... in theory"

Denormalized table

Your unique hold-it-all table containing redundant data has one advantage: it makes reporting on its rows very simple to code and fast to execute, because you don't have to do any joins. But this comes at a high cost:

  • It holds redundant copies of relations (e.g. lngCompanyID and vrCompanyName). Updating master data might require updating many more rows than in a normalized schema.
  • It mixes everything together. You can't set up simple access control at the database level, e.g. ensuring that user A can update only company info and user B only product info (see the sketch after this list).
  • You can't enforce consistency rules at the database level (e.g. a primary key to guarantee that there's only one company name per company ID).
  • You don't fully benefit from the DB optimizer, which could identify optimal access strategies for a complex query by taking advantage of the size of the normalized tables and the statistics of several indexes. This might quickly offset the limited benefit of avoiding joins.
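As an illustration of the access-control point (the user and table names are hypothetical and only possible once the data is split out):

-- Only possible when companies and products live in separate tables.
GRANT SELECT, UPDATE ON companies TO user_a;
GRANT SELECT, UPDATE ON products  TO user_b;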

Normalized table

The disadvantages above are advantages for the normalized schema. Of course, the queries might be a little more complex to write.

In short, the normalized schema expresses the structure of, and relations between, your data much better. I'll be provocative and say it's the same kind of difference as between the discipline required to use a set of ordered office drawers and the ease of use of a trash bin.

Christophe
  • 74,672
  • 10
  • 115
  • 187
5

I think there are at least two parts to your question:

1. Why shouldn't entities of different types be stored in the same table?

The most important answers here are code readability and speed. A SELECT name FROM companies WHERE id = ? is just so much more readable than a SELECT companyName FROM masterTable WHERE companyId = ?, and you are less likely to accidentally query nonsense (e.g. SELECT companyName FROM masterTable WHERE employeeId = ? wouldn't be possible when companies and employees are stored in different tables). As for speed, data from a database table is retrieved either by reading the full table sequentially or by reading from an index. Both are faster if the table/index contains less data, and that's the case if the data is stored in different tables (and you only need to read one of the tables/indices).

2. Why should entities of a single type be split into sub-entities that are stored in different tables?

Here, the reason is mostly to prevent data inconsistencies. With the single-table approach, for an order management system you might store the customer name, customer address, and product ID of the product the customer ordered as a single entity. If a customer ordered multiple products, you'd have multiple instances of the customer's name and address in your database. In the best case, you just have duplicate data in your database, which may slow it down a bit. But a worse case is that someone (or some code) made a mistake when the data was entered, so that a company ends up with different addresses in your database. This alone is bad enough. But if you were to query a company's address based on its name (e.g. SELECT companyAddress FROM orders WHERE companyName = ? LIMIT 1), you'd just arbitrarily get one of the two addresses returned and wouldn't even realize that there was an inconsistency. Each time you run the query you may actually get a different address, depending on how your query is resolved internally by the DBMS. This will likely break your application somewhere else, and the root cause of that breakage will be very hard to find.

With the multi-table approach, you'd realize that there is a functional dependency from company name to company address (if a company can have only one address), you'd store the (companyName, companyAddress) tuple in one table (e.g. company), and the (productId, companyName) tuple in another table (e.g. order). A UNIQUE constraint on the company table could then enforce that each company has only a single address in your database, so that no inconsistency in company addresses could ever arise.

Note: in practice, for performance reasons you'd probably generate a unique companyId for each company and use it as a foreign key instead of using the companyName directly. But the general approach stays the same.
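A minimal sketch of that layout, following the note above (names are illustrative; the second table is called orders rather than order because ORDER is a reserved word):

CREATE TABLE company (
    companyId      INT PRIMARY KEY,
    companyName    VARCHAR(100) NOT NULL UNIQUE, -- at most one row, and thus one address, per name
    companyAddress VARCHAR(200) NOT NULL
);

CREATE TABLE orders (
    orderId   INT PRIMARY KEY,
    productId INT NOT NULL,
    companyId INT NOT NULL REFERENCES company (companyId)
);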

Dreamer
  • 161
  • 2
3

TL;DR - They're designing the database based on how they were taught when they were in school.

I could have written this question 10 years ago. It took me some time to understand why my predecessors designed their databases the way they did. You're working with someone that either:

  1. Gained most of their database design skills using Excel as a database or
  2. They're using the best practices from when they got out of school.

I don't suspect it's #1 since you actually have ID numbers in your table, so I'll assume #2.

After I got out of school, I was working for a shop that used an AS/400 (aka IBM i). I found some odd things in the way they designed their databases, and began advocating that we make changes to follow how I was taught to design databases. (I was dumb back then)

It took a patient older programmer to explain to me why things were done that way. They hadn't changed the schema because it would have caused programs that were older than me to break. Literally, the source code for one program had a creation date of the year before I was born. On the system we were working on, their programs had to implement all the logic and operations that your database's query planner handles for you. (You can see that by running EXPLAIN on one of your queries)

He was up to date on the techniques I was trying to implement, but keeping the system running was more important than making changes "because it went against what I was taught". Every new project either of us started made the best use of the relational model that we could. Unfortunately, other programmers/consultants from that time still designed their databases as if they were working with the former constraints of that system.


Some examples of what I encountered that didn't fit the relational model:

  • Dates were stored as Julian day numbers which required a join to a date table to get the actual date.
  • Denormalized tables with sequential columns of the same type (e.g. code1, code2, ..., code20)
  • NxM length CHAR columns representing an array of N strings of length M.

The reasons that I was given for those design decisions were all based on the constraints of the system when the database was first designed.

Dates - I was told it took more processing time to use date functions (which month or day or weekday) to process a date than it did to create a table of every possible date with all of that information.

Sequential columns of the same type - The programming environment they were in allowed a program to create an array variable over part of the row. And it was an easier way to reduce the number of read operations.

NxM Length CHAR columns - It was easier to shove configuration values into one column to reduce file read operations.

A poorly conceived C equivalent, reflecting the programming environment they had:

#include <stdio.h>

#define COURSE_LENGTH 4
#define NUM_COURSES   4
#define PERIOD_LENGTH 2

struct mytable {
    int id;
    char periodNames[NUM_COURSES * PERIOD_LENGTH];  // NxM CHAR column: 4 period codes of 2 chars each
    char course1[COURSE_LENGTH];                    // sequential columns of the same type
    char course2[COURSE_LENGTH];
    char course3[COURSE_LENGTH];
    char course4[COURSE_LENGTH];
};

int main(void)
{
    // Example row: the strings deliberately fill their arrays with no terminator,
    // just like fixed-width fields in the old record layout
    struct mytable row = {.id = 1, .periodNames = "HRP1P2P8",
                          .course1 = "MATH", .course2 = "ENGL",
                          .course3 = "SCI ", .course4 = "READ"};

    char *courses = (char *)&row.course1; // Pointer used to walk the sequential columns as one array

    for (int i = 0; i < NUM_COURSES; i++) {
        printf("%d: %.*s -> %.*s\n", i + 1,
               PERIOD_LENGTH, &row.periodNames[PERIOD_LENGTH * i],
               COURSE_LENGTH, &courses[COURSE_LENGTH * i]);
    }

    return 0;
}

Outputs

1: HR -> MATH
2: P1 -> ENGL
3: P2 -> SCI
4: P8 -> READ

According to what I was told, some of this was considered best practice at the time.

Core.B
  • 201
  • 1
  • 4