2

I was following a tutorial on the normal forms of SQL databases, and I got confused landing on this example : https://www.tutorialspoint.com/sql/third-normal-form.htm.

From

CREATE TABLE CUSTOMERS(
   CUST_ID       INT              NOT NULL,
   CUST_NAME     VARCHAR (20)      NOT NULL,
   DOB           DATE,
   STREET        VARCHAR(200),
   CITY          VARCHAR(100),
   STATE         VARCHAR(100),
   ZIP           VARCHAR(12),
   EMAIL_ID      VARCHAR(256),
   PRIMARY KEY (CUST_ID)
);

It creates a new table ADDRESS like this because there is a "transitive dependency between zipcode and address".

CREATE TABLE ADDRESS(
   ZIP           VARCHAR(12),
   STREET        VARCHAR(200),
   CITY          VARCHAR(100),
   STATE         VARCHAR(100),
   PRIMARY KEY (ZIP)
);

This is where I am really confused. Why use the zipcode as a primary key? Primary keys have to be unique, can't you have two addresses with the same zipcode?

Either

  • I don't understand how a zipcode work
  • I don't understand how a primary key work
  • This example is clearly wrong
  • I don't understand something else.
Ricola
  • 269
  • 1
  • 3
  • 7
  • 2
    its not the primary key for the address, its a primary key for the zipcode table. a zipcode will only have one street,city and state (at least thats what its asserting, not sure if its true) – Ewan Nov 28 '18 at 16:18
  • 4
    This tutorial is confusing because the text says to more it to a Zip code table, but the code example is the `ADDRESS` table. This tutorial is confusing and I think its wrong. – DFord Nov 28 '18 at 16:23
  • 2
    The zip code is a primary key only if you will ever deal with postal codes from one country. – Greg Burghardt Nov 28 '18 at 16:28
  • 1
    Also, required reading: [Falsehoods Programmers Believe About Addresses](https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/). It's old, but has good information. – Greg Burghardt Nov 28 '18 at 16:30
  • And since you've jumped head first into the black hole that is modeling addresses in a database, checkout out [What problems are solved by splitting street addresses into individual columns?](https://dba.stackexchange.com/q/133546/72417). I had to jump down this black hole a couple of years ago. – Greg Burghardt Nov 28 '18 at 16:32
  • 2
    Good for you for spotting that issue because the example in that “tutorial” is totally bonkers. It makes no sense to use a ZIP code as a primary key. Maybe they wanted to use the CUST_ID as primary key of the address table and just got confused, but that doesn't make any sense either. Perhaps just ignore that site… – amon Nov 28 '18 at 16:38
  • 1
    Possible duplicate of [What's a universal way to store a geographical address/location in a database?](https://softwareengineering.stackexchange.com/questions/357900/whats-a-universal-way-to-store-a-geographical-address-location-in-a-database) – gnat Nov 28 '18 at 16:52
  • @gnat perhaps rather a duplicate of questions about normalization and how far to go when normalizing data. – Bent Nov 28 '18 at 16:57
  • 1
    The example is not just wrong, its very wrong. – GrandmasterB Nov 28 '18 at 17:21
  • Could this be using zip+4? Still a bad idea IMO but somewhat workable. The field is big enough for that. – JimmyJames Nov 28 '18 at 17:34
  • 1
    Notice that zipcode is defined as 12 characters long. In the US, at present, there are 5 digit zipcodes and 9 digit zipcodes. Presumably,the 9 digit zipcodes identify a mailbox or slot, and determines street, cty and state. This is local to the US. If you extend your model to even Canada, the name Zipcode is highly misleadng, and the table structure may be dysfunctional. – Walter Mitty Nov 29 '18 at 11:13

3 Answers3

15

The example is making a fundamental mistake: it's using data as a primary key. It should create and use unique IDs.

The comments debate how correct it is to assume that a zipcode maps to a particular street. Whether that's correct or not, the simple fact is that for this to work without a unique ID it must be correct, not only now but forever more. That is exactly why it's wrong to do this. You can't possibly know the future. Use a unique ID.

If zipcode uniquely identifies the data you're normalizing now then you have a natural key. But adding just one more record can destroy that. Natural keys can be used when importing data to help build unique ID relations. They should not be used when an application is collecting data from a user that can ensure the relationship is real.

People get these two use cases of structuring data confused all the time. Unique ID's should always be preferred in operational systems. The problem is they don't always exist. When they don't we can construct uniqueness by selecting data fields to be natural keys as we normalize. But that constructed uniqueness is ALWAYS brittle. It might only be true now. It's ok to use the fact that it's true now to generate unique IDs now. But after that new data should be assigned unique ID's properly.

Unique ID's don't erode as more data is added. Natural keys often do. Developers who field systems that insist that their natural key assumptions hold regardless of reality often cause problems that operators find themselves having to work around. Please don't do that.

candied_orange
  • 102,279
  • 24
  • 197
  • 315
  • There is no reason why keys have to last forever. What matters is whether enforcing a certain functional dependency will improve the quality of your data. If the rule "zipcode determines street" is a useful data integrity rule then it makes sense for zipcode to be a key - perhaps not the *only* key in the table but at least it should be a key. I don't know if that is a useful rule in this case but it could be. Of course the rule could change in future but that doesn't mean it's an invalid choice today – nvogel Dec 02 '18 at 22:49
  • @nvogel the reason keys need to last forever is that I hate getting 2am phone calls because the DB is dumping records. – candied_orange Dec 02 '18 at 23:23
  • Not a "fundamental mistake" then but more to do with your personal situation. As you rightly say, developers' assumptions should never determine natural keys - and that includes assumptions about what the keys should not be as well as what they should be. Business rules should be the deciding factor. Natural keys implement business functional requirements by enforcing important dependencies, keeping bad data out and ensuring the reliability of downstream processes based on the data. – nvogel Dec 05 '18 at 06:33
  • @nvogel "Natual keys implement business functional requirements" no they don't. That's the point. We make assumptions about the structure of data, start counting on those assumptions, and get disappointed. Data is data. It does as it pleases regardless of our business needs. The objective isn't to keep data we didn't anticipate out. It's to keep surprises from destroying the system. UniqueIDs keep on working regardless of what reality does. Their only job is to give the DB structure. Nothing else. That makes them very resilient. – candied_orange Dec 05 '18 at 11:35
  • *I'm sorry we charged you twice for your order, Mr Customer. Our developers believe that business needs are just assumptions, so they allow duplicate entries in our system.* CUSTOMER: *Your developers can assume I will be taking my business elsewhere in future!* – nvogel Dec 05 '18 at 13:12
  • @nvogel If we screw up and charge a customer twice we damn well better have records of it. Business rules belong in the application, not the database. A database's job is to record data. Not to make decisions. – candied_orange Dec 05 '18 at 17:03
  • Then it seems you do agree that business rules *should* be implemented in software. Rules are clearly very important for software to be effective and should not be just programmers' assumptions. In some cases (PII/personal, health, banking) there are regulations and criminal penalties for not applying rules, including identifiers. There's room for discussion about *where* business rules should be located: database/integration layer/rules engine/application. The best general answer is "it depends". However, the fact that a rule may change is not by itself a good reason for not implementing it. – nvogel Dec 06 '18 at 09:57
4

Using a zip code as a primary key would be incorrect. The author of the tutorial is correct in that tables should be carefully looked at to determine what can be split apart to reduce large quantities of duplicate data in the grand scheme of things. However, a zip code isn't unique as more than one customer may (and most likely would) live in the same zip code. Even if the extra 4 digits the post office uses are added it is still not always unique. The address itself would also not be unique, as more than one resident in a household could be a customer. In fact the best course of action would be to create a unique id for the table, if the address is going to split away like this, as any combination for a composite key would still not remove the possibility of duplicate keys. The tutorialspoint example is not a good example or correct example of 3nF.

The basics to look for in each normalization form are this:

  • 1nF: Make sure there aren't duplicate columns (horizontal).
  • 2nF: break tables up until there is only one purpose. Like breaking apart a customer table that includes orders so that customers and orders are separated.
  • 3nF: "Transitive" basically just means can one column be determined by another without looking at the primary key. Like a customer order table may have the orderID, customer, manufacturer, and product as columns. The product may be determined by the manufacturer, so the product column or vice versa doesn't actually have to rely on the order number. This would be a table where it could be split into two tables where the productID or manufacturer can be the primary key. The new tables primary key would then be used as a foreign key in the order table.
  • 4nF: Make sure there is only on piece of data in each column. Let's use the manufacturer table. If the manufacturer column is the primary key, there could be more than one product in the same row/column. I guess this isn't the best example, but I hope you get the idea. So for 4nF one would make sure that having more than one piece of data, like multiple products, in the same spot doesn't happen.

3nF and 4nF aren't always used by companies as strict standards, but they are good to know and use when possible. Also, like others have mentioned using an ID as a primary key instead of one of the columns can be very useful. For example, instead of making a composite key out of the customers first name, last name, and user name or storing sensitive information like their SSN, an auto-generated ID can be used instead as the primary key.

3

The example is clearly wrong.

It is a good example of over-normalization. Where you by normalizing everything theoretically possible create a solution that does not hold in the future.

Zip codes even those in the US that might look rather permanent (I'm not an expert in US zip codes but I give the author of the tutorial the benefit of the doubt) and may define the street are not a good key for addresses in general.

In some countries it does make sense to have a zip code table to help the user enter an address, but the address needs to be save for each record, not to be referenced through a key.

But streets changes names, are rebuild, split into two zip codes when new housings are built along them etc.

In a little time one of the addresses in your database is going to change, perhaps the street name is changed. Now you edit the entry for the zip code and all the other addresses using that zip code are now wrong and you will later have no idea why, and you no longer have a clue of what the address was before the change.

Where I grew up you could derive the zip code from the first 4 digits of the phone number. If you go by the logic of the author of the tutorial both could not be in the customers table.

Bent
  • 2,566
  • 1
  • 14
  • 18
  • 2
    Check out https://postalpro.usps.com/postal-bulletin-changes. The USPS makes changes to zip codes multiple times per year. I think for the USPS, Postal Code = Post Office. If they close or open a new post office, that will cause zip codes to change. – Greg Burghardt Nov 28 '18 at 16:43
  • In Germany, after the re-unification, for the first short time period, ZIP codes were not unique (there were places in the former GDR and the FRG that had the same ZIP code), and then they were changed (expanded from 1-4 digits to a fixed 5 digits). – Jörg W Mittag Nov 28 '18 at 16:53