
Let's say I have an astonishingly big table (300 columns) with lots of records (in the millions).

This table is overly complicated to use. I'd like to refactor this table and split it in multiple more meaningful tables.

I have only surface knowledge of databases: I'm more a user than a database developer. What are the difficulties when splitting large tables with lots of entries, and how should I proceed properly?

I've tried googling it, but I didn't find information on how to split a table. Can you propose an approach or some advice on how to proceed?

Doc Brown
JayZ
Having a lot of columns in a single table instead of spreading them over multiple normalized tables isn't necessarily bad design. There are many cases where this might improve performance a lot. Instead of splitting it into multiple tables, you could consider providing views on it, which make some BL aspects easier to handle. – πάντα ῥεῖ May 07 '20 at 11:28
No one in the team has a background in database engineering, so we're mainly making wild guesses. However, speed is not an issue, and we're often querying all fields when only a handful would suffice. Plus, as I said, it's a pain to deal with in SQL Developer. – JayZ May 07 '20 at 12:23
  • @πάνταῥεῖ How to evaluate OP’s very practical argument in this regard: "This table is overly complicated to use” ? – Christophe May 07 '20 at 17:56
@Christophe If the interpretation of a row depends on certain column values, those dependencies could be hidden in specialized views. – πάντα ῥεῖ May 07 '20 at 18:36

2 Answers


Such a step requires taking several aspects into account:

  1. Database Normalization

    is about how to split the data properly, in a way that reduces redundancy. The challenge here is that when you already have an unnormalized system with lots of redundant data, it may be really hard to bring the data into a less redundant state. Redundancy usually goes along with inconsistencies, and you have to make decisions about which of two different representations of the same information is the "correct" one.
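Before extracting redundant columns into their own table, it helps to find out where the redundant copies already disagree, because each conflict needs a decision. A minimal sketch, using SQLite via Python and invented table/column names:

```python
import sqlite3

# Hypothetical denormalized table: customer data is repeated on every
# order row, and the copies have drifted apart over time.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders_wide (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        customer_name TEXT,   -- redundant: depends only on customer_id
        amount REAL
    );
    INSERT INTO orders_wide VALUES
        (1, 100, 'ACME Corp',  50.0),
        (2, 100, 'ACME Corp.', 75.0),  -- inconsistent spelling
        (3, 200, 'Globex',     20.0);
""")

# List customer_ids whose redundant name column is inconsistent --
# these rows need a human decision before normalization.
conflicts = con.execute("""
    SELECT customer_id, COUNT(DISTINCT customer_name) AS variants
    FROM orders_wide
    GROUP BY customer_id
    HAVING variants > 1
""").fetchall()
print(conflicts)  # [(100, 2)]
```

Queries like this won't resolve the inconsistencies for you, but they turn "the data is messy" into a concrete, finite worklist.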

  2. Schema and data migrations

    are techniques to actually change the database structure without data loss, for example, by using "ALTER TABLE" or "UPDATE" SQL statements. The challenge here is to apply them error-free to a larger production DB, because if a migration fails, this can ruin the whole database, which means one needs to make a full recovery from the latest backup. And you don't want to repeat such a step several times in production in case the migration and recovery take a while.
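One standard safeguard is to run each migration step inside a transaction, so a failure rolls everything back instead of leaving the schema half-changed. A sketch with SQLite via Python (table and column names are made up; on a real production DB you would still want a tested backup first):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY, a TEXT, b TEXT)")
con.execute("INSERT INTO big_table VALUES (1, 'x', 'y')")

try:
    # The connection used as a context manager commits the transaction on
    # success and rolls it back if any statement raises.
    with con:
        con.execute("ALTER TABLE big_table ADD COLUMN a_normalized TEXT")
        con.execute("UPDATE big_table SET a_normalized = UPPER(a)")
except sqlite3.Error:
    pass  # in production: alert and recover, don't silently swallow

row = con.execute("SELECT a_normalized FROM big_table WHERE id = 1").fetchone()
print(row[0])  # X
```

Note that transactional DDL support varies by DBMS (PostgreSQL and SQLite support it; some others commit DDL implicitly), which is worth checking before relying on this pattern.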

  3. Compatibility issues / dependencies.

    A 300 column table is likely to be used in several places of a large system. The challenge here is to find all those places and adapt them to the new structure properly. This can become extremely hard when dealing with a legacy code base with several millions of LOC, as you mentioned in your other question. I would bet such code does not have a compile-time safety net for structural database changes, so one will need tons of tests to make sure not to break anything.

    This aspect is also coupled to the question of making the changes gradually. If you have only a few places in code where the data is written (ideally just one), but several places where it is read, then introducing views might be of help. That will allow at least the code which reads the data to be migrated to a new structure in small steps.
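The view technique can be sketched like this (SQLite via Python, with invented names): reading code switches to a narrow view first, while writers keep hitting the big table, so the physical split can happen later without touching the readers again.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE big_table (id INTEGER PRIMARY KEY,
                            name TEXT, street TEXT, city TEXT, balance REAL);
    INSERT INTO big_table VALUES (1, 'Ann', 'Main St', 'Springfield', 12.5);

    -- Readers migrate to this view; writers still use big_table.
    -- Later, the view can be replaced by a real table with the same shape.
    CREATE VIEW customer_address AS
        SELECT id, street, city FROM big_table;
""")
city = con.execute(
    "SELECT city FROM customer_address WHERE id = 1").fetchone()[0]
print(city)  # Springfield
```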

  4. Performance.

    Splitting a 300 column table into several smaller ones will usually affect performance, sometimes for the better, sometimes for the worse. This depends heavily on the queries and access patterns. So one needs to be careful not to introduce unexpected performance bottlenecks, which can only be detected by testing and measuring.
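A rough sketch of such a measurement (SQLite via Python; the schema, row count, and "hot"/"cold" column names are invented, and real measurements need production-like data and realistic queries):

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE wide (id INTEGER PRIMARY KEY, hot TEXT, cold1 TEXT, cold2 TEXT);
    CREATE TABLE hot_part (id INTEGER PRIMARY KEY, hot TEXT);
""")
# Same logical data twice: once wide, once vertically split.
con.executemany("INSERT INTO wide VALUES (?,?,?,?)",
                [(i, f"h{i}", "x" * 100, "y" * 100) for i in range(10_000)])
con.executemany("INSERT INTO hot_part VALUES (?,?)",
                [(i, f"h{i}") for i in range(10_000)])

def timed(sql):
    t0 = time.perf_counter()
    rows = con.execute(sql).fetchall()
    return time.perf_counter() - t0, len(rows)

t_wide, n1 = timed("SELECT hot FROM wide")      # drags the cold columns along
t_hot, n2 = timed("SELECT hot FROM hot_part")   # touches only what it needs
assert n1 == n2 == 10_000                        # both return the same rows
print(f"wide: {t_wide:.4f}s  hot_part: {t_hot:.4f}s")
```

The point is not the absolute numbers, but that each candidate split should be benchmarked against the queries the system actually runs.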

Doc Brown

Complex is not the same as complicated:

  • complex means that there are many parts;
  • complicated means that it is difficult to understand;

Complex tables but simple design

Complexity is not bad. A table with many columns may be perfectly legitimate and simply represent items / facts / events with many attributes:

  • The number of columns just shows the wealth of data collected on a complex reality;
  • The table could nevertheless stay simple to understand: just focus on the columns of interest, select the relevant rows, aggregate them and/or sort them.

Complex table and complicated use

When complexity starts to be an issue, it may just be because the number of parts exceeds our capacity. In this case, the most suitable strategy is the old Roman maxim "divide et impera" (divide and rule):

  • If many queries appear similar, just create views. A view stores a base query and allows you to use it as if it were a smaller table. This makes repetitive querying much simpler. The real storage of data remains the unchanged big table.
  • If groups of columns are used very differently, the table could be divided into several lighter tables with fewer columns. The only constraint is that some shared key is stored in both tables, so that the rows can be reassembled (with a join). This kind of splitting is delicate, and you'll also need to adapt the data feeding into your database.
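The shared-key idea can be sketched as follows (SQLite via Python; all names are invented): two lighter tables carry the same primary key, and a join reassembles the original rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Two lighter tables sharing the primary key product_id.
    CREATE TABLE product_core  (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product_stats (product_id INTEGER PRIMARY KEY, total REAL);
    INSERT INTO product_core  VALUES (1, 'Widget');
    INSERT INTO product_stats VALUES (1, 99.5);
""")

# The join reassembles what used to be a single wide row.
row = con.execute("""
    SELECT c.name, s.total
    FROM product_core c
    JOIN product_stats s ON s.product_id = c.product_id
""").fetchone()
print(row)  # ('Widget', 99.5)
```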

Too complicated to use: bad design?

Often, though, the large number of columns appears to reveal a bad design that was made unnecessarily complicated. Generally this situation grows historically: initially the design and the table is simple. Columns are added over and over again without too much thought. And one day the table's a monster.

The two main issues are:

  • Repetitive columns: this happens when data is squeezed into a single table instead of being better structured as separate but related tables. Feeding in the data is simple, and some queries are very simple. But many queries become overly complex due to a long enumeration of columns, and advanced queries become difficult to automate.

  • Merging unrelated data: a table is meant for storing related data. If one part of the columns is independent of another part, you have mixed apples and oranges, and some data is repeated unnecessarily.

In both cases, the cure is simple: put together what belongs together, but keep separate independent things. In database jargon, this exercise is called normalisation.

How to do it?

As explained, just identify the independent data to put it in separate tables:

  • for repetitive columns, the trick is to find the missing parameter that explains the repetition, and extract the relevant data into a separate table where the repetitive columns become rows (e.g. 12 rows with a month number, instead of 12 columns that each correspond to a specific month).
  • for independent data, the trick is to map the columns to the independent things they describe (e.g. product and product name belong together, product and monthly result also belong together, but product name is not directly related to monthly results, so keep it in a separate table).
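The columns-to-rows transformation can be sketched with a view built from one `UNION ALL` branch per former column (SQLite via Python; only 3 months shown, and the names are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales_wide (product_id INTEGER, jan REAL, feb REAL, mar REAL);
    INSERT INTO sales_wide VALUES (1, 10.0, 20.0, 30.0);

    -- One UNION ALL branch per former month column: the month number is
    -- the "missing parameter" that explains the repetition.
    CREATE VIEW sales_long AS
        SELECT product_id, 1 AS month, jan AS amount FROM sales_wide
        UNION ALL
        SELECT product_id, 2, feb FROM sales_wide
        UNION ALL
        SELECT product_id, 3, mar FROM sales_wide;
""")
rows = con.execute(
    "SELECT month, amount FROM sales_long ORDER BY month").fetchall()
print(rows)  # [(1, 10.0), (2, 20.0), (3, 30.0)]
```

Once the long form is materialized as a real table, adding a 13th month is an `INSERT`, not a schema change.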

Changing a database schema is not easy, and you need to completely overhaul the way the data is captured in the database. Carrying out the change can be done in steps according to the following general approach (to be tailored to your needs):

  • Create views for the independent things.
  • Create a join on all these views to verify that you are able to find back the same data as in the big table.
  • Create some more advanced views for the views that contain repetitive columns, to show the data as rows (hint: this often requires some SQL UNION).
  • Create some queries to verify that you can assemble aggregates on these advanced views (e.g. total per product) with the rest of the data (e.g. product table), and check if you can find the same aggregates from the big table.
  • If all these verifications are successful, you should set your DB offline, and copy the data of each view into its own table, for example using CREATE TABLE AS SELECT.
  • Rename your big table, and once you are sure that everything is fine and no query relies on it any longer, drop it.
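The steps above can be walked through end to end on a toy schema (SQLite via Python; every name and column here is illustrative): views first, a verification join, then materializing each view with `CREATE TABLE ... AS SELECT`, and finally renaming the big table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE big_table (id INTEGER PRIMARY KEY, name TEXT, total REAL);
    INSERT INTO big_table VALUES (1, 'Widget', 99.5);

    -- Step 1: one view per independent thing.
    CREATE VIEW v_core  AS SELECT id, name  FROM big_table;
    CREATE VIEW v_stats AS SELECT id, total FROM big_table;
""")

# Step 2: verify that joining the views finds back every row of the
# big table (count the rows that have no exact match).
diff = con.execute("""
    SELECT COUNT(*) FROM big_table b
    LEFT JOIN (SELECT c.id, c.name, s.total
               FROM v_core c JOIN v_stats s ON s.id = c.id) j
      ON j.id = b.id AND j.name = b.name AND j.total = b.total
    WHERE j.id IS NULL
""").fetchone()[0]
assert diff == 0  # nothing lost or altered

# Step 3 (offline): copy each view's data into its own table, then
# retire the big table.
con.executescript("""
    CREATE TABLE core  AS SELECT * FROM v_core;
    CREATE TABLE stats AS SELECT * FROM v_stats;
    DROP VIEW v_core;
    DROP VIEW v_stats;
    ALTER TABLE big_table RENAME TO big_table_retired;
    -- ...and only once nothing queries it any more:
    -- DROP TABLE big_table_retired;
""")
name = con.execute("SELECT name FROM core WHERE id = 1").fetchone()[0]
print(name)  # Widget
```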
Christophe