What are the best practices around retiring obsolete database columns?

Question

I'm designing an application which will in an early stage collect data A, B, and C from clients, but later on will instead collect data A, B, and D.

A, B, C, and D are very related and right now exist as columns of a single database PostgreSQL table T.

Once C is no longer needed, I want to remove its references from my application (I use the Django ORM), but I want to keep the data that was already entered. What is the best way to do so?

I've thought of creating a new table for ABD, but that means that might cause issues with any rows referencing table T.

I could just leave column C along, and remove references to it in the code, allowing the existing data to survive.

Is there a better option I'm not seeing?

Some extra details:

The number of rows will not be big, most likely 1-2 per user. This is a mass market application, but by the time I switch from C to D, the userbase will not be very large yet. C and D will likely not be collected at the same time, although that is a possibility. C and D likely represent multiple columns each, not just one each.

I think the correct way to approach this depends on if you require to distinguish between rows which have been collected from {A,B,C}, and those collected from {A,B,D}, and if yes, if your current data model allows this. And it will also depend on what you are going to do with those rows collected from {A,B,C} - the new version of the application shows them as {A,B,D} with an empty "D", but a user does not see the content of column C, he might be tempted to delete that row from the db (if the app allows deletion of rows), since he does not see the content. — Doc Brown, Jan 11 '18 at 11:39
**[Why is asking a question on “best practice” a bad thing?](https://meta.stackexchange.com/a/142354/)** — , Jan 11 '18 at 16:30
Are there ever any rows with C and D collected at the same time? Or will it always be A,B,C,Null or A,B,Null,D? If you have C,D on the same rows for a short period... whats the reason for not having A,B,C and and A,B,D tables? Are we talking... hundreds of rows of data? Millions? billions? Is response time a factor? Lots of details that make each situation unique... — WernerCD, Jan 11 '18 at 16:42
Either you use the column or you dont. Use it, keep it. Dont, drop it. If you want to keep the data around, move it to a different table (no foreign key constraint) or export. — Thaylon, Jan 12 '18 at 13:51

score 31 · Accepted Answer · answered Jan 11 '18 at 10:45

31

If you want to keep the data, then it's not obsolete. Just leave it where it is. It's fine if some class mapped to a table doesn't map every column.

answered Jan 11 '18 at 10:45

kevin cline

33,608
3
71
142

1

you might end up with a lot of null columns after a while – Ewan Jan 11 '18 at 12:51
@Ewan, when that happens they can look at refactoring the database a bit. – Berin Loritsch Jan 11 '18 at 13:59
8

maybe they could ask for a best practice approach on stackexchange.... when that happens – Ewan Jan 11 '18 at 15:00
@Ewan You definitely will end up with nulls if you're no longer populating that column. But that shouldn't really be a problem. I mean, so there's empty data. It's one of those "yeah, so what" kind of things, I think. – corsiKa Jan 11 '18 at 18:27
8

I guess my annoyance with this kind of answer is that, sure you can get away with it, but its tech debt. Eventually you want a real solution and not have to explain to all the new hires why your now tech giant best in class company has random columns that arent used scattered through your db – Ewan Jan 11 '18 at 18:34
1

I see @Ewan 's point, but for my use case this should do. Things may be oversimplified in my head, but it should be quite straightforward to run a data migration script later, if the need arises, to copy the C data into a new table with reference to the original row in the T table, and then delete the C columns from the T table. – Jad S Jan 11 '18 at 23:01
3

@Ewan -- suppose column obsolescence is not going to happen just once -- it may happen in several times, as design requirements are discovered or change. If the alternative to a null column is to split to separate tables (eg. an inheritance structures) anytime a column becomes obsoleted, the database will be littered with join-tables for obsolete columns. I believe this is quite likely to end up worse. – Thomas W Jan 11 '18 at 23:58

Ewan · Answer 2 · 2018-01-11T16:11:50.073

8

OK so your situation is that you want the old rows to have property C but the new ones not.

This is equivalent to having a class inheritance relationship

class All
{
    string A;
    string B;
}

class Old : All
{
    string C;
}

class New : All
{
    string D;
}

which you would represent on the database with three tables with 1 to 1 relations

table All
    id varchar
    A varchar
    B varchar

table Old
    id varchar
    C  varchar

table New
    id varchar
    D  varchar

So you could create a migration script to create the new Old table, copy the id and C data to it and remove the C column from the All table.

Updating your code as required with the new sql;

Alternatively, If you just require to be able to query the old C data you could make a new Archive table with A,B,C copy all the data and remove the C column, add the D col to your 'Live' table

edited Jan 11 '18 at 16:11

answered Jan 11 '18 at 11:06

Ewan

70,664
5
76
161

1

If I split tables, I'd rather take three of them: {A,B} {C} {D} – Aconcagua Jan 11 '18 at 11:47
that doesnt match the example? – Ewan Jan 11 '18 at 12:50
wait. i miss read – Ewan Jan 11 '18 at 16:13

score 2 · Answer 3 · answered Jan 11 '18 at 15:17

If data storage could be a concern, then split the tables: key/A/B key/C key/D

You can perform access either via a view (definition of the data location in the db) or via changing the ORM definition.

This is not the most performant (a join is involved), but it can present any combination of A/B/C/D over time without changing the underlying storage & depending on your real access patterns it may be sufficient.

You may not be fortunate with the ability to take downtime, restructure tables etc. in a production system.

Performing the access via the view allows you to switch from A/B/C to A/B/C/D to A/B/D in the underlying table with minimal change and no data movement. A view will be transparent to the read logic and if your dbms supports either functions or updateable views then transparent to the write logic as well.

Really I think your decision will reflect lots of the real-world concerns: 1) what are datatypes C & D 2) the relative data volumes collected for C/D 3) Relative overlap of C/D data compared to purely C or D entries 4) Downtime/maintenance window availability and duration 5) DBMS Support for updateable views 6) Desirability of keeping db physical structure details in the ORM vs making it transparent by presenting via views/functions in the db (where it is the same for all accessing applications, not just the current one)

My answer preferred for large/complex datatypes for (1), little overlap for (3) and minimal downtime for (4), ideally with good dbms support in (5) and multiple applications accessing the data in (6)

But there is no right/wrong for lots of alternativeS: - start with A/B/C, later add D, adjusting ORM, still later drop column C - start with A/B/C/D & ignore nulls etc. I think, consider your solution & what you know of its intended purpose/lifecycle, do some size/volume modelling & expect to change things later as not everything will turn our as expected.

score 1 · Answer 4 · answered Jan 11 '18 at 10:45

Removing references & orphaning the data is a low risk option.

There are always possible unknown 'backdoor' uses of the data which may or may not be important to expose by removing the column.

Depending on the content of column C there could be a minor performance issue when the DB internally does full table scans or attempts to pull the whole table into memory during joins if the optimizer sees this as being more efficient than using indexes.

Applications may be reading the whole table a times rather than selected columns - but if you're using an ORM exclusively then this is unlikely.

Robbie Dee · Answer 5 · 2018-01-11T16:11:37.130

1

Many things to consider here but you might want to consider adding a view to overlay the table rather than making changes to the table directly. That way, it is only the view that needs to change.

I don't know Django ORM, but it could be a possibility.

edited Jan 11 '18 at 16:11

answered Jan 11 '18 at 14:16

Robbie Dee

9,717
2
23
53

2

OP said they'e using Postgres. – TripeHound Jan 11 '18 at 15:32
Thanks - didn't see a tag. I'll edit the Q. – Robbie Dee Jan 11 '18 at 16:02

score 0 · Answer 6 · answered Jan 11 '18 at 12:48

0

You have a Table A with columns a, b, c.
Create a new Table B with columns a, b, d.
Migrate your data to Table B.
Move your foreign keys to table A to table B.

You can now use Table B and you still have your old data for reference.

answered Jan 11 '18 at 12:48

Carra

4,261
24
28

What are the best practices around retiring obsolete database columns?

6 Answers6