2

I'm designing the data model for a project where users can create documents, save revisions of that document that can be reverted to, and publish documents. The basic data model I have right now is something like this:

Documents
---------
id: integer PK
currentRevisionId: integer FK references DocumentRevisions(id)
publishedRevisionId: integer FK references DocumentRevisions(id)


DocumentRevisions
-----------------
id: integer PK
documentId: integer FK references Documents(id)
documentBody: text

I need to know what the published and current revisions are for a document, but I also need to know what document a revision refers to. I'm not sure how to model this without creating this circular reference.

mplis
  • 169
  • 5
  • 1
    This looks like a situation where a circular reference is a good approach – paj28 Apr 22 '16 at 14:10
  • 1
    possible duplicate of [When is a circular reference to a parent pointer acceptable?](http://programmers.stackexchange.com/questions/306518/when-is-a-circular-reference-to-a-parent-pointer-acceptable) –  Apr 22 '16 at 14:14

4 Answers4

2

Here's my two cents.

  • If you want to know the current revision you should query the REVISION table for the max(revision_id) for that document_id. The same for PUBLICATION. You can have a view V_DOCUMENT with that already resolved via a join, so to the programmers it would be as if there's a V_DOCUMENT "table" with a CURR_REV_ID and a CURR_PUB_REV_ID.
  • If you really want to de-normalize that information, i.e, have it pre-calculated in DOCUMENT, then you should add a trigger on REVISION and PUBLICATION to keep those columns updated in DOCUMENT, but those wouldn't be FKs pointing back to REVISION and PUBLICATION, thus avoiding circular references.
  • It's a good practice to separate binary large objects like images or very long text content to its own table (even when the conceptual relationship would be one-to-one) for performance reasons as well as for storage administration reasons. The main table will be queried and read a lot, so taking the BLOB out is good. A DBA can also make sure the BLOB is stored in a separate tablespace, meaning writing such large objects doesn't impact the speed of the transactional access of the other tables. Also such tablespace can be put in a slower, cheaper storage.

Conceptual:

enter image description here

Physical:

enter image description here

Tulains Córdova
  • 39,201
  • 12
  • 97
  • 154
  • What's the reasoning behind separating `content` from `revision`? – mplis Apr 22 '16 at 17:51
  • 1
    @mplis It's a good to separate binary large objects like images or very long text content from the main table (even when the conceptual relationship would be one-to-one) for performance reasons as well as for storage administration reasons. The main table will be queried and read a lot, so taking the BLOB out is good. A DBA can also make sure the BLOB is stored in a separate tablespace, meaning writing such large objects doesn't impact the speed of the transactional access of the other tables. Also such tablespace can be put in a slower, cheaper storage. – Tulains Córdova Apr 22 '16 at 18:00
  • @mplis I added my comment to the body of the answer as a third bullet point. – Tulains Córdova Apr 22 '16 at 18:03
  • This is excellent, thank you! I like having the `publication` table, but one thing I'm having trouble figuring out (I know this wasn't mentioned in OP) is how to unpublish a document. Could maybe the `publication` table have another column with a `published` flag set to `true` on publish and set to `false` on unpublish? – mplis Apr 22 '16 at 21:29
  • You could simply erase the row. Or you could add a flag to invalidate a publication. – Tulains Córdova Apr 22 '16 at 23:23
1

First: naming. The table called "Documents" above stores knowledge of revisions, not docs or contents. Perhaps it could be called "DocumentRevisions". Second: separate the document bodies table's index pk from document id, and store variations on the document body in the table with the same document id, yet different table index. In the end you get:

Documents
---------
document_id: integer PK
title: text
other_info_common_to_all_revs_of_the_doc: text

DocumentContents
---------
document_contents_id: integer PK
document_id: integer FK references Documents(document_id)
revision: integer
document_body: text

DocumentRevisions
-----------------
document_revisions_id: integer PK
document_id: integer
current_revision: integer 
published_revision: integer
-- table has composite FK (document_id,current_revision) references DocumentContents(document_id,revision)
-- table has composite FK (document_id,published_revision) references DocumentContents(document_id,revision)

This ensures that all document bodies are stored with RI to their doc id and revision.

The composite FK's ensure RI on doc/rev.

scorpdaddy
  • 206
  • 1
  • 3
  • So how does this work if I want to update a document (e.g. document_id = 1) to a new revision? Is it something like: 1) add new row in `DocumentContents` with document_id = 1 with new `body` 2) update `current_revision` column in `DocumentRevisions` where document_id = 1 to point to newly created row in `DocumentContents`? – mplis Apr 22 '16 at 15:46
  • Yes. Precisely. And in that order. – scorpdaddy Apr 22 '16 at 17:41
0

What you are doing seems reasonable, the only trick is you'll need to insert the Document with null revisions, then add the revisions, then update the Document. When adding revisions, you just add the revision then update the Document.

An alternative approach is to have "current" and "published" flags on the revisions. If you have partial unique index support you can let the database ensure only one revision is flagged per document:

Documents
---------
id: integer PK


DocumentRevisions
-----------------
id: integer PK
documentId: integer FK references Documents(id)
current: boolean not null
published: boolean not null
documentBody: text

-- Example partial unique indexes
create unique index DocumentRevisions_current_unique on DocumentRevisions (documentId) where current;
create unique index DocumentRevisions_published_unique on DocumentRevisions (documentId) where published;

-- You then select for current revision
select * from DocumentRevisions where documentId = ? and current;
Dan Armstrong
  • 509
  • 2
  • 4
0

I suggest using the Command pattern and an Event Store.

Instead of complete revisions of a document, break each change down into a command.

Ie InsertTextAt(...)

Store the data needed to execute the command in the db.

As the user edits the document you build up a list of these commands.

When you recreate the document you load all the commands from the db and rerun them in sequence.

This allows you to do undo style functionality etc

Ewan
  • 70,664
  • 5
  • 76
  • 161