
After a couple of useful answers on whether I should use a domain object or a unique id as a method/function parameter (Identifier vs domain object as a method parameter), I have a similar question about members, which the previous question's discussion didn't cover. What are the pros and cons of using unique IDs as members vs objects as members? I'm asking in reference to strongly typed languages like Scala/C#/Java. Should I have (1)

User( id: Int, CurrentlyReadingBooksId: List[Int])
Book( id: Int, LoanedToId: Int )

or (2), which I prefer over (1) after going through "Should we define types for everything?"

User( id: UserId, CurrentlyReadingBooksId: List[BookId] )
Book( id: BookId, LoanedToId: UserId )

or (3)

User( id: Int, CurrentlyReadingBooks: List[Book]) 
Book( id: Int, LoanedTo: User)

While I cannot think of benefits to having the object (3), one benefit to having IDs (2) and (1) is that when I am creating the User object from the DB, I don't have to create the Book object, which may in turn depend on the User object itself, creating an endless chain. Is there a generic solution to this problem for both RDBMS and NoSQL (if they differ)?
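For concreteness, option (2) can be sketched in Java (a hypothetical sketch mirroring the Scala-style signatures above; all class names are illustrative). The point of the wrapper types is that a BookId can never be passed where a UserId is expected:

```java
// Hypothetical sketch of option (2): dedicated ID wrapper types instead of raw ints.
import java.util.List;

public class Library {
    public static final class UserId {
        public final int value;
        public UserId(int value) { this.value = value; }
        @Override public boolean equals(Object o) {
            return o instanceof UserId && ((UserId) o).value == value;
        }
        @Override public int hashCode() { return value; }
    }

    public static final class BookId {
        public final int value;
        public BookId(int value) { this.value = value; }
        @Override public boolean equals(Object o) {
            return o instanceof BookId && ((BookId) o).value == value;
        }
        @Override public int hashCode() { return value; }
    }

    public static final class User {
        public final UserId id;
        public final List<BookId> currentlyReadingBookIds;
        public User(UserId id, List<BookId> currentlyReadingBookIds) {
            this.id = id;
            this.currentlyReadingBookIds = currentlyReadingBookIds;
        }
    }

    public static final class Book {
        public final BookId id;
        public final UserId loanedToId;
        public Book(BookId id, UserId loanedToId) {
            this.id = id;
            this.loanedToId = loanedToId;
        }
    }

    public static void main(String[] args) {
        User user = new User(new UserId(1), List.of(new BookId(7)));
        Book book = new Book(new BookId(7), user.id);
        // new Book(user.id, ...) would be a compile error: UserId is not a BookId
        System.out.println(user.currentlyReadingBookIds.contains(book.id)); // true
    }
}
```

Note that neither entity holds a reference to the other, so constructing a User never forces constructing its Books.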

Based on some answers so far, rephrasing my question (assuming IDs are wrapped types): 1) Always use IDs? 2) Always use objects? 3) Use IDs when there is a risk of recursion in serializing and deserializing, but use objects otherwise? 4) Anything else?

EDIT: If you answer that objects should be used always or in some cases, please make sure to address the biggest concern other answerers have raised: how to get the data from the DB.

0fnt
    Thanks for the good question, look forward to following this with interest. A bit of a shame that your username is "user18151", people with this kind of username get ignored by some :) – bjfletcher Jun 06 '15 at 17:04
  • @bjfletcher Thank you. I did have that nagging perception myself, but it never occurred to me why! – 0fnt Jun 07 '15 at 02:19

3 Answers

8

Using domain objects in place of ids creates some complex/subtle problems:

Serialization/Deserialization

If you store objects as keys, serializing the object graph becomes extremely complicated. You will get stack-overflow errors when doing a naive serialization to JSON or XML because of the recursion. You will then have to write a custom serializer that converts the actual objects to their ids instead of serializing the object instances and triggering the recursion.

Pass in objects for type safety but only store their ids; then you can have an accessor method that lazy-loads the related entity when it is called. Second-level caching will take care of subsequent calls.

Subtle reference leaks:

If you use domain objects in constructors like you have there, you will create circular references that make it very difficult for memory to be reclaimed for objects that are not actively being used.

Ideal Situation:

Opaque ids vs int/long:

An id should be a completely opaque identifier that carries no information about what it identifies. But it should offer some verification that it is a valid identifier in its system.

Raw types break this:

int, long, and String are the most commonly used raw types for identifiers in RDBMS systems. There is a long history of practical reasons dating back decades, and they are all compromises aimed at saving space, saving time, or both.

Sequential ids are the worst offenders:

When you use a sequential id, you are packing temporal semantic information into the id by default. That is not bad until it is relied upon: when people start writing business logic that sorts or filters on that semantic quality of the id, they are setting up a world of pain for future maintainers.

String fields are problematic because naive designers will pack information into the contents, usually temporal semantics as well.

These also make it impossible to create a distributed data system, because 12437379123 is not globally unique. With enough data in the system, it is pretty much guaranteed that another node in a distributed system will create a record with the same number.

Then hacks begin to work around it, and the entire thing devolves into a steaming mess.

Ignoring huge distributed systems (clusters), it becomes a complete nightmare when you start trying to share the data with other systems as well, especially when the other system is not under your control.

You end up with the exact same problem, how to make your id globally unique.

UUID was created and standardized for a reason:

UUID can suffer from all the problems listed above depending on which Version you use.

Version 1 uses a MAC address and time to create a unique id. This is bad because it carries semantic information about location and time. That is not in itself a problem; it becomes one when naive developers start relying on that information for business logic. It also leaks information that could be exploited in an intrusion attempt.

Version 2 uses a user's UID or GID and a domain UID or GID in place of the time from Version 1. This is just as bad as Version 1 for data leakage and for the risk of this information being used in business logic.

Version 3 is similar but replaces the MAC address and time with an MD5 hash of some byte[] from something that definitely has semantic meaning. There is no data leakage to worry about; the byte[] cannot be recovered from the UUID. This gives you a good way to deterministically create UUID instances from an external key of some sort.

Version 4 is based only on random numbers, which is a good solution; it carries absolutely no semantic information, but it is not deterministically re-creatable.

Version 5 is just like Version 3 but uses SHA-1 instead of MD5.
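For reference, the JDK's java.util.UUID covers two of these out of the box: `randomUUID()` produces Version 4 and `nameUUIDFromBytes(...)` produces Version 3 (MD5, name-based); Version 5 needs a library or a hand-rolled SHA-1 variant. The example key string below is made up for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class UuidVersions {
    // Deterministic Version 3 UUID from an external key (MD5, name-based).
    static UUID fromKey(String key) {
        return UUID.nameUUIDFromBytes(key.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        UUID v4 = UUID.randomUUID();          // Version 4: random, not re-creatable
        UUID v3 = fromKey("com.example.Book:42");
        System.out.println(v4.version());     // 4
        System.out.println(v3.version());     // 3
        // The same key always yields the same UUID:
        System.out.println(v3.equals(fromKey("com.example.Book:42"))); // true
    }
}
```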

Domain Keys and Transactional Data Keys

My preference for domain object ids is to use Version 5, or Version 3 if restricted from using Version 5 for some technical reason.

Version 3 is great for transaction data that might be spread across many machines.

Unless you are constrained by space use a UUID:

They are guaranteed unique; when dumping data from one database and reloading it into another, you never have to worry about duplicate ids that actually reference different domain data.

Versions 3, 4, and 5 are completely opaque, and that is the way they should be.

You can have a single column as the primary key with a UUID and then you can have compound unique indexes for what would have been a natural composite primary key.

Storage does not have to be CHAR(36) either. You can store the UUID in a native byte/bit/number field for a given database as long as it is still indexable.

Legacy

If you have raw types and cannot change them, you can still abstract them away in your code.

Using a Version 3/5 UUID, you can pass in Class.getName() + String.valueOf(int) as a byte[] and have an opaque reference key that is re-creatable and deterministic.
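That technique can be sketched as follows (a minimal illustration of the idea using the JDK's Version 3 `nameUUIDFromBytes`; the method name `opaqueId` is made up):

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class LegacyIds {
    // Derive a deterministic, opaque UUID from a legacy raw int key,
    // namespaced by the entity class so ids from different tables never clash.
    static UUID opaqueId(Class<?> entity, int rawId) {
        byte[] name = (entity.getName() + String.valueOf(rawId))
                .getBytes(StandardCharsets.UTF_8);
        return UUID.nameUUIDFromBytes(name); // Version 3 (MD5, name-based)
    }

    public static void main(String[] args) {
        // Re-creatable: the same class + raw id always map to the same UUID.
        System.out.println(opaqueId(String.class, 42)
                .equals(opaqueId(String.class, 42)));  // true
        // Namespaced: the same raw id under different classes does not collide.
        System.out.println(opaqueId(String.class, 42)
                .equals(opaqueId(Integer.class, 42))); // false
    }
}
```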

  • I am very sorry if I wasn't clear in my question, and I feel all the worse (or actually good) because this is such a great and well-thought answer and you clearly spent time on it. Unfortunately it doesn't fit my question, maybe it deserves a question of its own? "What should I keep in mind when creating id field for my domain object"? – 0fnt Jun 06 '15 at 02:57
  • I added an explicit explanation. –  Jun 06 '15 at 15:27
  • Got it now. Thanks for spending time on the answer. – 0fnt Jun 06 '15 at 15:33
  • 1
    Btw, AFAIK generational garbage collectors (which I believe is whats the dominant GC system these days) should not have too much difficulty in GC'ing circular references. – 0fnt Jun 06 '15 at 15:42
  • 1
    if `C-> A -> B -> A` and `B` is put into a `Collection` then `A` and all its children are still reachable, these things are not completely obvious and can lead to subtle *leaks*. `GC` is the least of the problems, serialization and deserialization of the graph is a nightmare of complexity. –  Jun 06 '15 at 16:15
2

Yes, there are benefits to either way, and there's also a compromise.

List<int>:

  • Save memory
  • Faster initialization of type User
  • If your data comes from a relational database (SQL), you don't have to access two tables to get users, just the Users table

List<Book>:

  • Accessing a book from a user is faster, because the book has been preloaded into memory. This is nice if you can afford a longer start-up in order to get faster subsequent operations.
  • If your data comes from a document store database like HBase or Cassandra then the values of books read are likely on the User record, so you could have easily gotten the books "while you were there getting the user".

If you have no memory or CPU concerns, I would go with List<Book>; the code that uses the User instances will be cleaner.

Compromise:

When using Linq2SQL, the code generated for the entity User will have an EntitySet<Book> which is lazy-loaded when you access it. This should keep your code clean and the User instance small (memory-footprint-wise).

ytoledano
  • Assuming some sort of caching, preloading benefit would be null. I haven't used Cassandra/HBase so can't speak about them but Linq2SQL is a very specific case (although I don't see how lazy loading will prevent the infinite chaining case even in this specific case, and in the general case) – 0fnt Jun 05 '15 at 16:03
  • In the Linq2SQL example you really get no performance benefit, just cleaner code. When getting one-to-many entities from a document store like Cassandra/HBase, the vast majority of the processing time is spent finding the record, so you might as well get all the many entities while you're there (the books, in this example). – ytoledano Jun 05 '15 at 20:51
  • Are you sure? Even if I store Book and Users separately normalized? To me it looks like it should only be network latency extra cost. In any case, how does one handle the RDBMS case generically? (I've edited the question to mention that clearly) – 0fnt Jun 06 '15 at 01:30
1

Short and simple rule of thumb:

IDs are used in DTOs.
Object references are usually used in the Domain Logic/Business Logic and UI layer objects.

That's the common architecture in larger, enterprisey-enough projects. You'll have mappers that translate back and forth between these two kinds of objects.
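A minimal sketch of that split (all class names hypothetical): the DTO carries only the raw id, the domain object carries a real reference, and a mapper resolves the id through a lookup.

```java
import java.util.Map;

public class DtoMapping {
    // DTO layer: plain data, references by id only (trivial to serialize).
    public static final class BookDto {
        public final int id;
        public final int loanedToUserId;
        public BookDto(int id, int loanedToUserId) {
            this.id = id;
            this.loanedToUserId = loanedToUserId;
        }
    }

    // Domain layer: holds a real object reference.
    public static final class User {
        public final int id;
        public final String name;
        public User(int id, String name) { this.id = id; this.name = name; }
    }

    public static final class Book {
        public final int id;
        public final User loanedTo;
        public Book(int id, User loanedTo) { this.id = id; this.loanedTo = loanedTo; }
    }

    // Mapper: resolves the id into a reference. Here the lookup is an
    // in-memory map; in practice it would be a repository or cache.
    public static Book toDomain(BookDto dto, Map<Integer, User> users) {
        return new Book(dto.id, users.get(dto.loanedToUserId));
    }

    public static void main(String[] args) {
        Map<Integer, User> users = Map.of(7, new User(7, "Ada"));
        Book book = toDomain(new BookDto(1, 7), users);
        System.out.println(book.loanedTo.name); // Ada
    }
}
```

Serialization stays recursion-free because only the DTOs ever cross the wire or hit the database, while in-memory business code gets the convenience of direct references.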

herzmeister
  • Thank you for stopping by and answering. Unfortunately, while I do understand the distinction thanks to the wiki link, I have never seen this in practice (granted I've never worked with large long-term projects). Would you have an example where the same object were represented in two ways for two different purposes? – 0fnt Jun 09 '15 at 12:42
  • here is an actual question concerning mapping: https://stackoverflow.com/questions/9770041/dto-to-entity-mapping-tool -- and there are critical articles like this: http://rogeralsing.com/2013/12/01/why-mapping-dtos-to-entities-using-automapper-and-entityframework-is-horrible/ – herzmeister Jun 09 '15 at 17:33
  • Really helpful, thanks. Unfortunately I still don't understand how would loading data with circular refernces work? e.g. if a User refers a Book and the Book refers the same user, how would you create this object? – 0fnt Jun 09 '15 at 17:45
  • Look into the [Repository pattern](http://martinfowler.com/eaaCatalog/repository.html). You'll have a `BookRepository` and a `UserRepository`. You'll always call `myRepository.GetById(...)` or similar, and the repository will either create the object and load its values from a data store, or get it from a cache. Also, child objects are mostly lazy loaded, which also prevents having to deal with direct circular references at construction time. – herzmeister Jun 10 '15 at 11:15