
I'm starting the modeling of the following problem:

I have many clients (millions) that interact in a network, forming a graph. At the highest granularity of my business problem, each relationship has 3 attributes.

In the "realistic" worst case scenario, the nodes compose a unique mesh, having relationships with ~100 other nodes, but with one aggravating: these relationships will be often bidirectional and there should also be splited by the different possible values of one of the 3 attributes (namely, the Region the client operates). This leads me to the following sizing:

2 directions * 6 regions * 100 nodes = 1200 edges per node

It gets worse. One of the requirements is that this database contain a monthly history covering the last 5 years, in case the user ever wants to review a past score of any of the nodes. This means I also have to keep one entry per month for each relationship, which gives the following sizing:

2 directions * 6 regions * 100 nodes * 60 months = 72000 edges per node
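For reference, the same back-of-the-envelope arithmetic as a script (the client count is a placeholder, since all I can say is "millions"):

```python
# Worst-case sizing from the figures above.
DIRECTIONS = 2        # relationships are often bidirectional
REGIONS = 6           # edges are split per Region attribute value
NEIGHBOURS = 100      # ~100 related nodes per node
MONTHS = 5 * 12       # monthly history over 5 years

edges_per_node = DIRECTIONS * REGIONS * NEIGHBOURS       # 1200
edges_per_node_history = edges_per_node * MONTHS         # 72000

CLIENTS = 2_000_000   # assumed; the real figure is just "millions"
# Every directed edge is incident to exactly two nodes, so halve the product.
total_edges = CLIENTS * edges_per_node_history // 2      # 7.2e10
print(edges_per_node, edges_per_node_history, total_edges)
```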

We are inclined to use a Graph Database to solve this problem, but the question is: is this even feasible with the graph technologies currently available?

Henrique Barcelos
  • You might use a mixed database, like here: http://programmers.stackexchange.com/questions/50043/is-orientdb-document-database-or-graph-database?rq=1 which would allow you to store the history in a different way and keep only the current state in the graph. – Luc Franken Jun 28 '16 at 16:30
  • Unfortunately, our methodology eliminated OrientDB before we could even get started with it. Maybe I'll look into another mixed-database implementation =] – Henrique Barcelos Jun 30 '16 at 03:06
  • There are many of them; try for example https://www.arangodb.com or Neo4j, which is a more stable graph database. You don't even need them in the same database: you could keep the history in something like a relational database, or for example MongoDB if you have a lot of data. – Luc Franken Jun 30 '16 at 07:54

3 Answers


Your problem of historical data is in fact quite common: business data is often tied to time.

Snapshot approach

One way to handle this is to take snapshots. It's the solution proposed by @candied_orange, but it also seems to be your own assumption: in your sizing you expect a distinct occurrence of each combination for each month (which is exactly the snapshot approach), except that you intend to keep the snapshots in the database.

As an example, one leading business software package manages sales figures this way: for every month, it sums up that month's sales for each combination of customer category, product group, and region. That way it can produce sales figures according to those criteria for any given time period, without having to read the millions and millions of underlying transactions.
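A minimal sketch of that pre-aggregation idea, with invented field names and figures for illustration:

```python
from collections import defaultdict

# Raw transactions: (month, customer_category, product_group, region, amount).
transactions = [
    ("2016-06", "retail",    "hardware", "EU", 120.0),
    ("2016-06", "retail",    "hardware", "EU",  80.0),
    ("2016-07", "wholesale", "software", "US", 500.0),
]

# One summary row per (month, category, group, region) combination, so that
# reports never have to rescan the millions of underlying transactions.
summary = defaultdict(float)
for month, category, group, region, amount in transactions:
    summary[(month, category, group, region)] += amount

print(summary[("2016-06", "retail", "hardware", "EU")])  # 200.0
```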

In fact, there is no difference here between data and historical data. From a business point of view, the months are an integral part of the managed data. And it is frequent to compare the sales figures of one year to the previous year, but also one month to the previous month, or the first quarter of this year with the first quarters of the last 2 years.

Date-of-validity approach (time-dependent versioning)

But the snapshot approach does not always suit the need. Very often, data can be time-dependent. For example, salesman X can be responsible for one region from March 1st to August 18th and for another region from August 19th to today. So which region should be kept for him for August? Neither is more exact than the other.

For graph relationships it's even worse, because there are time-dependent aspects in every piece of data and in every relation. Let's take an example: Manager A is responsible for department X from January 20th to September 20th. Employee B is assigned to department X from last year until January 22nd, and employee C has been assigned to that department since he joined the company and is still there. So whom does Manager A manage? It depends entirely on the date.
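To make that concrete, here is a rough sketch of the example as validity intervals and an "as-of" query; the exact years are assumptions, since the example only gives days and months:

```python
from datetime import date

# (person, department, valid_from, valid_to); valid_to=None means "still valid".
managers = [
    ("A", "X", date(2016, 1, 20), date(2016, 9, 20)),
]
assignments = [
    ("B", "X", date(2015, 6, 1), date(2016, 1, 22)),
    ("C", "X", date(2014, 3, 1), None),
]

def active(rows, on):
    """Keep only the rows whose validity interval contains the date `on`."""
    return [r for r in rows if r[2] <= on and (r[3] is None or on <= r[3])]

def managed_by(manager, on):
    departments = {dept for who, dept, *_ in active(managers, on) if who == manager}
    return [who for who, dept, *_ in active(assignments, on) if dept in departments]

print(managed_by("A", date(2016, 1, 21)))  # ['B', 'C']
print(managed_by("A", date(2016, 3, 1)))   # ['C'] -- B left X on January 22nd
```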

Now, what's the size of the database if you compare the time-dependent approach with snapshots? For the snapshot approach you have to clone every relevant value every month, so for the 3 persons and 1 department of our example you'll have 36 records over a year, and you are still not able to hold the full truth. For the time-dependent approach, you have your 3 raw records plus additional records only when a change occurs, so 6 records overall.

Time-dependent versioning of data hence represents the real world very accurately, and it is space-efficient. But it makes the design of queries a lot more complex.

To avoid a complete query nightmare, you can work by default with the current version of each object (e.g. its end-of-validity date is empty), and create time-limited clones (with start and end validity dates set) every time you change important data. You would then access these clones and their validity dates only for the couple of time-dependent queries that require them.
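One possible shape for that convention, sketched with plain Python objects; a real implementation would of course enforce this in the database schema:

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional

@dataclass
class Record:
    key: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None   # None means "current version"

history: List[Record] = []            # closed, time-limited clones

def update(current: Record, new_value: str, on: date) -> Record:
    """Close the current version as a clone, then open a new current one."""
    history.append(replace(current, valid_to=on))
    return Record(current.key, new_value, valid_from=on)

row = Record("salesman-X", "North", valid_from=date(2016, 3, 1))
row = update(row, "South", on=date(2016, 8, 19))

# Default queries touch only rows where valid_to is None (cheap);
# time-dependent queries additionally filter `history` by date range.
```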

Support by database systems

For most database systems, you have to take care of the time-dependent aspects yourself. Time dependency of data elements must be identified from the start, and it's hard, case-by-case design work.

But as already said, it's a common challenge, so fortunately there is some support around.

Christophe

I once solved the history problem for a database by simply making copies of it.

I made a copy of the database every night for a week, every week for a month, every month for a year, and kept a backup from every year. I did it all with a simple script. It solved the history problem, kept the size reasonable, and kept the database clean. This comes with limitations of course, but it's such a simple way to satisfy the history requirement that you should consider it before you overdesign something you don't really need.

The nice thing about the copy-the-DB approach, besides not being huge (66 copies + 1 per year), is that everything is still linked to what it was linked to back when it had that value. The drawbacks are that it's lossy (transient values may be missed) and isolated (you can't run queries that join the present and the past). There are other approaches as well.
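For what it's worth, the rotation schedule can be a few lines of script. The sketch below assumes a single database file and plain file copies, where a real setup would call the database's own dump tool:

```python
import datetime
import shutil

def nightly_backup(db_file: str, backup_dir: str) -> None:
    """Rotating copies: 7 dailies, ~5 weeklies, 12 monthlies, 1 per year.
    Each slot name is reused, so older copies age out automatically."""
    today = datetime.date.today()
    slots = [f"daily-{today.weekday()}"]                # reused every 7 days
    if today.weekday() == 6:                            # Sunday
        slots.append(f"weekly-{(today.day - 1) // 7}")  # reused every ~5 weeks
    if today.day == 1:
        slots.append(f"monthly-{today.month:02d}")      # reused every 12 months
        if today.month == 1:
            slots.append(f"yearly-{today.year}")        # kept forever
    for slot in slots:
        shutil.copy(db_file, f"{backup_dir}/{slot}.db")
```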

candied_orange
  • Yeah, that was our first thought, but the business analysts state that the monthly states cannot be isolated from each other, since the state in Jul '16 affects the state in Aug '16, Sep '16, and so on. I am inclined to declare that this can't be done. Our rough calculations show that this database would contain 116 GiB of data per month. – Henrique Barcelos Jun 29 '16 at 14:14
  • Many things can't be done. What people really want usually can be done, if you understand it. I submit that you haven't made the requirements clear. What is a "monthly history of the past 5 years" if "history cannot be isolated"? Are you sure these are the same requirement? – candied_orange Jun 29 '16 at 17:01
  • Be careful with square one. Of all the places it can lead, it most often leads to one of two places: the exact same mess or an even bigger mess. It's easy to look at a failing project and focus only on what's wrong. A lot of work has been done at this point. Spend some time figuring out what isn't broken, or you'll find out when you break that too. – candied_orange Jun 29 '16 at 17:40

A bit late to the party, but maybe this would be of use to someone who recently had this same question and found this thread:

https://github.com/CivicGraph/CivicGraph

Disclaimer: I am the author/maintainer of CivicGraph, and I built it to solve this very problem of retaining and querying history in graph databases.