How to track change of JSON data over time for large number of entities?

Question

I have a system that checks the status of a large number of entities on a schedule, once every minute. For each entity, there is a JSON object whose fields indicate the statuses of its different attributes. The system dumps the resulting JSON files on a network share.

Each run of the schedule generates one JSON file with roughly 20k entities like the ones below, each having tens of attributes.

[
    {
        "entityid": 12345,
        "attribute1": "queued",
        "attribute2": "pending"
    },
    {
        "entityid": 34563,
        "attribute1": "running",
        "attribute2": "successful"
    }
]

I need to be able to track how the attribute statuses of the entities change over time, for instance to answer questions like "When did the status of entity x become "pending"?". What is the best way to store this data and generate the stats?

You need more than just a "these entities just changed their states in this way" notification, right? If so, how much history do you need to retain? — Daniel Griscom, Dec 01 '18 at 14:14
How long does this stuff have to live (especially subsequent JSON file updates)? To me, this shouts "Put it in a database" — Jan Doggen, Aug 24 '20 at 08:17
Is this something that deep object diffing, like [deep-diff](https://github.com/flitbit/diff) (JavaScript) or [deepdiff](https://pypi.org/project/deepdiff/) (Python), could help with? — Ahmed Fasih, May 20 '21 at 04:45

score 0 · Answer 1 · answered Dec 02 '18 at 18:58

Overview

I think you can solve your problem in a relatively simple 3-step process:

1. Given two (consecutive) snapshots of the state of your entities, determine the changes between them.
2. Repeat this step until all (available) snapshots are processed and store the changes somewhere.
3. Query the stored changes for something that is of interest to you.

Finding the Changes

I would most likely create a hash table for the first snapshot. There are a few options here:

- Use the Id to map to a data structure that holds your entity's values.
- Use an (Id, AttributeName) tuple as the key to map to the values directly. Depending on your language, this might only make sense if all values are of the same type.
- Do the same as above, but use one hash table for each type of attribute.

Now you turn the second snapshot into a hash table and compare it to the first. When you've found and stored all changes, you discard the first hash table (but keep the second one) and repeat this procedure with snapshots two and three - and so on...
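The pairwise comparison described above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: it uses the first option (Id mapped to the entity's values), the function names `index_by_id`, `diff_snapshots` and `collect_changes` are hypothetical, and the timestamps in the sample data are made up for the example.

```python
def index_by_id(snapshot):
    """Build a hash table mapping entityid -> entity dict."""
    return {entity["entityid"]: entity for entity in snapshot}


def diff_snapshots(old_table, new_table, timestamp):
    """Return change tuples for entities that exist in both tables."""
    changes = []
    for entity_id, new_entity in new_table.items():
        old_entity = old_table.get(entity_id)
        if old_entity is None:
            continue  # newly added entity; handle separately if needed
        # Look at every attribute name that occurs in either version.
        for name in sorted(old_entity.keys() | new_entity.keys()):
            if name == "entityid":
                continue
            old_value = old_entity.get(name)
            new_value = new_entity.get(name)
            if old_value != new_value:
                changes.append((timestamp, entity_id, name, old_value, new_value))
    return changes


def collect_changes(snapshots):
    """snapshots: (timestamp, list_of_entities) pairs in chronological order."""
    all_changes = []
    previous = None
    for timestamp, snapshot in snapshots:
        current = index_by_id(snapshot)
        if previous is not None:
            all_changes.extend(diff_snapshots(previous, current, timestamp))
        previous = current  # discard the older table, keep the newer one
    return all_changes


# Made-up sample data in the shape shown in the question.
snapshots = [
    ("2018-12-02T10:00", [
        {"entityid": 12345, "attribute1": "queued", "attribute2": "pending"},
        {"entityid": 34563, "attribute1": "running", "attribute2": "successful"},
    ]),
    ("2018-12-02T10:01", [
        {"entityid": 12345, "attribute1": "running", "attribute2": "pending"},
        {"entityid": 34563, "attribute1": "running", "attribute2": "successful"},
    ]),
]
changes = collect_changes(snapshots)
```

Only one hash table besides the current one is alive at any time, so memory use stays at about two snapshots regardless of how many files are processed.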

Storing the Changes

Each change you find can be represented by a tuple such as (Time, EntityId, AttributeName, OldValue, NewValue). Depending on what you'd like to query, you may not need all of these fields.

Once you've found the changes, the question becomes where to store them. A database seems like the ideal solution. If you have enough memory and don't need to persist the changes, you can use an in-memory DB.

The database will provide all the features to make querying easy and efficient. In particular, you'll have an established query-language and can create the relevant indices.
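As one possible illustration, the change tuples could be stored in SQLite, which ships with Python and also offers an in-memory mode. SQLite itself, the table and index names, the helper `times_when_became`, and the sample rows are all assumptions made for this sketch; any relational database would work along the same lines.

```python
import sqlite3

# ":memory:" keeps everything in RAM; pass a file path to persist instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE changes (
        time      TEXT NOT NULL,     -- ISO 8601 strings sort chronologically
        entity_id INTEGER NOT NULL,
        attribute TEXT NOT NULL,
        old_value TEXT,
        new_value TEXT
    )
    """
)
# Index chosen for "when did attribute A of entity X become V?" lookups.
conn.execute(
    "CREATE INDEX idx_changes_lookup "
    "ON changes (entity_id, attribute, new_value, time)"
)

# Made-up change tuples: (Time, EntityId, AttributeName, OldValue, NewValue).
sample_changes = [
    ("2018-12-02T10:03", 12345, "attribute2", "queued", "pending"),
    ("2018-12-02T10:07", 12345, "attribute2", "pending", "successful"),
    ("2018-12-02T10:03", 34563, "attribute1", "running", "finished"),
]
conn.executemany("INSERT INTO changes VALUES (?, ?, ?, ?, ?)", sample_changes)
conn.commit()


def times_when_became(conn, entity_id, attribute, value):
    """Return every time the given attribute of the entity changed to value."""
    rows = conn.execute(
        "SELECT time FROM changes "
        "WHERE entity_id = ? AND attribute = ? AND new_value = ? "
        "ORDER BY time",
        (entity_id, attribute, value),
    )
    return [row[0] for row in rows]


pending_times = times_when_became(conn, 12345, "attribute2", "pending")
```

With the sample rows above, `pending_times` holds the single timestamp at which entity 12345's `attribute2` became "pending".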

Added and Removed Entities

If the set of monitored entities remains constant, you can find all differences by simply iterating over one hash table's keys and comparing the key's values in both tables.

However, when entities may be added and removed, it may be helpful to deal with each case (added, changed, removed) separately.

Added entities can be easily found while building the new hash table. Simply check whether the entity already existed in the old one.
Removed entities can be found together with changed entities while iterating over the entities in the old table.

Alternatively, you can of course use the intersect/complement operations on your key-sets.
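In Python, for example, the key-set variant could look like the sketch below. The function name `classify_entities` and the sample tables are hypothetical; the tables are assumed to map entity Ids to attribute dictionaries.

```python
def classify_entities(old_table, new_table):
    """Split entity Ids into added, removed and changed sets."""
    old_ids = old_table.keys()
    new_ids = new_table.keys()
    added = new_ids - old_ids      # only in the new snapshot
    removed = old_ids - new_ids    # only in the old snapshot
    # Present in both snapshots, but with different attribute values.
    changed = {
        entity_id
        for entity_id in old_ids & new_ids
        if old_table[entity_id] != new_table[entity_id]
    }
    return added, removed, changed


# Made-up sample tables.
old_table = {
    1: {"attribute1": "queued"},
    2: {"attribute1": "running"},
    3: {"attribute1": "running"},
}
new_table = {
    2: {"attribute1": "finished"},
    3: {"attribute1": "running"},
    4: {"attribute1": "queued"},
}
added, removed, changed = classify_entities(old_table, new_table)
```

Dictionary key views support set operations directly, so no extra copies of the key sets are needed before the comparison.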

Pieter B · Answer 2 · 2021-05-20T12:01:26.983

You could work with versioning and immutability. With this approach, you never modify an entity; instead, you create a new version of it whenever it changes. The current state of an entity is the record with its entityid and the highest version. Clean up old versions once an entity is fully out of scope:

[
    {
        "entityid": 12345,
        "attribute1": "queued",
        "attribute2": "pending",
        "version": "1",
        "created": "17:25"
    },
    {
        "entityid": 12345,
        "attribute1": "running",
        "attribute2": "successful",
        "version": "2",
        "created": "17:48"
    },
    {
        "entityid": 34563,
        "attribute1": "running",
        "attribute2": "successful",
        "version": "1",
        "created": "17:20"
    },
    {
        "entityid": 34563,
        "attribute1": "finished",
        "attribute2": "successful",
        "version": "2",
        "created": "17:47"
    }
]
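A rough sketch of how such version records might be produced from the per-minute snapshots is shown below. The function `append_new_versions` and the in-memory `history` dictionary are assumptions made for illustration; the field names `version` and `created` follow the example above.

```python
META_FIELDS = ("entityid", "version", "created")


def append_new_versions(history, snapshot, created):
    """Append a new immutable version for every entity whose attributes changed.

    history maps entityid -> list of version records, oldest first.
    """
    for entity in snapshot:
        versions = history.setdefault(entity["entityid"], [])
        attributes = {k: v for k, v in entity.items() if k != "entityid"}
        if versions:
            latest = versions[-1]
            latest_attributes = {
                k: v for k, v in latest.items() if k not in META_FIELDS
            }
            if latest_attributes == attributes:
                continue  # nothing changed, so no new version is written
        versions.append({
            "entityid": entity["entityid"],
            **attributes,
            "version": str(len(versions) + 1),
            "created": created,
        })


history = {}
append_new_versions(
    history,
    [{"entityid": 12345, "attribute1": "queued", "attribute2": "pending"}],
    "17:25",
)
append_new_versions(
    history,
    [{"entityid": 12345, "attribute1": "running", "attribute2": "successful"}],
    "17:48",
)
# An unchanged snapshot does not create a third version.
append_new_versions(
    history,
    [{"entityid": 12345, "attribute1": "running", "attribute2": "successful"}],
    "17:49",
)
```

Existing records are never modified, so the `created` value of each version directly answers when an entity entered a given state.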

For completeness' sake, this answer is essentially using an **event sourcing** approach. There's a lot of online resources on this topic if you look for it. — Flater, May 20 '21 at 07:47
