Metadata evolution considerations?

Question

As an application evolves, so does its metadata. What are the top considerations for transitioning to a more refined metadata schema?

Here are two example cases, which in my opinion are quite archetypal.

Case 1. The new metadata is more structured and detailed. For example, a Location field goes from free text to Country, City, and the rest of the street address. (This is just an example; imagine for a moment that there are no gazetteer services that do a decent job of converting free text to a structured location.)

Case 2. The new metadata has attributes that were previously expressed informally in another attribute. For example, Employment Info (previously free text) needs more precise attributes of its own: Job Category (controlled vocabulary) and Type of Contract (also controlled vocabulary). The original Employment Info can carry additional meaning, so it also stays.

Problems arise when "aligning ontologies" during integration with other systems. As I see it, there are three forces here. One is that the end-user input UI needs to be kept minimal. (Given the lack of a de facto standard in the HR domain and several systems to integrate with, there may be a lot of similar metadata to enter.) Another is that metadata consumers need the data in a particular format and type, which does not always match that of the source system. A third, weaker force is keeping the alignment process computationally efficient.

The cases also apply to the evolution of a single system. Today it's enough to have an informal text field; tomorrow it needs to be extended with semantically strict attributes. At the same time, the old informal usage may still be enough for many users.

What would be a good approach, and what are the most important considerations?

score 1 · Accepted Answer · answered May 11 '18 at 16:13

I'd follow these principles:

Keep (meta)data format expandable; do not force expansion.
Allow verification and sanity checks.
Allow and encourage comments.
Always version the data.

Point by point, from last (easiest) to first (hardest).

Always version the data

Your metadata representation will evolve. Version each such change (e.g. using semver). This prevents the problems of guessing what format the data is really in, and how to correctly interpret it when different versions allow for different interpretations.

Your implementation(s) should be ready to interpret a range of versions, and handle older versions' data which are inevitably less complete and detailed.

If possible, allow (but not enforce) version info in each node of the metadata document. This allows for gradual migration of parts that need the new version's features without a forced migration of the rest.

This is problematic if you don't have a single versioning authority, and several independent users create diverging versions of metadata formats. This can also be solved, but for simplicity it's best avoided.
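As a rough illustration of per-node versions, here is a minimal Python sketch. It assumes the metadata has already been parsed into nested dicts and that a node without its own "version" inherits the version of its parent; the names `parse_version`, `effective_versions`, and the supported range are hypothetical, not part of any real library.

```python
def parse_version(text):
    """Turn '2.0.1' into (2, 0, 1); pad short versions so '1.2' equals '1.2.0'."""
    parts = [int(piece) for piece in text.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def effective_versions(node, inherited, path=""):
    """Map each node path to the version that governs it.

    A node's own "version" wins; otherwise the parent's version applies
    (this inheritance rule is an assumption of the sketch).
    """
    own = node.get("version")
    current = parse_version(own) if own is not None else inherited
    result = {path or "/": current}
    for key, child in node.items():
        if isinstance(child, dict):
            result.update(effective_versions(child, current, f"{path}/{key}"))
    return result


# Hypothetical reader limits: understands 1.0.0 up to, but not including, 3.0.0.
OLDEST_SUPPORTED = (1, 0, 0)
FIRST_UNSUPPORTED = (3, 0, 0)

document = {
    "version": "1.5.7",
    "foo": {"version": "2.0.1", "value": "bar"},  # already migrated
    "moo": {"value": "MOO!"},                     # inherits 1.5.7
}

versions = effective_versions(document, inherited=OLDEST_SUPPORTED)
unsupported = [path for path, version in versions.items()
               if not OLDEST_SUPPORTED <= version < FIRST_UNSUPPORTED]
```

Comparing versions as integer tuples avoids the string-comparison trap where "1.10" sorts before "1.9".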

Allow and encourage comments

Let people add free-form context to the formal representation. Loss of such context leads to second-guessing and horrible mistakes as the data are edited and/or migrated to newer representations.

Encourage adding comments in your metadata-editing tool. Restrict the amount and format of the comments as little as possible.

JSON is a notoriously bad config format because it lacks comments; if it is used, allow an extra "comment" field in any object.

Note that comments should normally be accessible as part of the metadata, so that any tools that evolve can easily see and update them.
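To make that concrete, here is a small Python sketch of a migration step that carries the "comment" field along and appends to it, instead of dropping it. The field names, the sample value, and the `rename_field` helper are hypothetical; only the standard `json` module is used.

```python
import json


def rename_field(document, old_name, new_name):
    """Return a copy of the document with one field renamed.

    The node's comment is preserved and extended with a note about
    the rename, so the context survives the migration.
    """
    migrated = dict(document)
    if old_name in migrated:
        node = dict(migrated.pop(old_name))
        note = f"Renamed from '{old_name}'."
        existing = node.get("comment", "")
        node["comment"] = f"{existing} {note}".strip()
        migrated[new_name] = node
    return migrated


raw = '{"location": {"comment": "Free text; mostly city names.", "value": "Riga"}}'
doc = json.loads(raw)
migrated = rename_field(doc, "location", "address")
print(json.dumps(migrated, indent=2))
```

Because the comment is ordinary data, it round-trips through any JSON tool without special handling.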

Allow verification and sanity checks

As much as your problem domain allows, add an automatic tool that checks the metadata and detects common problems. Check for inconsistent values, inconsistent names, and spelling errors (a particularly nasty problem in stringly-typed data, which most metadata are). If you have some formal rules (a kind of schema), check for violations of them (e.g. number format and ranges, date format, empty values, any references to other nodes, etc).

Most likely you will have to allow some violations to stay, until a better way to express something is implemented. But you want to make them easily visible.
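A checker in this spirit might look like the Python sketch below. It only reports problems and never rejects a record, so known violations can stay visible until they are fixed. The field names and the `lint` function are hypothetical; `difflib.get_close_matches` and `date.fromisoformat` are from the standard library.

```python
import difflib
from datetime import date

# Hypothetical set of field names the current schema knows about.
KNOWN_FIELDS = {"location", "employment_info", "start_date"}


def lint(record):
    """Return a list of warnings; never raises and never rejects the record."""
    warnings = []
    for name, value in record.items():
        if name not in KNOWN_FIELDS:
            # Likely a misspelled name: suggest the closest known field.
            close = difflib.get_close_matches(name, sorted(KNOWN_FIELDS), n=1)
            hint = f" (did you mean '{close[0]}'?)" if close else ""
            warnings.append(f"{name}: unknown field{hint}")
        if isinstance(value, str) and not value.strip():
            warnings.append(f"{name}: empty value")
    start = record.get("start_date")
    if isinstance(start, str) and start.strip():
        try:
            date.fromisoformat(start)
        except ValueError:
            warnings.append(f"start_date: not an ISO date: {start!r}")
    return warnings


record = {"locaton": "Riga", "employment_info": " ", "start_date": "2018-13-01"}
for warning in lint(record):
    print(warning)
```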

Keep (meta)data format expandable; do not force expansion

Pick a format which is easy to expand and understand. Most likely it's going to be a tree (XML, YAML, S-expressions, JSON, ucf, etc) with primitive attributes, lists, and other trees as nodes.

I would only allow "value", "version", and "comment" to carry primitive values, and always combine them in a subtree / "object" under the "real" name:

foo {
  comment: the metasyntactic variable. # The comment to "foo".
  version: 1.2  # The version of the schema for "foo"; optional.
  value: bar # The value we store.
}

Eventually you'll want to add more structure:

foo {
  comment: ...
  version: 2.0
  value: bar
  constraints {
    allow-empty: false
    min-length: 3
    all-lowercase: true
  }
}
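For illustration, a node like this could be checked with a few lines of Python. The sketch assumes the node has been parsed into a dict and that the three constraint names mean what they suggest; those semantics and the `check_constraints` helper are my assumptions for the example.

```python
def check_constraints(node):
    """Return the names of the constraints that the node's value violates."""
    value = node.get("value", "")
    rules = node.get("constraints", {})  # older nodes may have no constraints
    violations = []
    if not rules.get("allow-empty", True) and value == "":
        violations.append("allow-empty")
    if len(value) < rules.get("min-length", 0):
        violations.append("min-length")
    if rules.get("all-lowercase", False) and value != value.lower():
        violations.append("all-lowercase")
    return violations


rules = {"allow-empty": False, "min-length": 3, "all-lowercase": True}
foo = {"version": "2.0", "value": "bar", "constraints": rules}
bad = {"version": "2.0", "value": "MO", "constraints": rules}
old = {"version": "1.2", "value": "MOO!"}  # version 1.x node without constraints

print(check_constraints(foo), check_constraints(bad), check_constraints(old))
```

Nodes in the older format simply have nothing to violate, which is what lets them coexist with the stricter ones.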

Then you will want to factor out common parts. Allow for some kind of substitution that preserves the correctness of the format, and likely a service namespace:

@define {
  @version {  # Only applicable to nodes with this version range.
    from: 2.0
    to: 3.1
    constraint var-name {
      allow-empty: false
      min-length: 3
      all-lowercase: true
    }
  }
}
...
foo {
  comment: ...
  version: 2.0.1
  value: bar
  constraints {
    @var-name  # Refer to a factored-out definition.
    max-length: 10
  }
}
moo {
  comment: What the cow says. 
  version: 1.5.7  # Old version node, old format.
  value: MOO!
}

There are going to be a lot more details involved in such a piecemeal-evolvable document, but I still think it's a good enough starting point. The only factor that defines an architecture's longevity is its ability to change, they say.
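One way the substitution could be resolved is sketched below in Python. The dict layout, the "@use" key standing in for the `@var-name` reference, the inclusive version range, and the rule that a node's own keys override the shared ones are all assumptions made for the example, not a specification.

```python
def parse_version(text):
    """Turn '2.0.1' into (2, 0, 1); pad short versions so '2.0' equals '2.0.0'."""
    parts = [int(piece) for piece in text.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


# Factored-out definitions, each valid for a range of node versions.
DEFINITIONS = [
    {"name": "var-name", "from": "2.0", "to": "3.1",
     "constraints": {"allow-empty": False, "min-length": 3, "all-lowercase": True}},
]


def resolve_constraints(node, definitions):
    """Expand references to shared definitions, then apply the node's own keys."""
    version = parse_version(node["version"])
    raw = node.get("constraints", {})
    resolved = {}
    for name in raw.get("@use", []):
        for definition in definitions:
            low = parse_version(definition["from"])
            high = parse_version(definition["to"])
            if definition["name"] == name and low <= version <= high:
                resolved.update(definition["constraints"])
                break
        else:
            raise KeyError(f"no definition of {name!r} for version {node['version']}")
    resolved.update({key: val for key, val in raw.items() if key != "@use"})
    return resolved


foo = {"version": "2.0.1", "value": "bar",
       "constraints": {"@use": ["var-name"], "max-length": 10}}
moo = {"version": "1.5.7", "value": "MOO!"}  # old format, nothing to expand

print(resolve_constraints(foo, DEFINITIONS))
print(resolve_constraints(moo, DEFINITIONS))
```

Failing loudly when a reference has no definition for the node's version keeps a half-migrated document from silently losing its constraints.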

This is a good answer for a situation where you control the metadata (the application's own evolution). My question was also about how to satisfy integrations that need slightly different metadata. There is probably no better way than to ask users to fill in all the information for all the systems involved. For example, if third-party systems need schemas A1, A2 and A3 for attribute A, then the user will need to enter them all into our system. Or we may need a combined A123 fine-grained enough to hold A1, A2, A3... - Roman Susi, May 11 '18 at 19:05
