I have read both this thread, "Elasticsearch and PostgreSQL combination", and this one, "Elasticsearch and relational database combination", but I could not come to an answer.
Let's say I have to re-create a search engine for MDPI (https://www.mdpi.com/). They provide dumps in JSON format, in which each record is a scientific article. One of the fields is "journal", and in every record the journal's title, founding year, impact factor, and ID are repeated.
From an Elasticsearch perspective this makes a lot of sense, since it is designed to search across JSON fields. But I am worried about redundancy: since there are only a few journals and millions of articles, wouldn't it make more sense to store only the journal "id" in each article and keep the remaining journal information in a second, much smaller dataset managed by an RDBMS?
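To make the redundancy concrete, here is a minimal sketch of what I mean, with hypothetical field names and values (I do not know MDPI's actual schema):

```python
# Hypothetical field names/values; every article embeds the full journal record.
article_1 = {
    "title": "An article about X",
    "abstract": "...",
    "journal": {"id": 42, "title": "Journal A", "founded": 1999, "impact_factor": 3.1},
}
article_2 = {
    "title": "An article about Y",
    "abstract": "...",
    # The same four journal fields, repeated verbatim across millions of articles.
    "journal": {"id": 42, "title": "Journal A", "founded": 1999, "impact_factor": 3.1},
}
```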
So I came up with two designs:
Preserving the original JSON files:
- Pro: only one system to run (Elasticsearch)
- Con: the journal data is duplicated in every article record (see the indexing sketch below)
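In code, this first design just indexes each dump record unchanged, for instance:

```python
# Design 1 sketch: index the dump record as-is (the "articles" index name
# is my choice, not MDPI's). The redundant journal fields go into every document.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.index(index="articles", id="article-1", document=article_1)  # article_1 from above
```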
Splitting the dump into two (or more) datasets, keeping the journal data in an RDBMS, and using Elasticsearch only to search the "title" and "abstract" fields:
- Pro: the best of both worlds. Elasticsearch does what it is best at, full-text search, and once the records are retrieved and ranked, their journal IDs can be used to fetch the remaining data from the RDBMS (see the sketch after this list).
- Con: two systems to set up, which means scaling and operating both, and more opportunities for bottlenecks.
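Here is a minimal sketch of how I imagine the second design's read path, assuming an "articles" index whose documents keep only a "journal_id" field, a PostgreSQL table journals(id, title, founded, impact_factor), and the official elasticsearch and psycopg2 Python clients (all names are my assumptions):

```python
from elasticsearch import Elasticsearch
import psycopg2

es = Elasticsearch("http://localhost:9200")
pg = psycopg2.connect(dbname="mdpi", user="postgres")

def search_articles(query, size=10):
    # Step 1: full-text search in Elasticsearch over "title" and "abstract";
    # each hit carries only the journal ID, not the duplicated journal data.
    resp = es.search(
        index="articles",
        query={"multi_match": {"query": query, "fields": ["title", "abstract"]}},
        size=size,
    )
    hits = [h["_source"] for h in resp["hits"]["hits"]]
    if not hits:
        return []

    # Step 2: resolve the few distinct journal IDs against the RDBMS.
    journal_ids = tuple({h["journal_id"] for h in hits})
    with pg.cursor() as cur:
        cur.execute(
            "SELECT id, title, founded, impact_factor FROM journals WHERE id IN %s",
            (journal_ids,),  # psycopg2 expands a tuple into (1, 2, 3)
        )
        journals = {row[0]: row[1:] for row in cur.fetchall()}

    # Re-attach journal details to the ranked hits before returning them.
    return [{**h, "journal": journals.get(h["journal_id"])} for h in hits]
```

The extra lookup per query is exactly the kind of cost I listed under the cons above.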
Which of the two designs above would be good practice (or, please, describe a third one that is even better)? Is there a best practice?