I have read both this thread, "Elasticsearch and PostgreSQL combination", and this one, "Elasticsearch and relational database combination", but I could not come to an answer.
Let's say I have to re-create a search engine for MDPI (https://www.mdpi.com/). They provide dumps in JSON format, in which each record is a scientific article. One of the fields is "journal", and in every record the journal's title, founding year, impact factor, and ID are repeated.
From an Elasticsearch perspective this makes a lot of sense, since it is designed to search across JSON fields. But I am worried about redundancy: since there are only a few journals and millions of articles, wouldn't it make more sense to store only the journal "id" in each article and keep the remaining journal information in a second, much smaller dataset managed by an RDBMS?
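To make the redundancy concrete, here is a minimal sketch of what I mean, with hypothetical field names and values (I do not know MDPI's actual schema):

```python
# Hypothetical field names/values; every article embeds the full journal record.
article_1 = {
    "title": "An article about X",
    "abstract": "...",
    "journal": {"id": 42, "title": "Journal A", "founded": 1999, "impact_factor": 3.1},
}
article_2 = {
    "title": "An article about Y",
    "abstract": "...",
    # The same four journal fields, repeated verbatim across millions of articles.
    "journal": {"id": 42, "title": "Journal A", "founded": 1999, "impact_factor": 3.1},
}
```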
So I came up with two designs:
Preserving the original JSON files:
- Pro: only one system to run (Elasticsearch)
- Con: the journal data is duplicated in every article record (see the indexing sketch below)
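In code, this first design just indexes each dump record unchanged, for instance:

```python
# Design 1 sketch: index the dump record as-is (the "articles" index name
# is my choice, not MDPI's). The redundant journal fields go into every document.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.index(index="articles", id="article-1", document=article_1)  # article_1 from above
```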
Splitting the dump into two (or more) datasets, keeping the journal data in an RDBMS, and using Elasticsearch only to search the "title" and "abstract" fields:
- Pro: the best of both worlds. Elasticsearch does what it is best at, full-text search, and once the records are retrieved and ranked, their journal IDs can be used to fetch the remaining data from the RDBMS (see the sketch after this list).
- Con: two systems to set up, which means scaling and operating both, and more opportunities for bottlenecks.
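Here is a minimal sketch of how I imagine the second design's read path, assuming an "articles" index whose documents keep only a "journal_id" field, a PostgreSQL table journals(id, title, founded, impact_factor), and the official elasticsearch and psycopg2 Python clients (all names are my assumptions):

```python
from elasticsearch import Elasticsearch
import psycopg2

es = Elasticsearch("http://localhost:9200")
pg = psycopg2.connect(dbname="mdpi", user="postgres")

def search_articles(query, size=10):
    # Step 1: full-text search in Elasticsearch over "title" and "abstract";
    # each hit carries only the journal ID, not the duplicated journal data.
    resp = es.search(
        index="articles",
        query={"multi_match": {"query": query, "fields": ["title", "abstract"]}},
        size=size,
    )
    hits = [h["_source"] for h in resp["hits"]["hits"]]
    if not hits:
        return []

    # Step 2: resolve the few distinct journal IDs against the RDBMS.
    journal_ids = tuple({h["journal_id"] for h in hits})
    with pg.cursor() as cur:
        cur.execute(
            "SELECT id, title, founded, impact_factor FROM journals WHERE id IN %s",
            (journal_ids,),  # psycopg2 expands a tuple into (1, 2, 3)
        )
        journals = {row[0]: row[1:] for row in cur.fetchall()}

    # Re-attach journal details to the ranked hits before returning them.
    return [{**h, "journal": journals.get(h["journal_id"])} for h in hits]
```

The extra lookup per query is exactly the kind of cost I listed under the cons above.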
Which of the two designs above would be good practice (or, please, describe a third one that is even better)? Is there a best practice?