
Describing the situation

I'm working on an application (based on the Spring Framework) that uses a search index (Lucene, if that matters) to make the application's content searchable. Documents are added to or updated in that index whenever content in the application changes, and deleted whenever the corresponding content is deleted.

We had a bug where the trigger to update a document on content changes did not fire in some cases. As a result, some of the documents contain invalid (outdated) values. The bug itself has been fixed, so future changes will be written to the index correctly.

However, I want to fix the invalid documents in the index and would like to know the best strategy to do so. Important conditions are:

  • Recalculating the complete index takes multiple hours, and the application is redeployed regularly as part of continuous deployment. It must therefore be expected that the application is shut down in the middle of the updating process.
  • Most of the documents are not invalid.
  • I'm not able to recognize invalid documents from the index alone. Doing so would require lots of information from a database.
  • The invalid values of the documents are not particularly important. The most relevant field (the name) was not affected by the bug, so even documents with invalid values work correctly in most use cases.
  • I would like a solution that will work for future issues too.

I think a similar case will occur if we extend the index in future versions, e.g. by adding a field. That would require us to update all documents to add the field, while the main use case of the index would keep working without it.

Possible solution

My idea is to add a version field to the documents. I would then add a job that runs every few minutes, fetches a batch of documents with an old version (or without any version, for the initial run), recalculates the required fields, sets the version field to the current version, and updates the documents in the index.
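A rough sketch of such a job against the plain Lucene API (the indexVersion and id field names and the recalculateDocument helper are placeholders for my actual schema; in the real application the method would run via Spring's @Scheduled):

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;

public class IndexVersionMigrator {

    private static final int CURRENT_VERSION = 2;
    private static final int BATCH_SIZE = 100;

    // Run every few minutes, e.g. with @Scheduled(fixedDelay = ...).
    // Returns true once there is nothing left to migrate.
    public boolean migrateBatch(IndexWriter writer) throws IOException {
        try (DirectoryReader reader = DirectoryReader.open(writer)) {
            IndexSearcher searcher = new IndexSearcher(reader);

            // Everything that does not carry the current version number;
            // this also matches documents written before the field existed.
            Query outdated = new BooleanQuery.Builder()
                    .add(new MatchAllDocsQuery(), Occur.MUST)
                    .add(IntPoint.newExactQuery("indexVersion", CURRENT_VERSION),
                         Occur.MUST_NOT)
                    .build();

            ScoreDoc[] hits = searcher.search(outdated, BATCH_SIZE).scoreDocs;
            for (ScoreDoc hit : hits) {
                String id = searcher.doc(hit.doc).get("id");

                // Placeholder: rebuild the document from the database.
                Document fresh = recalculateDocument(id);
                fresh.add(new IntPoint("indexVersion", CURRENT_VERSION));    // searchable
                fresh.add(new StoredField("indexVersion", CURRENT_VERSION)); // retrievable

                writer.updateDocument(new Term("id", id), fresh);
            }
            writer.commit();
            return hits.length == 0;
        }
    }

    private Document recalculateDocument(String id) {
        // Application-specific: load the entity and rebuild all index fields.
        throw new UnsupportedOperationException();
    }
}
```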

Pros of this solution:

  • If the update is interrupted, the application recognizes which documents are already fixed and which aren't.
  • This information is stored within the index, which is where it belongs (I had earlier ideas in which I would have stored the information in the database).

Con of this solution:

  • It will need to update every document, even those without invalid values.

My question

Is this a reasonable solution for the problem? Are there better approaches? I couldn't find anything on how to solve this, nor any indication that adding a version to your documents is a good idea.

Maybe I'm also overthinking the situation and a much simpler solution is possible?

lucash
  • Roughly how many records need to be fixed vs. how many do you have overall? How hard would it be to figure out which records are invalid? It seems to me like you might be able to figure out which records are invalid in a one-time background job, save that list to a file, then process those later during low-traffic periods / during a maintenance window. – Becuzz Aug 04 '17 at 13:23
  • Might be easier to simply regenerate your index. – Berin Loritsch Aug 04 '17 at 13:46
  • @Becuzz Absolute numbers are hard to give, as there are multiple installations, ranging from a few hundred up to hundreds of thousands of records. Fixes are required for roughly 1% of those records. The time-consuming part is loading the required data from the database, and that would need to be done to figure out which records are invalid, so the one-time background job runs into the same problem. – lucash Aug 04 '17 at 13:57
  • @BerinLoritsch And how would you do that considering the given conditions? If I clear the index and recalculate it completely, the search wouldn't be available for multiple hours, just to fix a bug that has only minor effects on the search. – lucash Aug 04 '17 at 14:00
  • @lucash How does loading the data from the database create the same problem? Is loading the data going to cause your application to grind to a halt? Can you load it in a separate process to do the work? Could you just use a database backup and another server / machine to do all that work? The idea would be to do this in a way that doesn't heavily impact your application so you can just fix it whenever / a little at a time rather than introduce downtime. – Becuzz Aug 04 '17 at 14:06
  • @Becuzz You are right, the load might be another issue. My main point was that restarts of the application would mean I don't know which records have already been updated and which still need an update. Doing the update with a separate application/on a different server would solve both issues (assuming the separate application just runs until the update is done), though it makes the topic a lot bigger than I had in mind. Thanks for your help! – lucash Aug 04 '17 at 14:29

3 Answers


TL;DR: You can do things faster.

I used to have a similar problem. It turned out that, with Lucene, rebuilding the entire index from scratch with code optimized for that purpose, using a SELECT field1, field2 FROM table query with no WHERE clause, was quite fast. The likely reason for the quick execution was that the Lucene-facing code didn't need to worry about parallel Lucene queries, and the SQL statement could use a sequential scan.

For this reason, we had indexOne and indexAll methods, separately written.

I'm not sure whether it's fast enough for your case, but you should take a hard look at the Lucene API and see what performance-improving options it offers: not synchronizing for every single item? Not caring about parallel queries? Rebuilding the index from scratch? I don't remember which features of the Lucene API gave us the great performance, because it was 8 years ago, but I do recall that indexAll performed much better than indexOne.
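I no longer have the original code, but a bulk rebuild along those lines has roughly this shape (the table, column names and tuning values are only illustrative):

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;

public class BulkIndexer {

    public void indexAll(Connection conn, Directory dir) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig();
        cfg.setOpenMode(IndexWriterConfig.OpenMode.CREATE); // rebuild from scratch
        cfg.setRAMBufferSizeMB(256);                        // flush less often

        try (IndexWriter writer = new IndexWriter(dir, cfg);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT id, field1, field2 FROM documents")) { // no WHERE: sequential scan
            while (rs.next()) {
                Document doc = new Document();
                doc.add(new StringField("id", rs.getString("id"), Field.Store.YES));
                doc.add(new TextField("field1", rs.getString("field1"), Field.Store.NO));
                doc.add(new TextField("field2", rs.getString("field2"), Field.Store.NO));
                writer.addDocument(doc); // no per-document lookup or commit
            }
            writer.commit(); // a single commit at the end
        }
    }
}
```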

Of course, this kind of approach requires some downtime. But with good enough optimization, perhaps you can push the downtime to an acceptable level, from hours to merely minutes.

If you cannot push the downtime to an acceptable level, your only solution is a background indexer. We had a table named indexqueue in the database that contained object identifiers. The background indexer regularly fetched rows from this table and reindexed only those rows. To trigger a full reindex, you just insert every object identifier into the indexqueue table and wait for all of them to be reindexed.
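The indexer itself can stay small. A sketch, assuming an indexqueue table with an object_id column and an application-specific rebuildDocument (LIMIT is not standard SQL; adjust for your database):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class QueueIndexer {

    // Drains one batch from the queue; run repeatedly from a scheduler.
    public void drainBatch(Connection conn, IndexWriter writer) throws Exception {
        try (PreparedStatement fetch = conn.prepareStatement(
                     "SELECT object_id FROM indexqueue LIMIT 100");
             PreparedStatement remove = conn.prepareStatement(
                     "DELETE FROM indexqueue WHERE object_id = ?");
             ResultSet rs = fetch.executeQuery()) {
            while (rs.next()) {
                String id = rs.getString("object_id");
                writer.updateDocument(new Term("id", id), rebuildDocument(id));
                remove.setString(1, id);
                remove.executeUpdate();
            }
        }
        // A robust implementation would commit the index before deleting
        // the queue rows, so an interruption only causes re-indexing.
        writer.commit();
    }

    private Document rebuildDocument(String id) {
        // Application-specific: load the object and build its index fields.
        throw new UnsupportedOperationException();
    }
}
```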

As for the record count, we had millions of records. Your hundreds-of-thousands-of-records problem, given the modest progress in CPU performance over those 8 years and the great progress in memory sizes, is minuscule compared to the millions-of-records problem we had. We used spinning hard disks back then, and today's servers use SSDs!

I'm pretty sure you can reindex thousands of rows per second if you do things optimally, unless of course each row contains something like the text of a 1000-page book. That works out to hundreds of thousands of rows per minute.

juhist

Even if you introduce something like your version field into the documents, you cannot know in advance what kinds of bugs will appear in the future. There can always be one that makes it necessary to rebuild the whole index, so you should look for a strategy that makes this possible.

You wrote

Recalculating the complete index takes multiple hours, and the application is redeployed regularly as part of continuous deployment. It must therefore be expected that the application is shut down in the middle of the updating process.

and that is actually both the problem and its solution: as soon as you decouple the main application (and its deployments) from the process that recalculates the index, your problem is solved. You can then shut down and redeploy the application without shutting down the recalculation process.

Of course, you will probably have to take some additional measures, such as:

  • designing the recalculation so it does not interfere with the main application's search feature. For example, you could first run a long-running process that calculates the necessary changes to the index without applying them, and apply the changes afterwards (which should take only a fraction of the total time, if only a small fraction of the index entries is affected)

  • logging whether the recalculation was successfully applied, so it does not get applied a second time (see the sketch after this list)

  • and, if necessary, implementing a strategy for restarting the index recalculation at the point where it was interrupted.
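For the logging point, something as small as this is usually enough (the index_migrations table is just an example):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class MigrationLog {

    public boolean alreadyApplied(Connection conn, String migrationId) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT 1 FROM index_migrations WHERE id = ?")) {
            ps.setString(1, migrationId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }

    public void markApplied(Connection conn, String migrationId) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO index_migrations (id, applied_at) VALUES (?, CURRENT_TIMESTAMP)")) {
            ps.setString(1, migrationId);
            ps.executeUpdate();
        }
    }
}
```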

Doc Brown

Here are a couple of alternative solutions to your problem:

  1. You could use the blue-green deployment strategy with your search index. Basically, you would periodically build a fresh index in the standby deployment. When the index has finished building, you switch production over to the standby, and the previously live index becomes the new standby. (A sketch follows after this list.)
  2. You could use your search index only for searching, and look up the actual documents and attributes in your database. Your search results would then only return item IDs, and you would query the database for the up-to-date attributes of each item.
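For option 1, the switch itself can be as simple as an atomic reference to the currently live index directory (all names here are illustrative; a real implementation would manage reader lifecycles, e.g. with SearcherManager):

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicReference;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;

public class BlueGreenIndex {

    private final Directory blue;
    private final Directory green;
    private final AtomicReference<Directory> live = new AtomicReference<>();

    public BlueGreenIndex(Directory blue, Directory green) {
        this.blue = blue;
        this.green = green;
        this.live.set(blue);
    }

    // Searches always go against the live side.
    public IndexSearcher acquireSearcher() throws IOException {
        return new IndexSearcher(DirectoryReader.open(live.get()));
    }

    // The rebuild writes into this side while searches continue on the other.
    public Directory standby() {
        return live.get() == blue ? green : blue;
    }

    // Called once the standby has been fully rebuilt.
    public void swap() {
        live.set(standby());
    }
}
```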
Samuel
  • The problem with rebuilding the index is that the interval between deployments is shorter than the time it takes to build the index. You would still need to know which documents have been reindexed and which have not. – bikeman868 Oct 04 '17 at 04:55