Describing the situation
I'm working on an application (based on the Spring Framework) that uses a search index (Lucene, if that matters) to make its content searchable. Documents are added to or updated in that index whenever content in the application changes, and deleted whenever the corresponding content is deleted.
We had a bug where the trigger that updates a document on content changes did not fire in some cases. As a result, some of the documents contain invalid (outdated) values. The bug has been fixed, so future changes will be written to the index correctly.
However, I want to fix the invalid documents already in the index and would like to know what the best strategy for that would be. Important conditions are:
- Recalculating the complete index takes multiple hours, and the application is redeployed regularly as part of continuous deployment. Therefore it must be expected that the application is shut down in the middle of the update process.
- Most of the documents are not invalid.
- I'm not able to recognize invalid documents based on the index alone. This would require lots of information from a database.
- The invalid values of the documents are not particularly important. The most relevant field (the name) was not affected by the bug. Therefore even documents with invalid values work correctly in most of the use cases.
- I would like a solution that will work for future issues too.
I think a similar case occurs if we extend the index in future versions, e.g. add a field. This would require us to update all documents to add the field, while the main use case of the index will work without that field too.
Possible solution
My idea is to add a version field to the documents. I would then add a job that runs every few minutes, fetches a batch of documents with an old version (or without any version on the initial run), recalculates the required fields, sets the version field to the current version and updates the documents in the index.
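To make the idea concrete, here is a minimal sketch of one run of such a job. It is not tied to Lucene: the index is simulated by an in-memory map, and the class name `VersionedReindexJob`, the field name `indexVersion` and the `rebuildFromDatabase` callback are all hypothetical names I made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch only: the "index" is a plain map (document id -> stored fields),
// standing in for the real search index.
class VersionedReindexJob {
    // Assumed name of the version field stored with every document.
    static final String VERSION_FIELD = "indexVersion";

    private final Map<String, Map<String, String>> index;
    private final int currentVersion;
    private final int batchSize;
    // Hypothetical callback that recalculates a document's fields from the database.
    private final Function<String, Map<String, String>> rebuildFromDatabase;

    VersionedReindexJob(Map<String, Map<String, String>> index,
                        int currentVersion,
                        int batchSize,
                        Function<String, Map<String, String>> rebuildFromDatabase) {
        this.index = index;
        this.currentVersion = currentVersion;
        this.batchSize = batchSize;
        this.rebuildFromDatabase = rebuildFromDatabase;
    }

    // Collects up to batchSize ids whose version is missing or older than the current one.
    List<String> findStaleIds() {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry : index.entrySet()) {
            if (stale.size() >= batchSize) {
                break;
            }
            String version = entry.getValue().get(VERSION_FIELD);
            if (version == null || Integer.parseInt(version) < currentVersion) {
                stale.add(entry.getKey());
            }
        }
        return stale;
    }

    // Processes one batch and returns how many documents were rewritten.
    // A return value of 0 means every document is at the current version.
    int runOnce() {
        List<String> stale = findStaleIds();
        for (String id : stale) {
            Map<String, String> fresh = new HashMap<>(rebuildFromDatabase.apply(id));
            fresh.put(VERSION_FIELD, Integer.toString(currentVersion));
            index.put(id, fresh); // replaces the whole document, like an index update
        }
        return stale.size();
    }
}
```

Because progress is recorded only in the version field of each document, a redeployment between two runs loses nothing: the next run simply picks up the remaining stale documents. In the real application I would expect `runOnce` to be triggered by something like a Spring `@Scheduled` method, and the map operations to be replaced by a query on the version field plus `IndexWriter.updateDocument`, but I have not worked out those details.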
Pros of this solution:
- If the update is interrupted, the application recognizes which documents are already fixed and which aren't.
- This information is stored within the index itself, which is where it belongs (earlier I had considered storing it in the database).
Con of this solution:
- Every document has to be updated, even if it has no invalid values.
My question
Is this a reasonable solution to the problem? Are there better approaches? I couldn't find anything on how to solve this, nor any indication of whether adding a version to your documents is a good idea.
Maybe I'm also overthinking the situation and a much simpler solution is possible?