Hypothesis
Power supply spikes reaching the RTC, e.g. during main power on/off, are causing data corruption and hardware damage.
Analysis of information so far
From the information so far:
90% of my setups do not face this issue, while 10% do.
Therefore there must be one or more differences between the group without the problem ("good") and the group with the problem ("bad"). Look for those difference(s), since analysing them may lead to the underlying cause(s).
Besides the usual suspects (e.g. real hardware differences between the good & bad groups, perhaps components bought from different sources†), the difference(s) might lie in how the devices are used: for example, noise on the incoming power supply rail(s) depending on the specific PSU, the length of power supply cabling (i.e. added inductance) connected to your devices, how often the devices are switched between main and battery power, etc.
(† As Chupacabras kindly mentioned in a comment, some hardware suppliers are less trustworthy than others.)
As an estimate, I can say that the chip is on (5V) for roughly 40-50% of the time. Remaining time, it is running on the 3.3V battery.
I interpret this as meaning that, during the 3-4 months before problems might be seen, the main 5V power is switched on and off a number of times. That is critical information, as it points to a possible cause to which RTC devices are particularly prone - see below.
Your answer to Bruce Abbott's question is also important:
"Most of the times, reprogramming the chip works." - so sometimes, reprogramming the chip doesn't work? Why not?
I found that very weird. I then have to replace the chip and the crystal. Then it definitely works. Sometimes I try and replace only the crystal - and it works. Sometimes I also have to replace the chip.
That strongly implies hardware damage has occurred.
In my experience with RTCs, the main cause of both (a) corrupt time (not just a clock running fast or slow, which can have different causes) and (b) hardware damage is spikes in the power supply to the RTC during "main power" on/off.
Notice how a typical RTC datasheet (including the one for the DS1307) has a line similar to this:
WARNING: Negative undershoots below -0.3V while the part is in battery-backed mode may cause loss of data.
The consequences can be worse than just a loss of RTC data. Latch-up may occur, potentially resulting in internal damage.
For more background on this topic, see this application note AN1549 from Intersil (now Renesas):
Addressing Power Issues in Real Time Clock Applications
and this application note AN504 from Maxim:
Design Considerations for Maxim Real-Time Clocks
(especially section "Data Loss/Data Corruption" on page 10)
and this previous question, where power-supply spikes when turning the main 5V supply off & on affected the RTC part of an MCU - look at the oscilloscope traces added at the end of the question:
STM32F091 VBat pin sinking a lot of mA's
Fix to be considered and further tests
I could be wrong (until we see the full schematic), but I'm assuming that there are no other relevant components near the RTC beyond those shown in the schematic snippet currently in the question.
Therefore, as recommended in Intersil, Maxim & ST documentation, a suitable reverse-biased Schottky diode (e.g. BAT54) across the power pins of the RTC may be required, to prevent any negative excursions below -0.3V.
I suggest doing some experiments with a scope connected directly across the DS1307 VCC and GND pins on one of your boards (absolute minimum ground wire length on your scope probe - preferably a ground spring - to minimise the added inductance affecting readings) while power-cycling the "main power" to the board. Use the same power supply source and power supply wiring as some of the "bad" (i.e. "affected/damaged") devices you have seen so far, to improve your chance of reproducing problems. What "spikes" on the DS1307 VCC (main power) do you see on the scope traces?
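If your scope can export a capture as time/voltage samples, a short script can flag undershoots so you don't have to rely on spotting them by eye. The following is only a minimal sketch: the function name `find_undershoots` is my own, the -0.3V limit is taken from the datasheet warning quoted above, and it assumes you have already loaded the samples into two equal-length lists (how you export them depends on your scope).

```python
def find_undershoots(times, volts, limit=-0.3):
    """Find runs of consecutive samples where the voltage is below `limit`.

    Returns a list of (start_time, end_time, min_voltage) tuples, one per run.
    `times` and `volts` are equal-length sequences exported from the scope.
    """
    events = []
    start = None   # time of the first sample in the current run, if any
    last = None    # time of the most recent sample below the limit
    worst = 0.0    # most negative voltage seen in the current run
    for t, v in zip(times, volts):
        if v < limit:
            if start is None:
                start, worst = t, v
            else:
                worst = min(worst, v)
            last = t
        elif start is not None:
            # The run has ended: record it and start looking for the next one.
            events.append((start, last, worst))
            start = None
    if start is not None:
        # The capture ended while still below the limit.
        events.append((start, last, worst))
    return events
```

An empty result only tells you that this particular capture, at this sample rate, contained no sample below the limit. Very short spikes can fall between samples, so use the fastest timebase that still covers the power-on/off edge.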
Another suggestion: it may help to devise a test that triggers the problem more quickly than the current 3-4 months, as that would allow you to make changes and test whether they resolve the problem. As you see above, my hypothesis is that power-cycling the main power might be that trigger (with a power supply that was used on a device which eventually went "bad", etc.). That power cycling could be automated with a suitable test rig, which could also read the RTC values and detect whether a problem has occurred (you have to define the failure criteria of course, e.g. wrong year, or time not incrementing).
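As a rough illustration of what the software side of such a rig might look like, here is a sketch in Python. It makes several assumptions: the register layout is the standard DS1307 timekeeping map (registers 0x00 to 0x06, packed BCD, CH bit in bit 7 of the seconds register), the RTC is in 24-hour mode and was set to the host's time before the test starts, and years are taken as 20xx. `read_regs` and `set_main_power` are hypothetical callables that you would implement for your own hardware (e.g. an I2C read and a relay or load switch); they are not part of any existing library.

```python
import time
from datetime import datetime

def bcd(value):
    """Convert one packed-BCD byte to an integer, rejecting invalid digits."""
    high, low = value >> 4, value & 0x0F
    if high > 9 or low > 9:
        raise ValueError(f"invalid BCD byte 0x{value:02X}")
    return high * 10 + low

def decode_ds1307(regs):
    """Decode DS1307 registers 0x00..0x06 (24-hour mode assumed).

    Returns (clock_halted, datetime); raises ValueError on corrupt contents.
    """
    if len(regs) != 7:
        raise ValueError("expected 7 register bytes")
    clock_halted = bool(regs[0] & 0x80)   # CH bit: oscillator stopped
    if regs[2] & 0x40:
        raise ValueError("12-hour mode bit set; this sketch assumes 24-hour mode")
    second = bcd(regs[0] & 0x7F)
    minute = bcd(regs[1] & 0x7F)
    hour = bcd(regs[2] & 0x3F)
    day = bcd(regs[4] & 0x3F)             # regs[3] is day-of-week, unused here
    month = bcd(regs[5] & 0x1F)
    year = 2000 + bcd(regs[6])            # assumption: 20xx
    # datetime() raises ValueError for impossible values such as month 0.
    return clock_halted, datetime(year, month, day, hour, minute, second)

def check_rtc(regs, expected, tolerance_s=2):
    """Return a list of failure descriptions; an empty list means pass."""
    try:
        halted, now = decode_ds1307(regs)
    except ValueError as exc:
        return [f"corrupt registers: {exc}"]
    failures = []
    if halted:
        failures.append("clock halt (CH) bit set")
    if abs((now - expected).total_seconds()) > tolerance_s:
        failures.append(f"time is {now}, expected about {expected}")
    return failures

def run_cycles(read_regs, set_main_power, cycles, off_s=1.0, on_s=1.0,
               clock=datetime.now, sleep=time.sleep):
    """Power-cycle the board `cycles` times, checking the RTC after each cycle.

    Returns (cycle_number, failures) at the first failure, or None if all pass.
    """
    for n in range(1, cycles + 1):
        set_main_power(False)   # RTC falls back to battery
        sleep(off_s)
        set_main_power(True)    # RTC returns to main 5V
        sleep(on_s)
        failures = check_rtc(read_regs(), clock())
        if failures:
            return n, failures
    return None
```

The failure criteria here (invalid BCD, impossible date, CH bit set, time far from the host clock) are just one possible definition, and the tolerance has to allow for normal crystal drift over the length of the test. Log the raw register bytes on failure as well, since the corruption pattern itself may be a useful clue.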