From time to time, some of our customers report unexpected behavior in one feature of our software, and we suspect that we have a bug.
The feature itself and the kind of bug are not important for this discussion, but to make things concrete, the broken piece is a command scheduler. From time to time, scheduled commands are lost and are not executed at the scheduled time of day. We are currently unable to reproduce this issue in a controlled way.
While investigating the service that implements the broken feature, we noticed that the current implementation logs too little, which makes its runtime behavior quite difficult to understand. So we decided to improve the logging in order to get better insight into what happens at runtime in the customer installation.
While reasoning about this problem I asked myself a basic question: is it a good idea to depend on debug-level logs in order to fully understand what's going on in a software product? Is there a better way to handle this kind of situation?
The point is that no one runs software in production with debug-level logs enabled (at least not in standard scenarios). When debug-level logs are enabled, a large volume of log entries is written, and this can hurt the log store in terms of both storage consumption and performance.
So, the first problem is that debug level logs are not enabled by default in production and this means that when a problem arises for the first time you don't have your precious logs which can help you fully understand what exactly happened. You just observe an unexpected behavior, but you don't have a clear idea of the root cause.
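To make the point concrete, here is a minimal Python sketch using the standard `logging` module. The scheduler function, the logger names and the log messages are all hypothetical and only stand in for our real service. It shows that, with the logger set to the usual production level (INFO), the debug record describing the command never reaches the output, while the same code run at DEBUG level does record it.

```python
import io
import logging


def run_scheduler_tick(logger: logging.Logger) -> None:
    # Hypothetical scheduler step; the message contents are illustrative only.
    logger.debug("evaluating command id=42 due=08:00")
    logger.info("scheduler tick completed")


def capture(level: int) -> str:
    """Run one tick with the logger set to `level` and return what was written."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

    # Use a separate logger per level so the two runs do not interfere.
    logger = logging.getLogger(f"scheduler.{level}")
    logger.setLevel(level)
    logger.propagate = False  # keep the output confined to our in-memory stream
    logger.addHandler(handler)

    run_scheduler_tick(logger)

    logger.removeHandler(handler)
    return stream.getvalue()


production_output = capture(logging.INFO)   # typical production setting
debug_output = capture(logging.DEBUG)       # enabled only while investigating

print("production:", repr(production_output))
print("debug:     ", repr(debug_output))
```

In the production run the only trace of the tick is the INFO line, so if the command with id 42 were lost there would be nothing in the log store pointing to it.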
This can be very harmful because, in many cases, the steps needed to reproduce an unexpected behavior are unknown or not very clear. This means that, once debug-level logs are enabled to carry on the investigation, it is entirely possible that you won't be able to reproduce the issue observed before, and you are left stuck, unable to understand the root cause.
Are there better alternatives to low-level logs for handling these scenarios?