Are there better alternatives to debug level logs to investigate a bug in a production environment?

Question

Some of our customers from time to time report an unexpected behavior in one feature of our software and we suspect that we have a bug.

The feature itself and the kind of bug are not important for this discussion, but to make things concrete, the broken piece is a command scheduler. From time to time, scheduled commands are lost and are not executed at the scheduled time of day. We are currently unable to reproduce this issue in a controlled way.

By investigating the service in charge of implementing the broken feature we noticed that the current implementation has an insufficient number of logs and this makes understanding the runtime behavior quite difficult. So, we decided to improve the logging in order to have better insights about the runtime behavior in the customer installation.

While reasoning about this problem I asked myself a basic question: is it a good idea to depend on debug-level logs to fully understand what is going on in a software product? Is there a better way to handle this kind of situation?

The point is that no one runs software in production with debug-level logs enabled (at least not in standard scenarios). When debug-level logs are enabled, a large volume of logs is written, and this can harm the log store in terms of both storage consumption and performance.

So, the first problem is that debug level logs are not enabled by default in production and this means that when a problem arises for the first time you don't have your precious logs which can help you fully understand what exactly happened. You just observe an unexpected behavior, but you don't have a clear idea of the root cause.

This point can be very harmful because, in many cases, the pattern to reproduce an unexpected behavior is unknown or not very clear. This means that, once the debug level logs are enabled in order to carry on the investigation, it is entirely possible that you won't be able to reproduce the issue observed before and you are stuck unable to understand the root cause.

Are there better alternatives to low-level logs for handling these scenarios?

Depending on the technology you are working with, you might be able to use remote debugging, which allows you to put breakpoints and step through your code in production — overloading, Mar 25 '20 at 20:13
Yes, sometimes you *very definitely **do*** "run logs in production!" — Mike Robinson, Mar 26 '20 at 20:20

Doc Brown · Accepted Answer · 2020-03-27T15:10:25.857

When you cannot reproduce a bug like a broken command scheduler outside of a production environment, and cannot mirror the production environment in your testing environment in a "similar enough" manner, logging, despite the drawbacks mentioned in the question, is IMHO still the best tool you have.

So in case you need "debug level logs", but your current logging mechanism would eat up too much storage or performance to be useful in production, I would recommend optimizing the logging. For example,

- make sure you can configure the logging not only "horizontally" (like "minimal", "standard", and "debugging" level), but also "vertically" (per "module"). That could be used to set the log level to "debugging" for the command scheduler, but keep it at "minimal" for the rest of the system.
- to optimize space, make sure debug logs are kept no longer than necessary. For example, if you can detect that a bug has occurred within an hour after it occurred, make it possible to automatically delete "debug" log entries which are older than, let's say, two hours.
- optimize for speed by utilizing parallelization: maybe managing the log entries and writing them to disk can be done by an asynchronous logging service? Of course, that could become a source of extra errors, but it may be worth a try if your logging system is not already implemented that way.
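To illustrate the "vertical" configuration from the first point, here is a minimal sketch using Python's standard `logging` module. The language choice and the module names `app.scheduler` and `app.billing` are my assumptions for illustration; most logging frameworks offer an equivalent per-logger level.

```python
import logging

def configure_logging(debug_modules=(), default_level=logging.WARNING):
    """Keep the whole system at default_level, except the listed modules."""
    root = logging.getLogger()
    root.setLevel(default_level)
    if not root.handlers:
        # Send records to stderr if nothing else has been configured yet.
        root.addHandler(logging.StreamHandler())
    for name in debug_modules:
        # Child loggers such as "app.scheduler.queue" inherit this level.
        logging.getLogger(name).setLevel(logging.DEBUG)

# "app.scheduler" and "app.billing" are made-up module names.
configure_logging(debug_modules=["app.scheduler"])
logging.getLogger("app.scheduler").debug("next run computed")   # verbose module
logging.getLogger("app.billing").debug("invoice cache warmed")  # stays quiet
```

Only the scheduler pays the cost of debug output here; everything else stays at the default level.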

Optimization is very case-dependent, so there is not a one-size-fits-all solution for this, but I guess you get the general idea.
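As one possible shape for the asynchronous logging idea, Python's standard library ships `QueueHandler` and `QueueListener`. The sketch below assumes a Python service; `start_async_logging` is a hypothetical helper name, not part of any library.

```python
import logging
import logging.handlers
import queue

def start_async_logging(logger, slow_handler):
    """Route records through an in-memory queue so callers do not wait on I/O."""
    log_queue = queue.Queue(-1)  # unbounded; a bounded queue would need a drop policy
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    # The listener drains the queue on its own thread and feeds slow_handler.
    listener = logging.handlers.QueueListener(
        log_queue, slow_handler, respect_handler_level=True
    )
    listener.start()
    return listener
```

The caller should keep the returned listener and call `listener.stop()` at shutdown so that queued records are flushed. As noted above, the extra thread is one more thing that can fail, so this is a trade-off rather than a free win.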

As an alternative to logging (or at least to "broad logging"), it could help to implement a mechanism that automatically detects a specific bug as soon as possible after it occurs, and then writes a post-mortem dump to disk. I don't mean this in the narrow sense of a persisted memory image of a crashed process, but as any kind of custom dump, implemented by yourself, containing all the information about the current state of the system which might be useful for finding the root cause of the issue.

What this has to look like, and whether it makes sense for your case, depends heavily on the details of the system, but there are definitely lots of systems where this has worked well in the past.
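For the scheduler case, such a custom dump could look roughly like the sketch below. Everything in it is assumed for illustration: the command records are plain dicts with `id` and `due` (epoch seconds), and `find_missed_commands` and `write_postmortem_dump` are hypothetical helpers, not an existing API.

```python
import json
import time
from pathlib import Path

def find_missed_commands(scheduled, executed_ids, now, grace_seconds=60):
    """Return commands that were due more than grace_seconds ago but never ran."""
    return [
        cmd for cmd in scheduled
        if cmd["id"] not in executed_ids and now - cmd["due"] > grace_seconds
    ]

def write_postmortem_dump(dump_dir, reason, state):
    """Persist whatever state might explain the problem as one JSON file."""
    dump_dir = Path(dump_dir)
    dump_dir.mkdir(parents=True, exist_ok=True)
    path = dump_dir / f"dump-{time.time_ns()}.json"
    payload = {"reason": reason, "written_at": time.time(), "state": state}
    # default=str keeps the dump from failing on values JSON cannot encode.
    path.write_text(json.dumps(payload, indent=2, default=str))
    return path
```

A periodic check would call `find_missed_commands` and, when the result is non-empty, pass the missed commands together with the queue contents, timer state, and recent configuration changes as `state`.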

A variance of the post-mortem dump: Monolog (a logging library for PHP) has a so-called "fingers crossed" handler. It keeps lower level messages in memory until a high enough level gets logged, then the entire stack is persisted. Otherwise, the stack is discarded when it's no longer relevant. — Duroth, Mar 27 '20 at 15:37
In the software we use (vendor-supplied software) you can, while the software is running, turn on and off log levels for modules, functions, and even individual log lines. Each log line has an id that can be used to target that line, its function or its module. — Jerry Jeremiah, Sep 29 '21 at 22:31

score 1 · Answer 2 · answered Mar 25 '20 at 22:50

I don't know about "Best", but these come to mind:

First of all, your intuition is correct: yes, sometimes the operational logs fail to cover the first occurrence well. But their goal is to catch the problem, not to debug it.

.NET's IntelliTrace gives an idea: keep multiple logs.
- Warn level: kept forever
- Debug: pruned after a set number of days

This won't help with performance, but it will clean up the log store (most loggers can do that easily through configuration).

One more idea is to make a "dynamic" log level: the log level rises as a function of how many errors occur.
(Note: I think this solution is too complicated and not worth the effort.)
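The two-log idea can be sketched with Python's standard handlers, assuming a Python service. The file names and the `build_handlers` helper are made up for this example, and the retention period is approximated by the number of daily files kept.

```python
import logging
import logging.handlers
from pathlib import Path

def build_handlers(log_dir, debug_retention_days=3):
    """Return (warn_handler, debug_handler) with different retention."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)

    # Warnings and errors: a plain file that is never pruned here.
    warn = logging.FileHandler(str(log_dir / "warnings.log"))
    warn.setLevel(logging.WARNING)

    # Everything: rotated at midnight, keeping only the newest N old files.
    debug = logging.handlers.TimedRotatingFileHandler(
        str(log_dir / "debug.log"), when="midnight",
        backupCount=debug_retention_days,
    )
    debug.setLevel(logging.DEBUG)
    return warn, debug
```

Both handlers are attached to the same logger, so the debug file holds the full picture while the warning file stays small enough to keep indefinitely.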

Yeah, if you look at `/var/log` on any "production" Linux/Unix system, you'll see that it's stuffed with logs. Your laptop (Mac or PC or Linux) keeps logs. Your phone keeps logs. And then we have nifty tools like `logrotate` which can automatically "gzip" logs and eventually discard them. — Mike Robinson, Mar 26 '20 at 20:22

score 1 · Answer 3 · edited Mar 27 '20 at 12:55

The problem is:

You want to log everything if there is a problem.
You don’t want to log anything if there is no problem.
You don’t know if there is going to be a problem.

If your code can recognise that there is a problem after it happened, you could solve this with quite a bit of work:
You have logging code for everything. However, the logging code only stores the logging statements in a buffer, at minimal cost. You tell the logger when you start and finish a complex operation and when an error is detected. When the operation finishes, you store all the buffered logging statements in your persistent store only if an error was detected along the way; otherwise you store only the production logging statements.

That system doesn’t have to be all or nothing: If you didn’t tell the logger that an operation started, it ignores debug logs and only writes production logs, as it does now.
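A minimal sketch of such a buffering logger, written as a handler for Python's standard `logging` module. The class name `OperationBuffer`, its parameters, and the choice of INFO as the "production" level are assumptions for illustration, and the sketch assumes one operation at a time on a single thread.

```python
import logging
from contextlib import contextmanager

class OperationBuffer(logging.Handler):
    """Buffer records during an operation; persist debug detail only on error."""

    def __init__(self, target, production_level=logging.INFO,
                 error_level=logging.ERROR):
        super().__init__(level=logging.DEBUG)
        self.target = target              # handler that writes to the real store
        self.production_level = production_level
        self.error_level = error_level
        self.buffer = None                # None means no operation in progress

    def emit(self, record):
        if self.buffer is not None:
            self.buffer.append(record)    # cheap: just keep it in memory
        elif record.levelno >= self.production_level:
            self.target.handle(record)    # outside operations: production logs only

    @contextmanager
    def operation(self):
        """Mark the start and end of one complex operation."""
        self.buffer = []
        try:
            yield
        finally:
            records, self.buffer = self.buffer, None
            failed = any(r.levelno >= self.error_level for r in records)
            for r in records:
                if failed or r.levelno >= self.production_level:
                    self.target.handle(r)
```

Code wraps each complex operation in `with handler.operation():`. If nothing at ERROR level or above is logged inside the block, the debug records are simply dropped when the block exits.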

score 0 · Answer 4 · edited Sep 27 '21 at 13:53

I see that logging may help to find the bug, and a good analyst is someone who knows how to read their logs. Take a web server as an example: if you are getting access logs from a web server, you can use an analyzer tool, and there is a wide range of tools to choose from. Personally I suggest AWStats; it is a good one, and it helps you identify the peaks and the different metrics related to your log file.

By identifying the source of a peak in the number of hits at a given time, you will know how to debug, and when and where your problem occurs!

score 0 · Answer 5 · answered Sep 27 '21 at 14:38

While the answer may depend somewhat on the details of the issue, in most situations I rely most of all on metrics, with logs being only a second choice, or used in situations where more details are needed than metrics can provide (e.g. particular IDs of objects being processed). Metrics can provide a good picture of what's going on in the system and at the same time, being just series of numbers, they are much cheaper than logs performance-wise, so you can (and should) collect them all the time.

Designing metrics for your application may take some thought, but since having even hundreds of them is usually not an issue, you should meter all "important" things going on in the system. For example, if you process some object in multiple steps, you should measure, at each stage, the number of incoming objects, correctly processed objects, and errors. This way, if anything goes wrong, you will see which stage causes the problem and be able to narrow down your search (possibly checking the logs, or maybe just other metrics). With all operations heavily metered, you can usually find out what's going on from metrics alone.
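As a rough illustration of per-stage metering, here is a sketch using plain in-process counters. The stage names and the `run_stage` helper are made up, and a real system would export the counters to whatever metrics backend it uses.

```python
from collections import Counter

metrics = Counter()  # stand-in for a real metrics client

def run_stage(stage, items, func):
    """Apply func to each item, counting incoming, processed and failed items."""
    results = []
    for item in items:
        metrics[f"{stage}.incoming"] += 1
        try:
            results.append(func(item))
        except Exception:
            metrics[f"{stage}.errors"] += 1
        else:
            metrics[f"{stage}.processed"] += 1
    return results

# Hypothetical two-stage pipeline: parse strings, then compute reciprocals.
parsed = run_stage("parse", ["4", "x", "0"], int)
inverted = run_stage("invert", parsed, lambda n: 1 / n)
```

Comparing `incoming` with `processed` and `errors` for each stage shows where objects go missing, which is the kind of narrowing down described above.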
