39

A common pattern for locating a bug follows this script:

  1. Observe weirdness, for example, no output or a hanging program.
  2. Locate the relevant message in the log or program output, for example, "Could not find Foo". (The following is only relevant if this is the path taken to locate the bug; if a stack trace or other debugging information is readily available, that’s another story.)
  3. Locate code where the message is printed.
  4. Debug the code between the first place Foo enters (or should enter) the picture and where the message is printed.

That third step is where the debugging process often grinds to a halt, because there are many places in the code where "Could not find Foo" (or a templated string such as "Could not find {name}") is printed. In fact, several times a spelling mistake helped me find the actual location much faster than I otherwise would have: it made the message unique across the entire system, and often across the world, resulting in a relevant search engine hit immediately.

The obvious conclusion from this is that we should use globally unique message IDs in the code, hard-coding them as part of the message string, and possibly verifying that there’s only one occurrence of each ID in the code base. In terms of maintainability, what does this community think are the most important pros and cons of this approach, and how would you implement this or otherwise ensure that implementing it never becomes necessary (assuming that the software will always have bugs)?
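To make the idea concrete, here is a minimal sketch of what I have in mind (the logger and the ID format are placeholders, not an existing convention):

log.Error("E13482: Could not find Foo '{0}'", name); // hypothetical logger; "E13482" must occur exactly once in the code base

A simple grep or CI check could then verify that no ID is used twice.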

l0b0
  • 54
    Make use of your stack traces instead. The stack trace will not only tell you precisely where the error occurred, but also every function that called every function that called it. Log the entire trace when an exception occurs, if necessary. If you're working in a language that doesn't have exceptions, like C, that's a different story. – Robert Harvey Jan 30 '18 at 03:44
  • 6
    @l0b0 a small advice on wording. "what does this community think ... pros and cons" are phrases that may be seen as too broad. This is a site that allows for "good subjective" questions, and in return for allowing this type of questions, you as the OP would be expected to do the work of "shepherding" the comments and answers toward a meaningful consensus. – rwong Jan 30 '18 at 06:49
  • @rwong Thank you! I feel that the question has already received a very good and on point response, although this might have been better asked in a forum. I retracted my response to RobertHarvey's comment after reading JohnWu's clarifying response, in case that's what you're referring to. If not, do you have any specific shepherding tips? – l0b0 Jan 30 '18 at 06:55
  • @l0b0 Overall I think this is a good question, and the technique you mentioned is valuable overall and deserves more awareness among programmers. Just be attentive to requests for clarifications or practical use cases for example. – rwong Jan 30 '18 at 07:11
  • 1
    My messages look like "Could not find Foo during call to bar()". Problem solved. Shrug. Downside is it's a bit leaky for being seen by customers, but we tend to hide the detail of error messages from them anyway, making it available only to sysadmins who couldn't give a monkeys that they get to see some function names. Failing that, yes, a nice little unique ID/code will do the trick. – Lightness Races in Orbit Jan 31 '18 at 00:16
  • If you code in C or C++, be aware of `__FILE__` and `__LINE__`. PHP offers these too, and more. Other languages might have something similar – Mawg says reinstate Monica Jan 31 '18 at 13:13
  • 1
    This is VERY useful when a customer phones you up, and their computer is not running in English! Much less of an issue these days as we now have email and log files..... – Ian Jan 31 '18 at 13:19

6 Answers

58

Imagine you have a trivial utility function that is used in hundreds of places in your code:

decimal Inverse(decimal input)
{
    return 1 / input;
}

If we were to do as you suggest, we might write

decimal Inverse(decimal input)
{
    try 
    {
        return 1 / input;
    }
    catch (Exception)
    {
        log.Write("Error 27349262 occurred.");
        throw; // re-throw so the caller still sees the failure
    }
}

One error that could occur is a zero input, which would result in a divide-by-zero exception.

So let's say you see 27349262 in your output or your logs. Where do you look to find the code that passed the zero value? Remember, the function, with its unique ID, is used in hundreds of places. So while you may know that division by zero occurred, you have no idea whose 0 it is.

Seems to me if you're going to bother logging the message IDs, you may as well log the stack trace.

If the verbosity of the stack trace is what bothers you, you don't have to dump it as a string the way the runtime gives it to you. You can customize it. For example, if you wanted an abbreviated stack trace going only n levels deep, you could write something like this (if you use C#):

using System;
using System.Diagnostics;
using System.Linq;

static class ExtensionMethods
{
    // Builds an abbreviated trace like "Haystack.Main>Haystack.Test>Haystack.Needle",
    // limited to the given number of frames.
    public static string LimitedStackTrace(this Exception input, int layers)
    {
        return string.Join
        (
            ">",
            new StackTrace(input)
                .GetFrames()
                .Take(layers)
                .Select
                (
                    f => f.GetMethod()
                )
                .Select
                (
                    m => string.Format
                    (
                        "{0}.{1}", 
                        m.DeclaringType, 
                        m.Name
                    )
                )
                .Reverse()
        );
    }
}

And use it like this:

public class Haystack
{
    public static void Needle()
    {
        throw new Exception("ZOMG WHERE DID I GO WRONG???!");
    }

    private static void Test()
    {
        Needle();
    }

    public static void Main()
    {
        try
        {
            Test();
        }
        catch(System.Exception e)
        {
            //Get 3 levels of stack trace
            Console.WriteLine
            (
                "Error '{0}' at {1}", 
                e.Message, 
                e.LimitedStackTrace(3)
            );  
        }
    }
}

Output:

Error 'ZOMG WHERE DID I GO WRONG???!' at Haystack.Main>Haystack.Test>Haystack.Needle

Maybe easier than maintaining message IDs, and more flexible.

Steal my code from DotNetFiddle

John Wu
  • The OP suggests that these numbers are, in fact, unique throughout the code base, in which case a stack trace is a much better bet anyway. – Robert Harvey Jan 30 '18 at 03:44
  • 32
    Hmm I guess I didn't make my point clearly enough. I know they're unique Robert-- per code *location.* They are not unique per code *path.* Knowing the location is often useless, e.g. if the true problem is that an input was not set properly. I've edited my language slightly to emphasize. – John Wu Jan 30 '18 at 04:15
  • 1
    Good points, both of you. There's a different issue with stack traces, which may or may not be a deal breaker depending on the situation: Their size can result in them swamping the messages, especially if you want to include the *entire* stack trace rather than a shortened version like some languages do by default. Maybe an alternative would be to write a stack trace log *separately* and include numbered indexes to that log in the application output. – l0b0 Jan 30 '18 at 06:51
  • @l0b0 in other words, we would like to *compress* the log (in terms of the CPU and I/O and RAM overhead) while not discarding anything of importance. We would like this compression to be reversible, or lossless. – rwong Jan 30 '18 at 07:27
  • 12
    If you're getting so many of these that you're worried about flooding your I/O, there's something seriously wrong. Or are you just being stingy? The real performance hit is probably the stack unwind. – John Wu Jan 30 '18 at 09:09
  • 1
    Compressing stack traces in memory is easy - just deflate them. What's more interesting is how to compress them on screen. Base64 of deflate isn't particularly readable but then again does it need to be? – John Dvorak Jan 30 '18 at 09:14
  • 9
    Edited with a solution for shortening stack traces, in case you are writing logs to a 3.5 floppy ;) – John Wu Jan 30 '18 at 09:23
  • 7
    @JohnWu And also don't forget "IOException 'File not Found' at [...]" that tells you about fifty layers of the call stack but doesn't tell what exact bloody file was not found. – Joker_vD Jan 30 '18 at 20:33
13

Overall this is a valid and valuable strategy. Here are some thoughts.

This strategy is also known as "telemetry", in the sense that when all such information is combined, it helps "triangulate" the execution trace and allows a troubleshooter to make sense of what the user/application is trying to accomplish, and what actually happened.

Some essential pieces of data that must be collected (which we all know) are:

  • Location of code, i.e. call stack and the approximate line of code
    • "Approximate line of code" is not needed if functions are reasonably decomposed into suitably small units.
  • Any pieces of data which are pertinent to the success/failure of the function
  • A high-level "command" which can nail down what the human user / external agent / API user is trying to accomplish.
    • The idea is that software will accept and process commands coming from somewhere.
    • During this process, dozens to hundreds to thousands of function calls may have taken place.
    • We would like any telemetry generated throughout this process to be traceable back to the highest-level command which triggers this process.
    • For web-based systems, the original HTTP request and its data would be an example of such "high-level request information"
    • For GUI systems, the user clicking on something would fit this description.

Oftentimes, traditional logging approaches fall short because they fail to trace a low-level log message back to the highest-level command that triggered it. A stack trace only captures the names of the higher-up functions that helped handle the highest-level command, not the details (data) that are sometimes needed to characterize that command.

Normally, software isn't written to implement this kind of traceability requirement, which makes correlating the low-level message with the high-level command more difficult. The problem is particularly bad in freely multi-threaded systems, where many requests and responses can overlap, and where processing may be offloaded to a different thread than the one that received the original request.

Thus, to get the most value from telemetry, changes to the overall software architecture will be needed. Most interfaces and function calls will need to be modified to accept and propagate a "tracer" argument.

Even utility functions will need a "tracer" argument, so that if one fails, its log message can be correlated with the high-level command that triggered it.
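A minimal sketch of what such a tracer could look like in C# (all names here are invented for illustration, not taken from any particular framework):

using System;

// Hypothetical tracer that ties low-level log messages back to the
// high-level command that triggered them.
public class Tracer
{
    public Guid CommandId { get; } = Guid.NewGuid();
    public string Command { get; }

    public Tracer(string command)
    {
        Command = command;
    }

    public void Log(string message)
    {
        // Every line carries the command's identity, so overlapping requests can be untangled.
        Console.WriteLine("[{0}] {1}: {2}", CommandId, Command, message);
    }
}

public static class Program
{
    // Even a trivial utility accepts the tracer so that its failures are traceable.
    static decimal Inverse(decimal input, Tracer tracer)
    {
        if (input == 0)
        {
            tracer.Log("Inverse called with zero input");
        }
        return 1 / input;
    }

    public static void Main()
    {
        var tracer = new Tracer("PlaceOrder HTTP POST /orders");
        try
        {
            Inverse(0, tracer);
        }
        catch (DivideByZeroException)
        {
            tracer.Log("PlaceOrder failed");
        }
    }
}

The class itself is trivial; the architectural cost is the discipline of threading it through every interface so that even the lowest-level message carries the command's identity.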

Another kind of failure that will make telemetry tracing difficult is a missing object reference (a null pointer or reference). When some crucial piece of data is missing, it may be impossible to report anything useful about the failure.

In terms of writing the log messages:

  • Some software projects might require localization (translation into a foreign language) even for log messages only intended for administrators.
  • Some software projects might need a clear separation between sensitive and non-sensitive data, even for the purpose of logging, so that administrators cannot accidentally see certain sensitive data.
  • Do not try to obfuscate the error message. That would undermine the customers' trust. The customers' administrators expect to read those logs and make sense of them. Do not make them feel that there is some proprietary secret that must be hidden from the customers' administrators.
  • Do expect that customers will bring a piece of telemetry log and grill your technical support staff. They expect to know. Train your technical support staff to explain the telemetry log correctly.
rwong
  • 1
    Indeed, AOP has touted, primarily, its inherent ability to solve this problem - adding Tracer to every relevant call - with minimal invasion to the code base. – bishop Jan 30 '18 at 14:18
  • I would add, also, to the list of "writing log messages" that it's important to characterize the failure in terms of "why" and "how to fix" instead of just "what" happened. – bishop Jan 30 '18 at 14:20
  • Thank you for this really good answer. I've read tons of questions-answers, documentation, best practises for days but this answer just made everything clear to me on how I should log. – Shino Lex Apr 05 '22 at 07:20
6

SAP NetWeaver has been doing this for decades.

It has proven to be a valuable tool when troubleshooting errors in the massive code behemoth that is the typical SAP ERP system.

Error messages are managed in a central repository where each message is identified by its message class and message number.

When you want to output an error message, you only state the class, number, severity and message-specific variables. The text representation of the message is created at runtime. You usually see the message class and number in any context where messages appear. This has several neat effects:

  • You can automatically find any lines of code in the ABAP codebase which create a specific error message.

  • You can set dynamic debugger breakpoints which trigger when a specific error message is generated.

  • You can look up errors in the SAP knowledge base articles and get more relevant search results than if you look for "Could not find Foo".

  • The text representations of messages are translatable. So by encouraging the use of messages instead of strings you also get i18n capabilities.

An example of an error popup with message number:

[screenshot: error popup showing the message number]

Looking up that error in the error repository:

[screenshot: the message in the central message repository]

Find it in the codebase:

[screenshot: where-used list locating the MESSAGE statement in the ABAP code]

However, there are drawbacks. As you can see, these lines of code aren't self-documenting anymore. When you read source code and see a MESSAGE statement like those in the screenshot above, you can only infer from context what it actually means. Also, sometimes people implement custom error handlers which receive the message class and number at runtime. In that case the error can't be found automatically, or can't be found at the location where it actually occurred. The workaround for the first problem is to make it a habit to always add a comment in the source code telling the reader what the message means. The second is solved by adding some dead code to make sure the automatic message lookup works. Example:

" Do not use special characters
my_custom_error_handler->post_error( class = 'EU' number = '271').
IF 1 = 2.
   MESSAGE e271(eu).
ENDIF.    

But there are some situations where this is not possible. There are, for example, some UI-based business process modeling tools where you can configure error messages to appear when business rules are violated. The implementation of those tools is completely data-driven, so these errors won't show up in the where-used list. That means relying too much on the where-used list when trying to find the cause of an error can send you chasing a red herring.
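For readers outside the SAP world, the core idea of a centrally managed message repository can be sketched in a few lines of C#. This is a toy illustration of the concept, not how NetWeaver implements it:

using System;
using System.Collections.Generic;

// Toy message catalog: every message is identified by class + number,
// and its text is resolved (and could be translated) at runtime.
public static class MessageCatalog
{
    private static readonly Dictionary<string, string> Messages = new Dictionary<string, string>
    {
        { "EU271", "Do not use special characters" },
        { "FOO042", "Could not find Foo '{0}'" }
    };

    public static string Render(string messageClass, int number, params object[] args)
    {
        var key = messageClass + number.ToString("D3");
        return key + ": " + string.Format(Messages[key], args);
    }
}

// Usage: Console.WriteLine(MessageCatalog.Render("FOO", 42, "bar"));
// prints "FOO042: Could not find Foo 'bar'"

Because the ID is data rather than part of a free-form string, it is trivial to search for, look up, and translate.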

Philipp
  • Message catalogs also have been part of [GNU/Linux](https://www.gnu.org/software/libc/manual/html_node/Message-catalogs-a-la-X_002fOpen.html#Message-catalogs-a-la-X_002fOpen) - and UNIX generally as a [POSIX standard](http://pubs.opengroup.org/onlinepubs/9699919799/functions/catopen.html) - for some time. – bishop Jan 30 '18 at 14:24
  • @bishop I am usually not programming specifically for POSIX systems, so I am not familiar with it. Maybe you could post another answer which explains the POSIX message catalogs and what the OP could learn from their implementation. – Philipp Jan 30 '18 at 14:29
  • 3
    I was part of a project that did this back in the oughties. One issue we ran into was that, along with everything else, we put the human message for "could not connect to the database" in the database. – JimmyJames Jan 30 '18 at 19:07
5

The problem with that approach is that it leads to ever more detailed logging, 99.9999% of which you will never look at.

Instead, I recommend capturing the state at the beginning of your process and the success/failure of the process.

This allows you to reproduce the bug locally by stepping through the code, and it limits your logging to two places per process, e.g.:

OrderPlaced {id:xyz; ...order data..}
OrderPlaced {id:xyz; ...Fail, ErrorMessage..}

Now I can use the exact same state on my dev machine to reproduce the error, stepping through the code in my debugger and writing a new unit test to confirm the fix.
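A rough C# sketch of that pattern (the Order type and the plain console logging are stand-ins for whatever the real system uses):

using System;

public class Order
{
    public string Id { get; set; }
    public decimal Amount { get; set; }
}

public static class OrderService
{
    public static void PlaceOrder(Order order)
    {
        // State at the start of the process: enough to replay the request locally.
        Console.WriteLine("OrderPlaced {{id:{0}; amount:{1}}}", order.Id, order.Amount);
        try
        {
            if (order.Amount <= 0) throw new ArgumentException("Amount must be positive");
            // ... the actual order processing would go here ...
            Console.WriteLine("OrderPlaced {{id:{0}; Success}}", order.Id);
        }
        catch (Exception ex)
        {
            // Outcome: the failure and its message, keyed by the same id as the initial state.
            Console.WriteLine("OrderPlaced {{id:{0}; Fail, {1}}}", order.Id, ex.Message);
            throw;
        }
    }
}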

In addition, if required, I can avoid more logging by only logging failures, or by keeping the state elsewhere (database? message queue?).

Obviously we have to be extra careful about logging sensitive data. So this works particularly well if your solution is using message queues or the event store pattern, as the log need only say "Message xyz Failed".

Ewan
  • Putting sensitive data in a queue is still logging it. This is ill-advised, just as storing sensitive inputs in the DB without some form of cryptography is. – jpmc26 Jan 30 '18 at 23:25
  • if your system runs off queues or a db then the data is already there, and so should the security. Logging too much is only bad because the log tends to fall outside your security controls. – Ewan Jan 31 '18 at 03:54
  • Right, but that's the point. It's ill advised *because* that data remains there permanently and usually in completely clear text. For sensitive data, it's better to just not take the risk and minimize where you're storing it period, and then to be very aware and very careful of how you're storing it. – jpmc26 Jan 31 '18 at 04:39
  • Its traditionally permanent because you are writing to a file. But an error queue is transient. – Ewan Jan 31 '18 at 04:41
  • I would say that probably depends on the implementation (and possibly even the settings) of the queue. You can't just dump it in any queue and expect it to be safe. And what happens after the queue gets consumed? The logs must still be somewhere for someone to view. Additionally, that isn't an extra attack vector I'd want to open up even temporarily. If an attack finds out there's sensitive data going there, even the most recent entries might be valuable. And then there's the risk of someone not knowing and flipping a switch so it starts logging to disk, too. It's just a can of worms. – jpmc26 Jan 31 '18 at 05:40
  • My point is if you are already persisting sensitive data somewhere. Which you undoubtedly are. Then you log can just point at that location. The key is to program your system in such a way that the state of the individual processes is maintained when an error occurs. Yes blindly logging the whole thing is bad. I cover that in my answer. No it is not impossible to do and maintain security at the same time. – Ewan Jan 31 '18 at 05:49
  • additionally not all data is sensitive. your debugging will not be overly harmed if for example you get customer id rather than their name and address, but do get the rest of the order – Ewan Jan 31 '18 at 05:51
1

I would suggest that logging is not the way to go about this, but rather that this circumstance be considered exceptional (it locks up your program) and an exception be thrown. Say your code was:

public Foo GetFoo() {

     //Expecting that this should never be null.
     var aFoo = ....;

     if (aFoo == null) Log("Could not find Foo.");

     return aFoo;
}

It sounds like your calling code is not set up to deal with the fact that Foo does not exist, and your code should potentially be:

public Foo GetFooById(int id) {
     var aFoo = ....;

     if (aFoo == null) throw new ApplicationException("Could not find Foo for ID: " + id);

     return aFoo;
}

And this will provide a stack trace along with the exception, which can be used to help debugging.

Alternatively, if we expect that Foo can be null when retrieved and that is fine, then we need to fix the calling sites:

void DoSomeFoo(Foo aFoo) {

    //Guard checks on your input - complete with stack trace!
    if (aFoo == null) throw new ArgumentNullException(nameof(aFoo));

    ... operations on Foo...
}

The fact that your software hangs or acts 'weirdly' under unexpected circumstances seems wrong to me - if you need a Foo and can't handle it not being there, then it seems better to crash out than attempt to proceed along a path that may corrupt your system.

Paddy
0

Proper logging libraries do provide extension mechanisms, so if you want to know the method where a log message originated, they can do that out of the box. It does have an impact on performance, since the process requires generating a stack trace and traversing it until you are out of the logging library.
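As an aside, in C# you can also capture the call site without walking the stack at all, using the caller-info attributes. This is a minimal sketch with a made-up wrapper, not any particular logging library's API:

using System;
using System.Runtime.CompilerServices;

public static class Log
{
    // The compiler substitutes the calling member, file and line at each call site,
    // so there is no runtime stack walk.
    public static void Error(
        string message,
        [CallerMemberName] string member = "",
        [CallerFilePath] string file = "",
        [CallerLineNumber] int line = 0)
    {
        Console.WriteLine("{0} ({1}:{2}, {3})", message, file, line, member);
    }
}

// Usage: Log.Error("Could not find Foo");
// prints something like "Could not find Foo (C:\src\FooService.cs:42, GetFoo)"

That gives you the code location cheaply, though, like a hard-coded ID, it still says nothing about the code path that led there.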

That said, it really depends on what you want your ID to do for you:

  • Correlate error messages provided to the user to your logs?
  • Provide notation on what code was executing when the message was generated?
  • Keep track of machine name and service instance?
  • Keep track of the thread id?

All of these things can be done out of the box with proper logging software (i.e. not Console.WriteLine() or Debug.WriteLine()).

Personally, what's more important is the ability to reconstruct paths of execution. That's what tools like Zipkin are designed to accomplish: one ID to trace the behavior of one user action throughout the system. By putting your logs in a central search engine (like the ELK stack), you can not only find the longest-running actions but also call up the logs that apply to that one action.

Opaque IDs that change with every message are not very useful. A consistent ID used to trace behavior through an entire suite of microservices... immensely useful.

Berin Loritsch