Capturing warnings when batch processing is ignored

Question

My client has a process which iterates over a number of actions that may or may not apply to a user's portfolio. Quite frequently, processing of an action may give up and jump to the next action or portfolio without the process stopping (i.e. straight-through processing).

Sometimes the decision which led to the action processing being curtailed is written to the log and sometimes not. Even when it is, it is normally deliberately written at Info level rather than Error or Warning. There's a good reason for this: Errors and Warnings are also picked up by other processing rules and automatic tickets are generated. If we let this happen, we would be flooded with tickets, many of which would be unnecessary.

The problem is that occasionally a user will, not unreasonably, ask why an action was not carried out on their portfolio, and they normally do not ask until a few days or weeks after the processing has happened, by which time the logs have often been purged.

I have suggested that we should introduce a more robust system for capturing these problems at the time they happen and have thought of the following:

1. Simply improve the verbosity of the logging. Nice and easy, but it doesn't help with the log purging issue and doesn't make the problems any more obvious.
2. Add a separate process to scan and parse the logs, picking up such issues. The benefit is that little or no code change is needed, but it may not be very obvious that this process is running or exists, and it could be undermined if the logging messages are changed, causing pattern matching to fail. Could be a maintenance headache.
3. Alter the code to capture the messages separately to a database table. More robust and obvious in the code, but the change is bigger and the design of the new database table might become a challenge (i.e. should it just be a free-text field with a date-time index, or something more structured?). The structure could become more complicated if parameterisation is needed: an XML blob, maybe?
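To make the "free text versus something more structured" choice in the third option concrete, here is a minimal C++ sketch of a middle ground: a fixed reason code plus optional key/value parameters flattened into one text column. Everything in it is hypothetical (the `SkippedAction` record, the `SkipReason` values, the `SkippedActionStore` class), and the store is only an in-memory stand-in for a real SQL table.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical reason codes; the real list would come from the batch code.
enum class SkipReason { InsufficientFunds, MarketClosed, NotApplicable };

// Stable text form of a reason code, suitable for an indexed column.
inline const char* to_code(SkipReason r) {
    switch (r) {
        case SkipReason::InsufficientFunds: return "INSUFFICIENT_FUNDS";
        case SkipReason::MarketClosed:      return "MARKET_CLOSED";
        case SkipReason::NotApplicable:     return "NOT_APPLICABLE";
    }
    return "UNKNOWN";
}

// One row of a hypothetical "skipped_action" table.
struct SkippedAction {
    std::string portfolio_id;
    std::string action_name;
    SkipReason reason;
    std::string occurred_at;                     // ISO 8601 timestamp text
    std::map<std::string, std::string> params;   // optional details
};

// Flatten params into "key=value;key=value" so they fit in one text column.
// Assumes keys and values contain neither '=' nor ';'.
inline std::string encode_params(const std::map<std::string, std::string>& params) {
    std::string out;
    for (const auto& kv : params) {
        if (!out.empty()) out += ';';
        out += kv.first + "=" + kv.second;
    }
    return out;
}

// In-memory stand-in for the table; a real version would issue SQL inserts.
class SkippedActionStore {
public:
    void record(SkippedAction a) {
        assert(!a.portfolio_id.empty());  // every row must belong to a portfolio
        rows_.push_back(std::move(a));
    }

    // Answers "why was nothing done on this portfolio?"
    std::vector<SkippedAction> for_portfolio(const std::string& id) const {
        std::vector<SkippedAction> out;
        for (const auto& row : rows_) {
            if (row.portfolio_id == id) out.push_back(row);
        }
        return out;
    }

private:
    std::vector<SkippedAction> rows_;
};
```

With a shape like this, the reason code and portfolio id can be indexed and searched, while the parameters stay loosely structured without needing an XML blob.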

Other aspects that drive the design are: How easy can we use this information? (i.e. could we present it to the user themselves to prevent them calling us in the first place?) Do we need to create a UI to see / monitor it? Can we quickly search the data and access the details for the specific issue in question?

If anyone has any experience of creating (or better still bolting on) such functionality, I'd be very interested to hear of experiences etc.

The tech stack we're using is C++ (backend) + simple SQL database and JavaScript / web front end.

However, I'm not so bothered about tech solutions as about which solution is (a) most effective for the least effort, (b) most maintainable and (c) adds most value in terms of reducing the time spent turning around enquiry tickets.

Why not just avoid purging the logs? Keep them somewhere longer. — Frank Hileman, Feb 27 '17 at 19:04
Two reasons really. Firstly, the logs take up hundreds of megabytes of data and we are under pressure to reduce the amount of disk space being used. Secondly, the logs contain a lot more than just these messages and it is not always easy to find what you're looking for. Some of this can be improved by better referencing in existing logs and adding logging where it is not currently there; however, our experience has been that trawling through logs is not an efficient / quick way to identify what happened. — Component 10, Feb 27 '17 at 19:12
Are you using a logging framework? If so, which one? Some logging frameworks allow different messages, or types of messages to be redirected to different files. You may have to alter the code that writes the messages of interest. It might be worth checking to see if that sort of feature is supported. I think in Log4j this is called "Routing", not sure if that's a standard name for it. — FrustratedWithFormsDesigner, Feb 27 '17 at 19:37
Hundreds of Megabytes? Seems to be not enough to matter. Why not just buy a bigger disk? — Christian Sauer, Feb 28 '17 at 07:38
FrustratedWithFormsDesigner: Yes, that's a possibility. The client has their own proprietary logging and it may be possible to hook into that to perform actions on certain log entries. I'm a little concerned by the fragility of the link (i.e. if someone changes the log entries without realising their significance, causing the triggering to silently fail). — Component 10, Feb 28 '17 at 08:18
ChristianSauer: because (a) that's not within my control and (b) it doesn't address the issue of how we capture these specific problems in an easily accessible form. I've considered using something like Splunk to index the log files and create some custom searches but I've found that log indexing products tend to need regular maintenance to keep them working well and I'd rather reduce the support burden, not add to it. — Component 10, Feb 28 '17 at 08:25

score 3 · Answer 1 · answered Feb 27 '17 at 18:11

I would argue that this calls for a new service which is not intermingled with logging, since it seems to serve a different purpose. If you want to keep the logging, you could throw a high-throughput solution at this problem, for example the ELK stack. This would allow you to store nearly all logs while keeping them very searchable, but has the drawback of needing more maintenance.
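As a rough illustration of keeping this separate from logging, here is a minimal C++ sketch in which the batch loop reports skipped actions through its own interface and then carries on. All names (`OutcomeRecorder`, `InMemoryRecorder`, `Action`, `run_batch`) are hypothetical, and the in-memory recorder only stands in for whatever storage a real service would use.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical interface: the batch code reports skipped actions here,
// independently of whatever it writes to the log.
class OutcomeRecorder {
public:
    virtual ~OutcomeRecorder() = default;
    virtual void skipped(const std::string& portfolio_id,
                         const std::string& action,
                         const std::string& reason_code) = 0;
};

// Simple in-memory implementation; a real one might insert into a SQL table.
class InMemoryRecorder : public OutcomeRecorder {
public:
    struct Entry {
        std::string portfolio_id;
        std::string action;
        std::string reason_code;
    };

    void skipped(const std::string& portfolio_id,
                 const std::string& action,
                 const std::string& reason_code) override {
        assert(!reason_code.empty());  // a skip must always carry a reason
        entries.push_back({portfolio_id, action, reason_code});
    }

    std::vector<Entry> entries;
};

// Hypothetical action: apply() returns an empty string on success,
// or a reason code if it gave up on this portfolio.
struct Action {
    std::string name;
    std::function<std::string(const std::string& portfolio_id)> apply;
};

// Straight-through processing: a skipped action is recorded and the loop
// moves on. Returns the number of actions that were actually applied.
inline int run_batch(const std::vector<std::string>& portfolios,
                     const std::vector<Action>& actions,
                     OutcomeRecorder& recorder) {
    int applied = 0;
    for (const auto& portfolio : portfolios) {
        for (const auto& action : actions) {
            const std::string reason = action.apply(portfolio);
            if (reason.empty()) {
                ++applied;
            } else {
                recorder.skipped(portfolio, action.name, reason);
            }
        }
    }
    return applied;
}
```

Because the recorder is called explicitly rather than derived from log text, rewording a log message cannot silently break the capture.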

Christian Sauer

Thanks. I really like the idea of this, although it might be asking a lot of the client management to adopt the ELK stack just for this use case. I will put the suggestion forward though, as it may have much greater benefits on a holistic level. – Component 10 Feb 28 '17 at 08:28
@Component10 They could use docker, simplifies such deployments a lot. – Christian Sauer Feb 28 '17 at 14:59
