Bug once in a while, but high priority

Question

I am working on a CNC (computer numerical control) project which cuts shapes into metal with the help of a laser.

Now my problem is that once in a while (1-2 times in 20-odd days) the cutting goes wrong, i.e. not according to what is set.

This causes losses, so the client is not very happy about it.

I tried to find the cause of it by:

Including log files
Debugging
Repeating the same environment.

But it won't repeat.

A pause and continue operation makes it run smoothly again, without the bug reappearing.

How do I tackle this issue? Should I state it as a Hardware Problem?

Welcome to the wonderful world of the [heisenbug](http://en.wikipedia.org/wiki/Heisenbug) *8') — Mark Booth, Apr 03 '12 at 10:14
When you say it happens 1 to 2 times in 20 days, does this mean that it takes about 20 days for it to appear or it sometimes appears after day 1, sometimes day 3 etc... — Dunk, Apr 03 '12 at 18:57
@Dunk there is no specific timing to it, but never appeared in a week twice so far. — Shirish11, Apr 04 '12 at 04:33
@Shirish - I was leaning towards a clock overflow problem not being handled properly which I've seen a couple of times on systems whose problem seems to occur every so many days and upon further inspection, exactly every so many days (or multiple thereof). — Dunk, Apr 10 '12 at 18:52
What is happening while the system is paused? What memory/counters/hardware are still changing? What about when you continue? It seems like whatever changes while you do those operations is a clue to the cause of the problem. — Dunk, Apr 10 '12 at 19:01
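To illustrate the clock-overflow hypothesis raised in the comments above, here is a minimal sketch. It assumes a free-running 32-bit millisecond tick counter; nothing in the question confirms that this system has one, and the function names are hypothetical. It shows how long such a counter runs before wrapping and why a naive subtraction breaks across the wrap.

```python
TICK_BITS = 32                    # assumed counter width, not known for this machine
TICK_MODULUS = 1 << TICK_BITS
SECONDS_PER_DAY = 86_400


def days_until_wrap(bits: int, ticks_per_second: int = 1000) -> float:
    """Days a free-running counter of the given width runs before wrapping."""
    return (1 << bits) / ticks_per_second / SECONDS_PER_DAY


def elapsed_naive(now: int, start: int) -> int:
    """Buggy: the result goes negative once `now` has wrapped past zero."""
    return now - start


def elapsed_wrap_safe(now: int, start: int) -> int:
    """Correct as long as the true interval is shorter than one full wrap."""
    return (now - start) % TICK_MODULUS


start = TICK_MODULUS - 100        # timestamp taken 100 ticks before the wrap
now = 50                          # timestamp taken 150 ticks later, after the wrap
print(elapsed_naive(now, start), elapsed_wrap_safe(now, start))
print(round(days_until_wrap(32), 1), round(days_until_wrap(31), 1))
```

If the fault times in the logs turn out to be spaced at multiples of such a period, that would support the hypothesis; irregular spacing would count against it.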

score 25 · Accepted Answer · edited Apr 12 '17 at 07:31

Workarounds

As ChrisF suggests, the pragmatic short term solution may be to use the pause and resume trick, but you have to talk to your customers to know what your priorities should be. For example:

If the fault trashes a £1000 part or causes 4 hours of downtime once a week, while the pause-resume fix reduces production by 1%, they will probably prefer the fix right now.
If the fault trashes a £1 part or causes 4 minutes of downtime once a week, but the pause-resume fix reduces production by 1%, they will probably prefer to wait for a fix which doesn't affect production rate.

Having worked in the laser micro-machining industry for many years, I know just how much pressure you can be under to optimise the process and make your machine produce as many parts per hour as is possible, so either way you are going to be under pressure to fix the problem properly.

Logging

In my experience, the only way to effectively track down a Heisenbug is copious logging. Log everything in and around the part of the code which could be responsible for the error. Learn how to read your log files effectively, and make sure you are monitoring the following error on your motors (are your stages moving where they should, when they should?). Look at the memory usage on the machine: is a memory leak causing a critical process to be starved?

Make sure you are logging user actions too. Are you sure that the operator isn't hitting the emergency stop so they can pop out for a shifty cigarette break while it's being fixed? I've seen this happen!
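One way to log copiously without drowning in files is an in-memory "flight recorder" that keeps only the most recent events and is dumped when something looks wrong. The sketch below is an illustration under assumptions: the class, the `check_axis` helper, the field names and the 5 µm limit are all hypothetical, not taken from the asker's system.

```python
import collections
import json
import time


class FlightRecorder:
    """Keeps the last `capacity` events in memory; older ones are discarded."""

    def __init__(self, capacity: int = 1000):
        self._events = collections.deque(maxlen=capacity)

    def record(self, source: str, **fields) -> None:
        self._events.append({"t": time.monotonic(), "source": source, **fields})

    def dump(self) -> list[str]:
        """Return the buffered events as JSON lines, oldest first."""
        return [json.dumps(event, sort_keys=True) for event in self._events]


FOLLOWING_ERROR_LIMIT_UM = 5.0  # hypothetical tolerance, in micrometres


def check_axis(recorder: FlightRecorder, axis: str,
               commanded_um: float, actual_um: float) -> bool:
    """Log one motion sample; return True if the following error is too large."""
    error = actual_um - commanded_um
    recorder.record("motion", axis=axis, commanded_um=commanded_um,
                    actual_um=actual_um, following_error_um=error)
    return abs(error) > FOLLOWING_ERROR_LIMIT_UM


recorder = FlightRecorder(capacity=3)
recorder.record("operator", action="pause")      # user actions go in too
check_axis(recorder, "X", 100.0, 101.0)
if check_axis(recorder, "X", 200.0, 210.0):      # 10 um error exceeds the limit
    print("\n".join(recorder.dump()))            # write the recent history out
```

Because operator actions and motion samples share one buffer, the dump shows what the user did just before the stage went off track.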

Statistical analysis

Also, look for correlations between scribing certain patterns and the bug being triggered more or less often. If you can find patterns that trigger the problem more frequently (or never trigger it) these may point to your problem.

Try to make patterns that trigger the problem even more frequently. If you can find a way to trigger the problem reliably then you are half the way to a solution.
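Looking for correlations between patterns and faults can be as simple as tallying the job log. The sketch below assumes each job has been reduced to a `(pattern_name, faulted)` pair; that record shape and the sample pattern names are made up for illustration.

```python
from collections import Counter


def fault_rate_by_pattern(jobs):
    """Rank patterns by fault rate.

    `jobs` is an iterable of (pattern_name, faulted) pairs from the job log.
    Returns [(pattern, (rate, faults, runs)), ...], highest rate first.
    """
    runs, faults = Counter(), Counter()
    for pattern, faulted in jobs:
        runs[pattern] += 1
        if faulted:
            faults[pattern] += 1
    rates = {p: (faults[p] / runs[p], faults[p], runs[p]) for p in runs}
    return sorted(rates.items(), key=lambda item: item[1][0], reverse=True)


# Illustrative data only, not real measurements.
sample_jobs = [("spiral", False), ("spiral", True), ("grid", False), ("grid", False)]
for pattern, (rate, n_faults, n_runs) in fault_rate_by_pattern(sample_jobs):
    print(f"{pattern}: {n_faults}/{n_runs} faulted ({rate:.0%})")
```

With only one or two faults in 20 days the rates will be statistically weak, which is why the raw counts are kept next to the rate; the ranking is mainly useful for choosing which patterns to run more often.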

Other options

Finally, don't be quick to blame the hardware, but never assume that it's perfect. Many times I've been blamed for problems which turned out to be electrical or mechanical in nature, so you always have to have that at the back of your mind.

Even though you may not normally have access to the machine, remember that some problems can only be efficiently solved on the machine. Sometimes a few days on-site can be worth weeks via remote desktop and months off-line completely. If you run out of off-line options, don't be afraid to propose a site visit, they can only say no.

You might also want to look at the questions and answers to "What do you do with a heisenbug?" and "What to do with bugs that do not repro?", but these might not be so useful for your situation.

To add to my problem, I don't have the hardware at my disposal, and the client is not educated enough to understand these programming terms, so staying connected to his system remotely is not possible. BTW, thanks for the advice; I will try a workaround. — Shirish11, Apr 03 '12 at 13:17

score 6 · Answer 2 · answered Apr 12 '13 at 15:03

I'm going to make an off-the-wall suggestion.

Go to the factory manager and ask to see the power line monitor records for that tool, or that area, for the times when the malfunctions occurred. Also ask him if there was any welding, or any other unusual activity, at around those times.

Several decades ago, my father was having a hell of a time with a minicomputer that was crashing for no reason at all. They called the manufacturer's customer rep.

The rep came into their office, in the factory area, and plugged a voltmeter into the wall, next to the mini, and then said "Watch this."

A few minutes later, the voltmeter suddenly sagged, significantly, then came back. The rep said "That was him striking his test arc. Wait a minute." Shortly after that, the voltmeter sagged again, and this time it stayed sagged.

The rep said "That's your problem. You've got a guy welding on the factory floor, and he's on the same power leg you are. I saw him setting up as I was walking in."

They had to run a completely separate power feed to the office.

Reminds me of this: https://thedailywtf.com/articles/that-70-s-paper-mill — cst1992, Sep 18 '17 at 11:59

score 4 · Answer 3 · answered Apr 03 '12 at 09:56

The problem is a real one with real consequences for the user - i.e. ruined work etc. so it needs fixing. However, it doesn't have to be fixed "properly". You state:

A pause and continue operation makes it run smoothly again, without the bug reappearing.

In that case just do this. The customer will be happy that they are not wasting material on defective runs even if normal runs take a couple of seconds longer.

Obviously in the long term you might need to fix this "properly" but for the time being cut your losses, go with the workaround and get onto something else.

score 4 · Answer 4 · answered Apr 03 '12 at 16:25

I had a bug in a game that happened only 1 time in a billion. Fortunately this meant I was seeing it every 15 to 30 minutes, but stepping through the code in the debugger was not going to work. I ended up putting in debug messages. They needed to use fancy if-statements because I only wanted something when there was a problem. In most cases the debugging code was repeating calculations in the regular code but using different techniques. The repeats did not have to be precise. If I knew a number should always be under 10,000 and it seemed to hit 150,000 on occasion, I'd just check for a value over 100,000. Each time the bug occurred, I'd study my results, devise more elaborate debugging messages (or more precisely, more elaborate checks to see if I should display a message), and wait for the problem to arise again.
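A minimal sketch of that kind of conditional check might look like the following. The thresholds are the ones mentioned above; the `checked` helper and the context fields are hypothetical names chosen for illustration.

```python
import logging

log = logging.getLogger("sanity")

EXPECTED_MAX = 10_000       # the value should normally stay under this
ALARM_THRESHOLD = 100_000   # deliberately loose: only fire on gross errors


def checked(name, value, **context):
    """Return `value` unchanged; log the context only when it is wildly off."""
    if abs(value) > ALARM_THRESHOLD:
        log.error("%s out of range: %r (expected <= %d) context=%r",
                  name, value, EXPECTED_MAX, context)
    return value


# Wrap a suspicious intermediate result; normal values produce no output.
speed = checked("speed", 9_000, frame=41)
speed = checked("speed", 150_000, frame=42)   # this one is logged
```

Since the helper returns its argument, it can be wrapped around existing expressions and refined between occurrences of the bug without restructuring the surrounding code.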

Your cycles are going to be a lot longer than mine were, but you will eventually close in on the problem. I do hope you can find the solution by some other, faster method, but this will catch it eventually if nothing else does, and will give you a sense that you are doing something until you come up with a better idea.

(In case it's helpful, I finally solved my problem by cleaning up the few lines of code I finally identified as the problem. I'll swear there was nothing wrong with them, but I think both the optimizer and the CPU were reordering instructions for performance, and I think once in a while they were taking chances to get a bit of extra speed. Even a single core multi-processes these days, and I think every great once in a while a register got read before it was written to. I switched all calculations to work with local variables. "Instance field" values were moved to local variables right at the start, and the local values were moved back only at the very end, inside synchronization blocks. And I used a local value for the method return value rather than the "instance field" I had been using.)

+1 for sanity checking and the iterative improvement of logging messages to converge on the root of the problem. — Mark Booth, Apr 03 '12 at 17:17

score 1 · Answer 5 · answered Apr 03 '12 at 11:23

Rule number one in debugging: you need a reproducible scenario.

If you don't have one, you should work on that first. Can you reproduce that bug in some kind of "simulation mode" of the machine, where no metal is actually cut? This seems to make sense here. Can you run several different cutting programs quickly and automatically, simulating the process of 20 days in a few minutes? That may increase the probability of the issue showing up.

Then, when you have such a scenario, the next step is to gather as much information as possible and actually start debugging.
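As a rough illustration of a "simulation mode", the sketch below puts the motion hardware behind a small interface so the control logic can be run many times against a stand-in. Everything here (`Stage`, `SimulatedStage`, `run_program`, `soak_test`) is a hypothetical name, and the stand-in moves instantly, so it can only exercise software logic, not timing or electrical behaviour.

```python
import random
from typing import Protocol


class Stage(Protocol):
    """The minimal interface the control logic needs from the hardware."""

    def move_to(self, x_um: float, y_um: float) -> None: ...
    def position(self) -> tuple[float, float]: ...


class SimulatedStage:
    """Stand-in for the real motion hardware: moves instantly, cuts nothing."""

    def __init__(self) -> None:
        self._pos = (0.0, 0.0)

    def move_to(self, x_um: float, y_um: float) -> None:
        self._pos = (x_um, y_um)

    def position(self) -> tuple[float, float]:
        return self._pos


def run_program(stage: Stage, path):
    """Hypothetical control logic under test: visit each point, verify arrival."""
    missed = []
    for target in path:
        stage.move_to(*target)
        if stage.position() != target:
            missed.append(target)
    return missed


def soak_test(runs: int, seed: int = 0) -> int:
    """Run many random cutting programs; return how many had a missed point."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(runs):
        path = [(rng.uniform(0, 1e5), rng.uniform(0, 1e5)) for _ in range(50)]
        if run_program(SimulatedStage(), path):
            failures += 1
    return failures


print(soak_test(200))
```

A fixed seed keeps each run repeatable, so any failure the harness does find can be replayed exactly.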

Doc Brown

**simulating the process of 20 days in a few minutes** that's not possible. I have to consider the hardware. – Shirish11 Apr 03 '12 at 11:58
I've never come across a *heisenbug* that could be reproduced using a *simulation mode*. The problems are almost always in the components which are *simulated out* or the coupling between them. As I said, if you can reliably reproduce the problem, you are half the way to a solution. – Mark Booth Apr 03 '12 at 14:15
@Shirish: "simulating the process in a few minutes" may be one extreme, but waiting 20 days for the bug to occur and cut a lot of metal to let the bug pop up is obviously the other extreme. Perhaps there's something possible in between. – Doc Brown Apr 03 '12 at 14:43
@Shirish - if you haven't abstracted away the hardware so that it becomes possible to simulate it, that means the design is lacking. It also means that your system could not have been adequately tested. Thus, it is no surprise that the system has issues. – Dunk Apr 03 '12 at 18:52
@Dunk - Have you ever worked in the laser scribing industry? You don't always have the luxury of a simulator and even if you had a good one, it wouldn't be cost effective to fully simulate all of the intricacies of a complex mechatronic system. Following error, velocity profiling, pulse tracking all at sub-micron precision, interactions between soft & hard real-time systems, Takt time pressure - simulating that lot in real-time would take a cluster, let alone doing it in 1/10,000 of real-time. Faster/better/cheaper - you can rarely have all three, so please try not to be so judgemental. – Mark Booth Apr 05 '12 at 21:29
@Mark - I've worked on more than my share of embedded real-time systems. So I am fully aware of what is required, including motor control. I am also fully aware of all the projects where I was brought in to solve the unsolvable problems and was told how it can't be simulated. Frequently, I have solved the problem by adding stubs and simulators. In almost all cases, it doesn't need to be accurate at the timing levels (as you imply) since it is almost always software logic problems and not hw control problems. – Dunk Apr 10 '12 at 18:48
@Dunk - The fact that it **can** be done, doesn't make it **cost-effective** to do so. When your system is in orbit, it's almost certainly worth it, when it's just a long-haul flight away, it probably isn't. Please don't judge people as **incompetent** for the **business decisions** they have made (or had made for them). As to whether accurate timing is *necessary*, it certainly is with laser scribing systems - a 50us timing error when moving a stage at 2m/s, results in a 100um position error, if you are trying to get two 25um lines to meet, you've just missed and the *substrate is junk*. – Mark Booth Apr 11 '12 at 10:01
@Mark - my point is that I have worked with too many people who don't anticipate debugging while developing their sw. They say it is because they don't have time or can't be done since it talks to hardware. Meanwhile, the company ends up losing bonuses and has payback clauses that end up costing the company lots of money because they run into those impossible to reproduce problems. The funny thing is that it is always the same developers who run into those types of problems and I am usually dragged off my current project to figure out their mess. It sounds like they didn't take debugging into account in this particular project. So that is incompetence in my book. – Dunk Apr 13 '12 at 17:00

score 1 · Answer 6 · answered Apr 03 '12 at 17:31

Not sure what language this is written in, but if I experience erratic bugs in my code (C++), I will use a tool like valgrind or cppcheck to make sure nothing is going wrong memory-wise.

Chance

score 0 · Answer 7 · answered Apr 12 '13 at 19:25

An extension on RalphChapin's answer:

Over the years I have had to hunt a fair number of bugs that only showed themselves on systems I couldn't duplicate because of attached hardware.

In addition to logging like crazy, one other thing I found useful was putting information on the screen showing where the code was and the values of some relevant variables. When the problem showed up, even the factory floor workers could read me the information.

It usually took a few rounds of refinement to pin it down exactly but it was very effective.
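A status line of that kind can be very small. The sketch below is only an illustration: the `format_status` helper and the example fields are hypothetical, and a real machine would write the string to its operator display rather than to standard output.

```python
def format_status(location: str, **values) -> str:
    """Build a one-line status string: code location plus key variables."""
    pairs = "  ".join(f"{name}={value}" for name, value in sorted(values.items()))
    return f"[{location}] {pairs}"


# Update the operator display at each interesting step of the cut.
print(format_status("cut_segment:step 3", x_um=1200, laser_on=True))
```

Sorting the fields keeps them in the same position on screen from one update to the next, which makes the line easier to read aloud over the phone.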
