What's the best practice to do SOA exception handling?

Question

My colleague and I are having an interesting debate about how to handle SOA exceptions:

On one side, I support what Juval Lowy said in Programming WCF Services 3rd Edition:

As stated at the beginning of this chapter, it is a common illusion that clients care about errors or have anything meaningful to do when they occur. Any attempt to bake such capabilities into the client creates an inordinate degree of coupling between the client and the object, raising serious design questions. How could the client possibly know more about the error than the service, unless it is tightly coupled to it? What if the error originated several layers below the service—should the client be coupled to those low-level layers? Should the client try the call again? How often and how frequently? Should the client inform the user of the error? Is there a user? By having all service exceptions be indistinguishable from one another, WCF decouples the client from the service. The less the client knows about what happened on the service side, the more decoupled the interaction will be.

On the other side, here's what my colleague suggests:

I believe it’s simply incorrect, as it does not align with best practices in building a service oriented architecture and it ignores the general idea that there are problems that users are able to recover from, such as not keying a value correctly. If we considered only systems exceptions, perhaps this idea holds, but systems exceptions are only part of the exception domain. User recoverable exceptions are the other part of the domain and are likely to happen on a regular basis. I believe the correct way to build a service oriented architecture is to map user recoverable situations to checked exceptions, then to marshall each checked exception back to the client as a unique exception that client application programmers are able to handle appropriately. Marshall all runtime exceptions back to the client as a system exception, along with the stack trace so that it is easy to troubleshoot the root cause.

I'd like to know what you think about this.

score 10 · Accepted Answer · answered Jun 23 '11 at 02:58

User recoverable exceptions should not be "exceptions". Exceptions are for exceptional circumstances. Transposing a few letters in a form field is something that you should expect and plan for.

Part of the impetus behind a "Service-Oriented Architecture" is that services are reusable. Sure, it might be a client sending messages to it... or it might be another service, or an orchestration engine, or an event subscriber, or an automated task or batch job. These actors can't possibly be able to reliably recover from a fault, no matter how much detail you put into it. In many cases they may even be using one-way messaging (e.g. MSMQ), in which case you're not even allowed to send a fault back; there's simply no channel for it.

Once a service has made the decision to send back a fault message, assuming that the originator can actually receive it, then all the originator can sensibly do is roll back the transaction it's in - if it was smart enough to enlist in one.

Juval is exactly right. Marshaling fault messages into client exceptions is fine when you've exhausted all other options (i.e. unhandled exception), but there is no point in the service trying to provide all kinds of detail. None. Users will not read or understand the error message, and if you think having a stack trace is a benefit from the user perspective then you don't understand the first thing about usability.

Microsoft actually tells you to put exception detail in faults. But don't. Please don't. It just encourages you to be lazy and fault when you really should be handling the errors. I've been down that road and it is one of never-ending pain and misery. It's especially pernicious in WCF because faulting permanently invalidates the service proxy, and it's actually very difficult to design client apps to recover from this, particularly if you're following other "best practices" and doing dependency injection.

What you should - nay, must be doing is logging all errors on the service side, generally into persistent storage, and sending notifications as bug reports. More sophisticated, service-bus architectures will even have an error queue which holds all of the original messages that caused the errors - but at the very least, you want the errors themselves. You want them - not your users. Don't rely on them to give you the stack traces, because if you do, then you have already failed them.

"User recoverable exceptions" simply do not exist in an SOA. There is no such thing because you can't know in advance who the "user" is going to be. If an exception is recoverable then it should be part of the message - for example, in XML form:

<customerUpdateResponse customerId="123" status="notUpdated">
    <validationErrors>
        <requiredFieldMissing field="fullName"/>
        <maxLengthExceeded field="phone" maxLength="30" actualLength="45"/>
    </validationErrors>
</customerUpdateResponse>

This is just off the top of my head, but hopefully you get the idea; if an operation can fail for known, documented reasons then that "failure" becomes part of the specification. In this case, the message is sending back an event saying what happened, and the client application can interpret this data appropriately. The important thing is that it is part of the contract, not some unexpected "stop the presses" error.
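To make the idea above a bit more concrete, here is a minimal Python sketch of the same response shape. The names (`CustomerUpdateResponse`, `ValidationError`, `update_customer`) and the 30-character phone limit are assumptions taken from or modelled on the XML example, not part of any real API. The point is only that expected failures come back as data in the response rather than as raised exceptions.

```python
from dataclasses import dataclass, field
from typing import List

MAX_PHONE_LENGTH = 30  # assumed limit, matching the XML example


@dataclass
class ValidationError:
    kind: str          # e.g. "requiredFieldMissing"
    field_name: str    # the request field the error refers to
    detail: dict = field(default_factory=dict)


@dataclass
class CustomerUpdateResponse:
    customer_id: str
    status: str        # "updated" or "notUpdated"
    validation_errors: List[ValidationError] = field(default_factory=list)


def update_customer(customer_id: str, full_name: str, phone: str) -> CustomerUpdateResponse:
    """Report expected validation failures in the response instead of raising."""
    errors = []
    if not full_name.strip():
        errors.append(ValidationError("requiredFieldMissing", "fullName"))
    if len(phone) > MAX_PHONE_LENGTH:
        errors.append(ValidationError(
            "maxLengthExceeded", "phone",
            {"maxLength": MAX_PHONE_LENGTH, "actualLength": len(phone)},
        ))
    if errors:
        return CustomerUpdateResponse(customer_id, "notUpdated", errors)
    # A real service would persist the change here.
    return CustomerUpdateResponse(customer_id, "updated")
```

A caller, whether it is a UI, another service or a batch job, can inspect `status` and `validation_errors` and decide for itself what to do; nothing is thrown for a mistyped field.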

Now I know that WCF lets you use fault contracts and so on, but I don't see the point; it just adds complexity where it's not really needed. SOAP faults are, honestly, a pain in the butt to deal with from any angle.

As mentioned earlier, you also have to carefully plan for the case where you can't send any response. Fledgling "SOAs" with a smattering of web services tend to be predominantly RPC style, but that's actually a poor strategy for designing a robust high-performance architecture. The killer feature of an SOA, in my opinion at least, is publish-subscribe, which allows you to totally decouple the services themselves and only ever share messages. But this comes at a cost: you have to dispense with two-way communication. If a service wants to fault after consuming an event, well, great, but nobody's going to be listening. Which means that proper logging and exception notification is really, really important.

A good overall strategy for the second case is to define a generalized message type for unrecoverable errors (technically you could just use the FaultException) and install a component in the pipeline which forwards all faults to a fault queue, thus (a) ensuring that you don't lose any, and (b) collecting them all into a central location, which will make your life a whole lot easier when you have 30 different web services on 10 different servers. It's really very easy to set up a global exception handler in WCF - just attach to the Faulted event of the ServiceHost. You can also install your own IErrorHandler to do all of this before the fault ever happens - your choice.
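As a rough, language-neutral illustration of the "forward all faults to a fault queue" idea (this is not WCF code, and it is not how `IErrorHandler` is wired up), here is a small Python sketch. The `forward_faults` decorator, the `ServiceFault` type, the `charge_order` operation and the in-memory `fault_queue` are all hypothetical stand-ins; a real system would use a durable queue.

```python
import functools
import queue
import traceback

fault_queue = queue.Queue()  # stand-in for a durable, central fault queue


class ServiceFault(Exception):
    """Generic, detail-free error that is safe to return to callers."""


def forward_faults(operation):
    """Wrap a service operation so every unhandled error lands in the fault queue."""
    @functools.wraps(operation)
    def wrapper(message):
        try:
            return operation(message)
        except Exception as exc:
            # Keep the full detail on the service side, including the
            # original message that triggered the failure.
            fault_queue.put({
                "operation": operation.__name__,
                "message": message,
                "error": repr(exc),
                "stack_trace": traceback.format_exc(),
            })
            # Callers only ever see the generic fault, never the stack trace.
            raise ServiceFault("The request could not be processed.") from None
    return wrapper


@forward_faults
def charge_order(message):
    # Hypothetical operation that fails on bad input.
    return 100 / message["quantity"]
```

Calling `charge_order({"quantity": 0})` raises only the generic `ServiceFault`, while the queued record keeps the original message and the stack trace for developers and support staff.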

But in summary: Instrument your systems so that you can resolve serious issues proactively and don't fault for recoverable errors. To the end user, downtime is downtime; make the exception details discoverable for developers and support staff but don't leak them to users.

The first phrase is just magic, I fully agree – Davi Fiamenghi, Feb 17 '12 at 04:33

score 0 · Answer 2 · answered May 09 '12 at 18:13

A service that does not provide any status on its health or the execution of a transaction is like a black box - the service consumer and the administrators who monitor the environment cannot tell if things are working.

There are several techniques for instrumenting services for monitoring. You can implement a heartbeat whereby the service sends an "I am alive" message that can be picked up by a monitor. This is a simple task but does not really tell you if the service is working - only that it is running and has the ability to send a message. This, however, is more than an OS-level monitor will tell you. You can set up the monitor to automatically register the service for monitoring when it sees the heartbeat for the first time.
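A minimal sketch of the monitor side of such a heartbeat, assuming heartbeats arrive as a service name plus a timestamp. The `HeartbeatMonitor` class and its method names are made up for illustration; the time is passed in explicitly so the logic stays easy to test.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per service and auto-registers new services."""

    def __init__(self, timeout_seconds: float):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # service name -> timestamp of last heartbeat

    def record_heartbeat(self, service_name: str, now: float) -> bool:
        """Store the heartbeat; return True if the service was newly registered."""
        is_new = service_name not in self.last_seen
        self.last_seen[service_name] = now
        return is_new

    def silent_services(self, now: float):
        """Return services whose last heartbeat is older than the timeout."""
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

As noted above, a service that is missing from `silent_services` is only known to be running and able to send messages, not to be working correctly.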

You can also develop a monitoring interface I call a "ping" interface. You send the ping interface a message and the interface can do some internal verifications of functionality and respond. This tells you more than just that the service is up and running. You can also build in support for a "synthetic transaction", where you send specific transactions to a set of services and they execute the transaction but with no impact on the business - like updating a test account. But to tell if a service is running properly in the context of all business transactions, you have to develop a framework for application-level monitoring.
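One possible shape for such a ping operation, sketched in Python. The `ping` function and the check names in the usage note are hypothetical; each check is assumed to be a zero-argument callable that returns `True` when that piece of functionality is healthy.

```python
def ping(checks):
    """Run each named internal check and summarise the results.

    `checks` maps a check name to a zero-argument callable that returns
    True when that piece of functionality is healthy.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:
            # Report only the error type; full detail belongs in service-side logs.
            results[name] = f"error: {type(exc).__name__}"
    healthy = all(result == "ok" for result in results.values())
    return {"healthy": healthy, "checks": results}
```

A service might call it as `ping({"database": can_query_db, "cache": can_reach_cache})`, where both callables are supplied by the service itself. A synthetic transaction could be plugged in as just another check that exercises a test account.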

Application-level monitoring requires status data about each business transaction. The ability to monitor an application in a business or transactional context requires a monitoring interface within the service to send messages detailing the transaction's status specific to the service invocation. This requires each service to send a status message at critical steps in the business transaction. You can then build a real-time viewer to correlate status messages (based on the semantics of the message - e.g. transaction ID) with the services within the composite application. This provides an end-to-end view of the business transaction for SLA management, fault tracing and problem determination.
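The correlation step could look roughly like the following Python sketch. It assumes each status message is a dictionary with `transaction_id`, `service`, `status` and `timestamp` keys, and that a status of `"failed"` marks a fault; those field names and values are assumptions for illustration, not a real message format.

```python
from collections import defaultdict


def correlate(status_messages):
    """Group status messages by transaction ID into per-transaction timelines."""
    by_transaction = defaultdict(list)
    for msg in status_messages:
        by_transaction[msg["transaction_id"]].append(msg)

    timelines = {}
    for tx_id, msgs in by_transaction.items():
        # Order the steps by time so the end-to-end path is readable.
        ordered = sorted(msgs, key=lambda m: m["timestamp"])
        timelines[tx_id] = {
            "steps": [(m["service"], m["status"]) for m in ordered],
            "failed": any(m["status"] == "failed" for m in ordered),
        }
    return timelines
```

A real-time viewer would apply the same grouping incrementally as messages arrive, but the batch version shows the idea: the transaction ID is what ties messages from different services into one end-to-end view.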

Here is a set of utilities that help implement this framework. Disclosure: I helped build them.
