Code and error handling strategies and FSMs

We’ve had a discussion in the office on and off over a few days about error handling in programmes. Neither my office mate nor I are completely happy with what’s available out there. The state of the art in imperative/OO languages appears to be stuck at exceptions. The new Go programming language eschews them, essentially going back to returning an error value alongside the normal result – which the caller then has to check.

The essential problem in programming is, of course, that code may follow a number of paths. Some of those paths are, at some level, the most common or desired paths (often, just 1 path). Other paths may be taken, e.g. because a resource was unavailable or because the requested action was inappropriate or unreasonable in some way. Of those other paths, some may still be reasonably commonly taken (e.g. a file not existing) and/or recoverable in some way. Other paths may be exceptional in some way, such that it is unlikely the programme can recover from having taken them. These paths converge at function/procedure call nodes, where they have to be dispatched to direct programme flow onward to the appropriate further path.

My own sense is that path selection syntax/support in programming languages tends to divide all error/return handling into 1 of 3 cases:

  1. The primary path(s)
  2. Recoverable error paths
  3. Unrecoverable error paths

What then are the ways to handle these cases? In languages such as C, errors generally must be handled after each function call, by checking the return value or a modified output parameter and acting on it. There are of course other ways, such as signals and exceptions.
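
For illustration, here is a minimal C sketch of what that per-call checking looks like (the helper function is made up; fopen/fgets/fclose and errno are the standard interfaces):

#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: print the first line of a file.
 * Every call's return value has to be checked before carrying on. */
int print_first_line (const char *path)
{
  char buf[256];
  FILE *f = fopen (path, "r");

  if (f == NULL) {
    /* possibly recoverable: a missing file may be expected */
    fprintf (stderr, "%s: %s\n", path, strerror (errno));
    return -1;
  }
  if (fgets (buf, sizeof (buf), f) == NULL) {
    fclose (f);
    return -1;
  }
  fputs (buf, stdout);
  return (fclose (f) == 0) ? 0 : -1;  /* even cleanup can fail */
}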

There is also another way, which I see much less often, but which has certain benefits: using Finite State Machines (FSMs) to give objects an idempotent error state. This allows control-flow in calling code to bunch operations together without having to worry about errors at each step, as the error handling can be done later, separately.

This blog tries to go over these methods, and their benefits and pitfalls. I’d also be really interested to hear about other error-handling strategies, especially better ones! 🙂

Signals

A small number of operating system or hardware level errors may be indicated through signals (e.g. SIGSEGV, SIGFPE, SIGBUS on Unix systems) – affecting all languages. The programme declares special functions, with restricted execution contexts (e.g. a separate, limited stack), to act as signal handlers. The operating system invokes the function in response to a signal. The signal handler can then do a (limited) amount of work to handle the error; typically it must then somehow notify or co-ordinate with the rest of the software to have the bulk of the error handling done. This may involve translating the signal into another form of error notification/handling (e.g. an error flag to be checked, an exception, etc.) within the normal execution of the software – that is, if the error can be handled at all. Often the response to a signal is simply to run a crash handler and die.
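
As a rough POSIX C sketch of such a crash handler (just an illustration – the handler below only uses async-signal-safe calls, and simply reports and dies):

#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Minimal "crash handler": translate the signal into a message and die.
 * Only async-signal-safe functions (write, _exit) may be used here. */
static void crash_handler (int sig)
{
  static const char msg[] = "fatal signal caught, giving up\n";

  (void) sig;
  write (STDERR_FILENO, msg, sizeof (msg) - 1);
  _exit (1);
}

static void install_crash_handler (void)
{
  struct sigaction sa;

  memset (&sa, 0, sizeof (sa));
  sa.sa_handler = crash_handler;
  sigemptyset (&sa.sa_mask);
  sigaction (SIGSEGV, &sa, NULL);
  sigaction (SIGFPE, &sa, NULL);
  sigaction (SIGBUS, &sa, NULL);
}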

As signal handlers run under constraints that limit them significantly in what they can do, they are less than popular these days.

Exceptions

In order to try to reduce the amount of explicit condition checking needed, some languages added support for exceptions and try/catch syntax. The idea being that error-handling code could be moved out of the primary path, so decluttering it.

Rather than switching control-flow explicitly in the primary path, as might be done with return values or modified parameters from functions, exceptions switch it implicitly. Blocks of error-handling code are associated with ranges of primary path code and with exception types. When an exception occurs, the runtime walks up the stack, looking for a (programme-counter range, exception type) entry that matches, and jumps to that exception block with the appropriate stack frame as context. This search may be costly.

The language syntax support often looks something like:

try {
  // primary path code
} catch (ex1) {
  // path after exception condition ex1
} catch (ex2) {
  // path after exception condition ex2
}

Caught conditions often correspond to recoverable or non-catastrophic error paths, with any remaining uncaught conditions being catastrophic, at least at this level of the programme. Uncaught exceptions are dispatched upward, potentially terminating the programme if not caught at any level.

This looks neat and tidy. It seems to provide a nice separation between the 3 classes of path. Unfortunately it’s not quite so simple. There are now different kinds of syntax for different kinds of programme flow, but the code “below” (returning values or raising exceptions) gets to choose which class will be used, and this will not always suit the code “above”, which must receive and dispatch the result. E.g. “File not found” may be exceptional in one context, but expected in another (and that can differ according to the programmer, the specifics of the calling code, or even runtime conditions). The code below decides that classification, and it becomes binding on the code above. The code above may have to implement error-handling code in its primary path, or wrap retry code around something the below code considers exceptional.

A further flaw is that there can be a large variety of exceptions. Indeed, in runtimes with dynamic code loading/binding, the code below may raise exceptions that did not even exist when the code above was written. This can make it difficult for the above code to anticipate in advance (particularly in the “checkable by tools” sense) what those exceptions might be. Partly due to this, some languages have added checked exceptions, which require the below code to explicitly declare some of the exceptions it can throw. However, this still does not prevent arbitrary unchecked exceptions being thrown, while complicating the interfaces. Checked vs unchecked exceptions also repeat the same problem: the below code gets to choose a classification – whether an exception is checked or not – that greatly impacts the above code, even though the above code might have different needs. This leads to things like code being bloated with catches for checked exceptions that do not have any (useful) handling code attached.

Another pitfall is that exceptions can occur at any point in the primary path code. This can make it hard to understand how control can flow from the primary code block to the exception blocks, particularly with regard to resources allocated and whether or not they are safely deallocated. This can lead to resources being “leaked” unwittingly, e.g. when exceptions occur during construction of composite objects (in many languages this means the destructor is not called, e.g. C++, Java, C#).

To recap, exceptions lead to the problem that there are now potentially 2 separate mechanisms affecting control flow. The calling and called code may disagree on which control-flow conditions are exceptional and which are normal or easily recoverable. The interactions of the 2 mechanisms can make it hard to grasp all the possible flows. This can lead to quite cluttered code, as in this (very contrived 😉 ) example:

bool retry;

do {
  retry = false;
  try {
     blah ();
     // below code thinks this is normal, we think it's bad
     if (!something ())
        throw exception ();
     if (something_else ()) {
        // not what we wanted, need to do some complicated
        // recovery
     }
     w = whatever ();
     // etc..
  } catch (ex1) {
    // actually, we can recover from this, but need to try again
    ...
    retry = true;
  } catch (ex2) {
    if (foo ()) {
      // possibly recoverable, do lots of complicated stuff
    }
    if (bar ())
      retry = true;
  } finally {
    // cleanup
  }
} while (retry);

Though the primary path may be simpler than otherwise, there are now multiple, interacting code-flow dispatching mechanisms. As a result, code using exceptions can end up being cumbersome to follow, if not error-prone.

Finite State Machine Control-Flow Branch Reduction

Reducing the number of branches that control flow can follow is useful. Exceptions achieve that to an extent – the primary path code certainly can look like it has fewer branches in it. The problem with exceptions, though, is that this is just a superficial reduction. The branches are still there, under the hood – an exception could be thrown at any point! Once you have to reason about those underlying branches (e.g. to ensure resources aren’t leaked), you may end up having to consider more control-flow branches than you would without exceptions.

An alternative strategy, applicable generally, is to construct your software so that error related control flow branches can be minimised. Imagine a “foo” object that has a number of operations, any of which could fail, and where some operations are only valid depending on the context (e.g. what other operations have already been carried out on the object).

With standard, return-argument-checking error-handling the control-flow for a series of operations might look like this:

if (!foo.twiddle ()) {
  // twiddle failure/error handling
}
if (!foo.swizzle ()) {
  // swizzle error handling
}

if (!foo.x (arg)) {
  // x setting error handling
}

if (!foo.frobnicate ()) {
  // frobnicate error handling
}
// etc.

Obviously, this could get tedious and error-prone (not a good property for error-handling code 😉 ). It’d be nice to be able to declutter the calling code, but without introducing hidden control flow hazards. We can do this, if we make the below/called code wrap its behaviour in a state machine, with an idempotent, non-functional error state:

foo {
   enum states { A, B, ERROR, } state = A;
   enum states prev_state = A;
   enum events { TWIDDLED, SWIZZLED, FROBNICATED, NEWX, INVALID, };
   enum events error_event = INVALID;
   fsm [states][events] = {
      [A][TWIDDLED] = A,
      [A][SWIZZLED] = B,
      [A][FROBNICATED] = B,
      [A][NEWX] = ERROR,
      [A][INVALID] = ERROR,

      [B][TWIDDLED] = ERROR,
      [B][SWIZZLED] = ERROR,
      [B][FROBNICATED] = B,
      [B][NEWX] = B,
      [B][INVALID] = ERROR,

      [ERROR][TWIDDLED] = ERROR,
      [ERROR][SWIZZLED] = ERROR,
      [ERROR][FROBNICATED] = ERROR,
      [ERROR][NEWX] = ERROR,
      [ERROR][INVALID] = ERROR,
   };

   fsm_event (events event) {
     if (state != ERROR) {
       prev_state = state;
       if (fsm[state][event] == ERROR)
         error_event = event; // remember which event took us to ERROR
     }
     return (state = fsm[state][event]);
   }

   twiddle () {
     if (fsm_event (TWIDDLED) == ERROR) return;
     // twiddle the foo
     if (/* some error condition */) {
        fsm_event (INVALID);
     }
   }
   frobnicate () {
     if (fsm_event (FROBNICATED) == ERROR) return;
     // frobnicate the foo
   }

   // etc..
}

Then, if desired, errors need not be handled in the caller immediately. This allows the caller to unconditionally bunch a number of operations on foo together. After they complete, the foo object’s state can be checked to see whether an error occurred and, if so, which operation caused it:

foo.twiddle ();
foo.swizzle ();
foo.x (arg);
foo.frobnicate ();

if (foo.state () != foo.states.ERROR)
  return;

switch (foo.error_event ()) {
  case foo.events.TWIDDLED:    // twiddle failure/error handling
  case foo.events.SWIZZLED:    //  etc..
  case foo.events.FROBNICATED: // ..
  // etc..
}

On the face of it, this doesn’t look too dissimilar from how exceptions might have helped re-structure the code. However, unlike exceptions, this approach does not introduce any additional, hidden control-flow branches in the above/calling code. The control-flow in the calling/above code is now very clear and simple.

The FSM table within the called code looks a bit tedious to write. However, it could be made shorter if you can initialise the default transition to ERROR (e.g. in C-like languages by making ERROR equal to 0 and using a fill function like memset, or by relying on default initialisation rules in the language for statically allocated objects), as only the non-error cases would then have to be filled in.
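
For example, in C (sticking with the hypothetical foo FSM above), making ERROR the zero value means any transition left out of the designated initialisers defaults to ERROR, so only the valid transitions need to be written down:

enum states { ERROR = 0, A, B, NUM_STATES };
enum events { TWIDDLED, SWIZZLED, FROBNICATED, NEWX, INVALID, NUM_EVENTS };

/* Unmentioned entries are zero-initialised, i.e. ERROR. */
static const enum states fsm[NUM_STATES][NUM_EVENTS] = {
  [A][TWIDDLED]    = A,
  [A][SWIZZLED]    = B,
  [A][FROBNICATED] = B,

  [B][FROBNICATED] = B,
  [B][NEWX]        = B,
};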

Ideally, a language would have concise, syntactical support for specifying the allowed state transitions. Also, FSMs can potentially be sanity-checked by tools or the compiler to ensure they have certain properties, if desired (e.g. that there is at least 1 path from a start state to the end states; that all states are reachable from the initial state; etc.). They can even be made part of the type system to facilitate this, which allows more powerful static checking, catching invalid uses of an object – a recently graduated PhD at our school (Iain McGinniss) worked on this for Java.
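
As a sketch of one such check (reusing the enums and fsm table from the snippet above), a unit test or start-up assertion could verify that every state is reachable from the initial state:

#include <stdbool.h>

static bool all_states_reachable (void)
{
  bool seen[NUM_STATES] = { false };
  enum states stack[NUM_STATES];
  int top = 0;

  seen[A] = true;
  stack[top++] = A;

  /* depth-first walk over the transition table */
  while (top > 0) {
    enum states s = stack[--top];

    for (int e = 0; e < NUM_EVENTS; e++) {
      enum states next = fsm[s][e];

      if (!seen[next]) {
        seen[next] = true;
        stack[top++] = next;
      }
    }
  }

  for (int s = 0; s < NUM_STATES; s++)
    if (!seen[s])
      return false;

  return true;
}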

Now, it could be argued that even with concise syntax for specifying the FSM, this could still lead to long, complex and/or otherwise hard to follow FSMs. This is certainly true. However, that complexity would be there regardless of whether the FSM is explicitly written down. Indeed, forcing oneself to explicitly write out the FSM for the object would mean such control-flow complexity has no place to hide. The programmer would be confronted with that complexity sooner rather than later, and be able to address it earlier in the design/coding process than otherwise. The programmer could rethink their approach to try to avoid the complexity. Having to think about all the possible states and events might also confront the programmer with combinations they would otherwise not have considered, allowing them to make the code more robust.

This is complexity which would be exposed to any user of that code and have to be dealt with, regardless of whether the programmer codifies it explicitly in an FSM table or not. The FSM approach at least may make it easier for the caller to deal with that complexity than otherwise. Overall, confronting the programmer with the intrinsic complexity of their code may be as much a benefit as a hindrance!

This FSM approach is one I’ve seen only rarely. Even then, it’s usually been in code where the FSM tables are specified by some network protocol. Probably the programmer didn’t intentionally programme in this style; it just followed from the protocol specification they were implementing. However, it can be applied to any kind of interaction between different bits of code, and it has some benefits that are worth considering.

What other nice methods are there for integrating error-handling into control-flow?

3 Comments »

  1. Informatimago said

    There’s a bug in the very contrived example: retry should be reset at the beginning of the loop, not outside of it.

    • Paul Jakma said

      Indeed. 🙂 Fixed. Thanks!

  2. Paul Jakma said

    Apparently Rust has some kind of interesting “conditions” and “traps”, to enhance error handling. Like a mix between exceptions (in so far as it searches upward for matching traps) and callbacks (the matching error handling code is called from the context of where the error happens).

    Pointed out by Colin Perkins on twitter: https://twitter.com/csperkins/status/408610124463632384
