Layers of Failure, With a Side of Bacon

By, Julie Pitt

Remember when we talked about action codes as a means for unambiguous, actionable protocol errors? You may have wondered why you need all this fancy stuff like action codes and error metadata, when there’s already a well-defined standard in HTTP status codes. It’s working for you, so why mess with a good thing?

Today we will dive into an example REST over HTTP application. I’ll show you why it’s important to separate failures in the transport layer from those in the application layer. Then I’ll give an example of how to do it.

Bacon is the New Coffee

I briefly considered making an example API that is all about coffee when I realized that there are not nearly enough bacon applications out there. Let’s start with a little story.

Greybird Labs is a fast-growing company with headquarters in a 5-story building. Many Greybird employees are bacon fanatics, so naturally, the company recently installed a bacon cooker on each floor. Since a bacon cooker is a thing I just made up, I can also say that it has a convenient REST API, which employees can call from their desks to start cooking bacon.

Bringing Home the Bacon

As a recent hire at Greybird, you’re in the midst of a massive refactor. All of a sudden, you can’t decide whether to use recursion or a for loop. “Aha!”, you exclaim (as I often do), “I need bacon!” Conveniently, you have created an alias called ‘bacon’ for the following command:

$ wget --post-file "order-bacon.json" \
> --header "X-Greybird-Auth: SSBMb3ZlIEJhY29u" `# authentication token` \
> --header "Content-Type: application/json" \
> http://bacon.greybirdlabs.com/v1/0/placeOrder `# not a real domain, so don’t try it`

The order-bacon.json file (i.e., the request body) contains:

{
 “baconType”: “BlackForest”,
 “numberOfPieces”: 3
}

The response indicates that all is well:

{
 “machineId”: 0
 “jobId”: 243,
 “baconType”: “BlackForest”,
“numberOfPieces”: 3
}

You’ve submitted your bacon job. Is it done yet? To find out, you type:

$ wget --header "X-Greybird-Auth: SSBMb3ZlIEJhY29u" \
> http://bacon.greybirdlabs.com/v1/0/status/243

In return, you see:

{
  “machineId”: 0
  “jobId”: 243,
  “status”: “cooking”
  “baconType”: “BlackForest”,
  “numberOfPieces”: 3
}

Your bacon order is still cooking. While you wait, read on.

Bacon as a Service (BaaS)

By now you’ve figured out that the bacon cooker API is pretty simple. It has two resources:

POST /v1/[machineId]/orderBacon
GET /v1/[machineId]/status/[jobId]

Each bacon cooker is assigned a machineId, which is the floor number the machine is on. Greybird’s engineering team thought it would be funny to use a 0-based index for floors, so machineId 0 is actually on the first floor. Each active order is assigned a jobId, which can be used to track the status of your order.

Bacon Foul

Now let’s think about what can go wrong when it comes to ordering bacon.

  • Credentials (i.e., X-Greybird-Auth header) missing or not authentic
  • Invalid URL path (i.e., no matching handler could be found)
  • Unanticipated error in the endpoint (probably a bug)
  • Server is busy and can’t take more requests
  • Order queue is full
  • No such machine ID
  • No such job ID
  • Invalid input data (numberOfPieces, baconType)
  • Not authorized (e.g., employee doesn’t have permissions to access a job)

A common model for failures in an API like this is to map each one onto an HTTP status code. You may have noticed that some of these failures are quite specific to ordering bacon, but others are generic enough that they would apply to other applications. For example, if Greybird wants to start offering eggs, they’d like to leverage much of the protocol and service stack already developed for bacon.

Bacon, Eggs and Reuse

A major drawback of using HTTP status codes for all failures is that a single piece of code needs to understand failures in both message transport and the application. This makes it nearly impossible to reuse error handling code for eggs. Testability suffers as well since testing the application now has a dependency on HTTP. Not to mention the ambiguity and brittleness that may be caused by overloading status codes.

A better model is to separate the protocol and corresponding failures into at least two layers: transport layer and application layer. The transport layer is responsible for sending and receiving messages via HTTP, but doesn’t care what the application does. The application layer knows the semantics of bacon ordering but doesn’t care how the orders came in.

Here’s how we might categorize the failures into layers.

Transport layer

  • Credentials missing or not authentic
  • Unanticipated error in the endpoint
  • Server is busy and can’t take more requests

Application layer

  • Invalid URL path
  • Order queue is full
  • No such machine ID
  • No such job ID
  • Invalid input data
  • Not authorized

Now that things are separated, we can map HTTP status codes onto transport layer failures.

401 -> Credentials missing or not authentic
500 -> Unanticipated error in the endpoint
503 -> Server is busy and can’t take more requests

Better yet, as we learned in my last post, let’s not define each and every possible failure in our protocol spec. Instead, enumerate the subset of HTTP status codes returned by the API.

400 Bad Request
401 Unauthorized
500 Internal Server Error
503 Service Unavailable
...etc...

Then, we enumerate the actions the client can take as a result of a failure.

enum ActionCode {
  Retry 
  DoNothing
  ObtainCredentials
}

In the protocol spec, we define which action code is implied by each HTTP status code. The client then acts according to the implied action code.

400 -> DoNothing
401 -> ObtainCredentials
500 -> DoNothing
503 -> Retry
...etc...

Bacon != Eggs

What about application layer failures? A convenient delivery mechanism for application layer failures is the body of a 200 status code response. An alternative model is to reserve a specific status code (other than 200) to indicate an application-level failure, and place failure details in the body. I will describe the former method.

In the case of using a 200 for application failure, the body is now a wrapper that tells whether the request was successful. If successful, the response data is found inside the wrapper.

{
  “status”: “Success”,
  “responseData”:
    {
      “machineId”: 0
      “jobId”: 243,
      “baconType”: “BlackForest”,
      “numberOfPieces”: 3
    }
}

Otherwise, the wrapper contains the failure.

{
  “status”: “Failure”,
  “error”: {
      “actionCode”: “Retry”,
      “details”: {
        “name”: “OrderQueueFull”,
        “errorCode”: “1234”,
        “description”: “Order queue is full. Wait and retry.”
      }
    }
}

The client only needs to check the status field. If the status is Failure, the client can then unwrap the error field and act according to the actionCode. Conversely if the status is Success, the client can then forward the response to the appropriate handler.

Summary

With this design, it is possible to keep application logic completely separate from the business of transporting messages. The benefits of such a design include independent reuse of transport and application logic, testability of both client and server applications and resilience against failure scenarios. It has worked well at several companies I’ve worked for, both in initial design phases and at scale.

Meanwhile, the employees of Greybird Labs are clinking their bacon strips as they toast to one more successful refactor. By the way, your bacon is done.

5 thoughts on “Layers of Failure, With a Side of Bacon

  1. topher1120

    There’s a concern with using a 200 with application errors. My team works with an API that works this way. The problem is that memory muscle of most developers and some http clients is to treat the 200 regardless of checking the body.

    Reply
  2. yakticus Post author

    The model of wrapping both success and failure cases inside of a 200 response has worked well in my experience, provided that there is a base library shared by a suite of applications that use this protocol layering technique. That way, most developers don’t even need to think about it. An alternative and equally viable approach is to overload a specific response code for errors.

    Reply
    1. topher1120

      I agree. If the service provider owns the client library, this works well. If not, the assumption that a 200 can have errors needs to be clearly conveyed in the web service documentation that (hopefully) the client developer will read.

      Reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s