When redesigning a computational backend a few years ago, I had to determine timeouts for calling external dependencies. This is clearly critical for any service-oriented architecture, yet at the time I found only underwhelming articles and research online.
Some things are obvious. Having timeouts too small means spurious failures. Having them too big means lost opportunities to hide faults.
Less obvious is that when servicing interactive requests, there is an implicit deadline set by each caller's timeout. There isn't any point assembling a response if the caller isn't around to receive it.
How long a caller is willing to wait depends on their circumstances. A caller populating an interactive UI might only wait a few seconds. A caller firing off bulk requests for an asynchronous workflow might be fine waiting minutes.
This leads to a key insight: each requestor should make this deadline explicit as they perform calls. Only if a service doesn't receive an explicit deadline should it fall back to some default value.
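As a minimal sketch of this fallback behavior, assuming a hypothetical `x-deadline-seconds` header carries the caller's deadline (the header name and default value are illustrative, not from the original):

```python
DEFAULT_BUDGET_S = 5.0  # hypothetical service-wide default time budget

def remaining_budget(headers: dict) -> float:
    """Read the caller's explicit deadline from a (hypothetical) request
    header, falling back to the default only when the caller sent none."""
    raw = headers.get("x-deadline-seconds")
    if raw is None:
        return DEFAULT_BUDGET_S
    return float(raw)
```

The point is simply that the caller-supplied value, when present, always wins over the service's own guess.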
Each service marks the starting (monotonic) time when it begins processing a request. Before firing off calls to another service, it computes how much time remains under the original deadline. This diminished deadline is what we propagate to the service that we, in turn, are calling:
Here, Service A consumes x seconds itself before sending a request to Service B. We tell B that it has 5-x seconds to get its job done. Service B sucks up y seconds of our time budget. We then invoke Service C, indicating it has 5-x-y seconds to work with. If the time budget runs out, we can abort further processing.
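The bookkeeping above can be sketched with a small helper built on a monotonic clock (the class name and the 5-second budget are illustrative):

```python
import time

class Deadline:
    """Tracks a request's remaining time budget using a monotonic clock,
    so wall-clock adjustments can't corrupt the computation."""
    def __init__(self, budget_s: float):
        self._expires = time.monotonic() + budget_s

    def remaining(self) -> float:
        """Seconds left under the original deadline (never negative)."""
        return max(0.0, self._expires - time.monotonic())

    def expired(self) -> bool:
        return self.remaining() == 0.0

# Service A receives a 5-second deadline, then propagates whatever is left:
deadline = Deadline(5.0)
# ... x seconds of local work happen here ...
timeout_for_b = deadline.remaining()   # roughly 5 - x
# ... Service B consumes y seconds of the budget ...
timeout_for_c = deadline.remaining()   # roughly 5 - x - y
if deadline.expired():
    pass  # abort further processing: the caller is no longer waiting
```

Each outgoing call simply asks the same object how much budget is left, so the diminished deadline falls out automatically.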
Our deadline value is an estimate: it can't account for network transit times or service queueing delays. In the worst case, we'll do some work even after the client has gone away. However, when dealing with high-speed networks and services that shed load, it works out pretty well.
This technique only provides an upper bound for timeouts and retry loops. Without further information, it offers a reasonable default to use, though it leaves us vulnerable to unresponsive or abnormally slow service instances. Ideally, we would characterize target service latencies, either offline or in real time, to lower timeouts and provide opportunities for retries within the deadline.
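One way to combine the two signals is to cap each attempt's timeout by a characterized latency for the target service, leaving room for retries within the overall deadline. This is a sketch under stated assumptions: the p99 estimate, the 2x headroom factor, and the fair-share split are all illustrative choices, not prescribed by the text.

```python
def per_attempt_timeout(remaining_s: float, p99_latency_s: float,
                        attempts_left: int) -> float:
    """Pick a timeout for a single attempt: split the remaining deadline
    fairly across the attempts we still want, but never wait much longer
    than the service's characterized latency (a hypothetical p99 figure),
    so one abnormally slow instance can't eat the whole budget."""
    fair_share = remaining_s / max(1, attempts_left)
    return min(fair_share, p99_latency_s * 2)  # 2x headroom is an assumption
```

With a 4-second budget, a 200 ms p99, and 3 attempts planned, each attempt gets 400 ms rather than the full 1.3 seconds a naive split would allow, keeping two retries in reserve.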
A downside of propagating deadlines is that information must go from each service entry point to the code points making outgoing requests. I use common service boilerplate code to extract the parameters from an incoming request into an object. This object gets passed through the service routines. Finally, helper routines use this object to embed appropriate deadlines into outgoing requests.
The same object that propagates deadlines through a service can also be used to propagate a per-request UUID. This is then also included in outgoing requests. By emitting the UUID in application logs, distributed tracing and debugging become simpler.
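Such a context object might look like the following sketch, where the header names and default budget are hypothetical placeholders for whatever the service boilerplate actually extracts:

```python
import time
import uuid

class RequestContext:
    """Per-request object threaded through service routines, carrying the
    deadline and a request UUID (header names are illustrative)."""
    def __init__(self, headers: dict, default_budget_s: float = 5.0):
        budget = float(headers.get("x-deadline-seconds", default_budget_s))
        self._expires = time.monotonic() + budget
        # Reuse the caller's UUID if present; otherwise start a new trace.
        self.request_id = headers.get("x-request-id") or str(uuid.uuid4())

    def remaining(self) -> float:
        return max(0.0, self._expires - time.monotonic())

    def outgoing_headers(self) -> dict:
        """Helper that embeds the diminished deadline and the UUID
        into an outgoing request."""
        return {
            "x-deadline-seconds": f"{self.remaining():.3f}",
            "x-request-id": self.request_id,
        }

    def log(self, message: str) -> None:
        # Tagging every log line with the UUID makes cross-service
        # tracing a matter of grepping for one identifier.
        print(f"[{self.request_id}] {message}")
```

Service entry-point boilerplate constructs one of these per request; everything downstream just passes it along and calls `outgoing_headers()` when making its own calls.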
I've been using this deadline propagation technique successfully in production for a few years now. It works well, and coupled with a latency estimator it provides a principled way to handle timeouts and deadlines.