Previously I wrote about propagating deadlines across services to establish an upper bound on timeouts. We avoid doing work for callers that have stopped listening, and for calls that fail quickly we can easily retry if time permits. For unresponsive services, however, we must either time out before the deadline or find an alternative retry strategy.
You could measure a target service's latency and set timeouts at the tail of its distribution. But there's a problem: latency distributions shift over time. Increased load or system faults can make a service slower, and hard-coded timeouts dialed in for happier conditions may then be too small, leading to spurious retries that add load to an already stressed system. This can create a feedback loop that knocks the service over entirely.
You could instead continually measure the target service's latency and set timeouts near the tail of the live distribution. TCP does something similar when setting retransmission timeouts.
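Roughly, TCP (per RFC 6298) keeps an exponentially weighted moving average of observed round-trip times along with their variance, and sets the retransmission timeout a few deviations above the average. Here's a minimal sketch of that estimator in Go; the `Estimator` type and its methods are hypothetical, but the constants are the RFC's:

```go
package main

import (
	"fmt"
	"time"
)

// Estimator tracks a smoothed round-trip time and its variance in the
// style of TCP's retransmission timer (RFC 6298).
type Estimator struct {
	srtt   time.Duration // smoothed round-trip time (SRTT)
	rttvar time.Duration // round-trip time variance (RTTVAR)
	seeded bool
}

// Observe folds one measured round-trip time into the estimate.
func (e *Estimator) Observe(rtt time.Duration) {
	if !e.seeded {
		// First sample: SRTT = RTT, RTTVAR = RTT/2.
		e.srtt, e.rttvar, e.seeded = rtt, rtt/2, true
		return
	}
	diff := e.srtt - rtt
	if diff < 0 {
		diff = -diff
	}
	// RTTVAR = 3/4*RTTVAR + 1/4*|SRTT - RTT|
	e.rttvar = (3*e.rttvar + diff) / 4
	// SRTT = 7/8*SRTT + 1/8*RTT
	e.srtt = (7*e.srtt + rtt) / 8
}

// Timeout returns SRTT + 4*RTTVAR, the RFC's retransmission timeout.
func (e *Estimator) Timeout() time.Duration {
	return e.srtt + 4*e.rttvar
}

func main() {
	var e Estimator
	samples := []time.Duration{
		90 * time.Millisecond, 110 * time.Millisecond,
		95 * time.Millisecond, 400 * time.Millisecond, // a slow outlier
	}
	for _, rtt := range samples {
		e.Observe(rtt)
	}
	fmt.Println("timeout:", e.Timeout())
}
```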
But what if response times depend on the contents of the request? Maybe your service generates output for a typical customer in 100 msec, but your largest and most important customers have far more data and take an order of magnitude longer to process. Those big customers all sit in the tail of your latency distribution, and now you're constantly timing out and failing their requests!
A service can be characterized by whether its response times are independent of request content. Key-value lookups against Memcache or DynamoDB may meet this criterion. For such services, setting timeouts from a measured latency distribution makes sense; we don't need to worry about consistently failing requests from specific large customers.
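For a service like that, one could recompute the timeout continuously from a sliding window of recent samples, taking a high percentile with a little headroom. A sketch, where the window contents, percentile, and padding factor are all illustrative:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// percentileTimeout picks the pct-th percentile of recent latency
// samples and pads it for headroom. Both parameters are illustrative.
func percentileTimeout(samples []time.Duration, pct, pad float64) time.Duration {
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(pct * float64(len(sorted)-1))
	return time.Duration(float64(sorted[idx]) * pad)
}

func main() {
	// Pretend these came from a sliding window of recent key-value GETs.
	window := []time.Duration{
		2 * time.Millisecond, 3 * time.Millisecond, 3 * time.Millisecond,
		4 * time.Millisecond, 4 * time.Millisecond, 5 * time.Millisecond,
		5 * time.Millisecond, 6 * time.Millisecond, 9 * time.Millisecond,
		14 * time.Millisecond,
	}
	fmt.Println("timeout:", percentileTimeout(window, 0.99, 1.5))
}
```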
Another approach is simply being super conservative with timeouts: pick a threshold that drastically exceeds even the worst case. For example, if invoice generation ranges from 100 to 1000 msec, a timeout of 5000 msec might be pretty safe. Fast services make it easier to set conservative timeouts while still leaving time to retry within an overall deadline.
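A conservative per-attempt timeout slots naturally inside the overall propagated deadline, retrying while room remains. Here's a sketch using Go's context deadlines; `callWithRetry` is a hypothetical helper, and the 5-second per-attempt timeout is just the invoice example above:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// callWithRetry runs fn with a generous per-attempt timeout, retrying
// failed attempts while the caller's overall deadline (carried by ctx)
// still leaves room for another full attempt.
func callWithRetry(ctx context.Context, fn func(context.Context) error) error {
	const attemptTimeout = 5 * time.Second // the "pretty safe" bound from above
	for {
		attemptCtx, cancel := context.WithTimeout(ctx, attemptTimeout)
		err := fn(attemptCtx)
		cancel()
		if err == nil {
			return nil
		}
		// If ctx carries no deadline we'd retry forever; a real
		// version would also cap the attempt count.
		if deadline, ok := ctx.Deadline(); ok && time.Until(deadline) < attemptTimeout {
			return fmt.Errorf("giving up before deadline: %w", err)
		}
	}
}

func main() {
	// Overall deadline propagated from the caller.
	ctx, cancel := context.WithTimeout(context.Background(), 12*time.Second)
	defer cancel()

	err := callWithRetry(ctx, func(ctx context.Context) error {
		select { // stand-in for the real RPC; honors cancellation
		case <-time.After(300 * time.Millisecond):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	})
	fmt.Println("err:", err)
}
```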
Predictable response times help. As mentioned before, this may not be achievable if the amount of work varies by request. For other cases, techniques like load shedding can keep queues and latency from spiralling out of control.
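As a taste of load shedding: bound the number of requests in flight and reject the rest immediately, so queues can't grow without limit. A minimal sketch as Go HTTP middleware; the `shed` wrapper and its `maxInFlight` cap are hypothetical, and real shedders typically weigh queue wait time or downstream health as well:

```go
package main

import "net/http"

// shed wraps a handler and rejects work beyond maxInFlight with a 503,
// keeping the queue (and therefore latency) bounded.
func shed(maxInFlight int, next http.Handler) http.Handler {
	sem := make(chan struct{}, maxInFlight) // counting semaphore
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}: // capacity available: admit the request
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default: // full: shed immediately instead of queueing
			http.Error(w, "overloaded, try again later", http.StatusServiceUnavailable)
		}
	})
}

func main() {
	work := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})
	http.ListenAndServe(":8080", shed(100, work))
}
```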
Finally, I want to mention the hedged requests idea from "The Tail at Scale" by Jeff Dean and Luiz André Barroso. Instead of timing out, you make a second request against a different service instance and wait for whichever response comes first. This technique can even apply in some cases where response time varies by request content.
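In the paper's formulation you don't fire both requests at once: you send the second only after the first has been outstanding longer than some short delay (they suggest around the 95th-percentile latency), which keeps the extra load small. A sketch of that shape in Go; `hedged`, `hedgeAfter`, and the replica call signatures are all hypothetical:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// hedged calls primary, and if no response arrives within hedgeAfter,
// also calls backup, returning whichever answer comes first.
func hedged[T any](ctx context.Context, hedgeAfter time.Duration,
	primary, backup func(context.Context) (T, error)) (T, error) {

	type result struct {
		val T
		err error
	}
	ch := make(chan result, 2) // buffered so the losing call can't leak

	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // tell the slower replica to stop once we have an answer

	go func() {
		v, err := primary(ctx)
		ch <- result{v, err}
	}()

	select {
	case r := <-ch: // primary answered (or failed) before the hedge fired
		return r.val, r.err
	case <-time.After(hedgeAfter):
		go func() {
			v, err := backup(ctx)
			ch <- result{v, err}
		}()
	}

	r := <-ch // first of the two responses wins
	return r.val, r.err
}

func main() {
	call := func(d time.Duration) func(context.Context) (string, error) {
		return func(ctx context.Context) (string, error) {
			select {
			case <-time.After(d):
				return "answered in " + d.String(), nil
			case <-ctx.Done():
				return "", ctx.Err()
			}
		}
	}
	// Slow primary, fast backup: the hedge wins.
	v, err := hedged(context.Background(), 50*time.Millisecond,
		call(200*time.Millisecond), call(10*time.Millisecond))
	fmt.Println(v, err)
}
```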