
Resilience

Index | Previous: Design Principles


APIs fail. Networks drop. Services crash. Resilience is about designing systems that handle failure gracefully rather than catastrophically.

Service Unavailable

Dynamically scaled services may experience brief periods of unavailability as instances come online. Both clients and servers need strategies to handle this gracefully.

Return 503 Service Unavailable (not 504, which indicates a gateway timeout) with a Retry-After header1 when the service is temporarily unavailable.
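
For example, a temporarily overloaded instance might answer like this (the 30-second hint and the body are illustrative):

HTTP/1.1 503 Service Unavailable
Retry-After: 30
Content-Type: application/json

{ "error": "Service temporarily unavailable. Retry after 30 seconds." }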

Exponential Backoff

Clients should implement retry logic with exponential backoff and jitter2:

async function fetchWithRetry(url, options, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await fetch(url, options);
      if (response.status !== 503 && response.status !== 429) {
        return response;
      }

      // Honour the server's Retry-After hint when present, otherwise back off exponentially.
      const retryAfter = response.headers.get('Retry-After');
      if (retryAfter) {
        await sleep(parseInt(retryAfter, 10) * 1000);
      } else {
        await sleepWithJitter(attempt);
      }
    } catch (err) {
      if (attempt === maxRetries - 1) throw err;
      await sleepWithJitter(attempt);
    }
  }
  throw new Error('Max retries exceeded');
}

function sleep(ms) {
  return new Promise(r => setTimeout(r, ms));
}

function sleepWithJitter(attempt) {
  const baseDelay = Math.pow(2, attempt) * 1000; // 1s, 2s, 4s, 8s, 16s
  const jitter = Math.random() * 1000;           // 0-1s random jitter
  return sleep(baseDelay + jitter);
}

Why jitter matters: Without jitter, when a service recovers from an outage, all waiting clients retry simultaneously, causing a thundering herd3 that brings the service down again.

Circuit Breaker Pattern

The circuit breaker4 prevents cascading failures by failing fast when a downstream service is unhealthy.

States:

  • Closed - Normal operation. Requests pass through. Track failure rate.
  • Open - Service is down. Fail immediately without calling downstream. Return cached data or graceful degradation.
  • Half-Open - After a timeout, allow a few test requests through. If they succeed, close the circuit. If they fail, reopen.
           failure threshold            timeout
┌────────┐     reached       ┌──────┐   elapsed   ┌───────────┐
│ Closed │──────────────────▶│ Open │────────────▶│ Half-Open │
└────────┘                   └──────┘             └───────────┘
    ▲                            ▲                   │     │
    │          success           │       failure     │     │
    │                            └───────────────────┘     │
    └──────────────────────────────────────────────────────┘

Configuration:

  • Failure threshold - How many failures before opening (e.g., 5 failures in 60 seconds)
  • Open duration - How long to stay open before trying half-open (e.g., 30 seconds)
  • Half-open limit - How many test requests to allow (e.g., 3 requests)
class CircuitBreaker {
  constructor(options = {}) {
    this.failureThreshold = options.failureThreshold || 5;
    this.resetTimeout = options.resetTimeout || 30000;
    this.state = 'CLOSED';
    this.failures = 0;
    this.lastFailure = null;
  }

  async call(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailure > this.resetTimeout) {
        // Open duration has elapsed; let a test request through.
        this.state = 'HALF_OPEN';
      } else {
        // Fail fast without touching the downstream service.
        throw new Error('Circuit breaker is open');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }

  onSuccess() {
    // Any success, including a half-open test request, closes the circuit.
    this.failures = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    if (this.failures >= this.failureThreshold) {
      // Also re-opens the circuit when a half-open test request fails.
      this.state = 'OPEN';
    }
  }
}
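
Usage is a thin wrapper around each downstream call; the endpoint and payload below are hypothetical:

const paymentBreaker = new CircuitBreaker({ failureThreshold: 5, resetTimeout: 30000 });

// Every call to the payment service goes through the breaker; while the circuit is open,
// callers fail fast instead of queuing up behind a dead dependency.
const receipt = await paymentBreaker.call(() =>
  fetchWithRetry('https://payments.example.com/charges', {
    method: 'POST',
    body: JSON.stringify({ amount: 1000 }),
  })
);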

Libraries like opossum5 (Node.js), resilience4j6 (Java), or Polly7 (.NET) provide production-ready implementations.

Graceful Degradation

When circuits open, don't just fail. Degrade gracefully:

  • Return cached data (even if stale)
  • Return partial results
  • Use fallback services
  • Queue requests for later processing

The goal is to keep the user experience functional, even if imperfect. Netflix's approach to graceful degradation8 is worth studying.
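
A minimal sketch of the cached-fallback approach, reusing the CircuitBreaker above (the recommendations client and cache helper are assumed names, not part of the original example):

async function getRecommendations(userId) {
  try {
    return await recommendationsBreaker.call(() => recommendationsClient.fetch(userId));
  } catch (err) {
    // Circuit open or call failed: degrade rather than error out.
    const cached = await cache.get(`recommendations:${userId}`); // possibly stale
    if (cached) return cached;
    return [];                                                   // empty but functional fallback
  }
}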

Timeouts

Every external call needs a timeout. Without timeouts, a slow downstream service can exhaust your connection pool and bring down your entire system.

Timeout guidelines:

  • Connect timeout - How long to wait for a connection (1-5 seconds)
  • Read timeout - How long to wait for a response (varies by operation)
  • Total timeout - Maximum time for the entire operation including retries
const controller = new AbortController();
// Abort the request if no response arrives within 5 seconds.
const timeout = setTimeout(() => controller.abort(), 5000);

try {
  const response = await fetch(url, { signal: controller.signal }); // rejects with an AbortError on timeout
} finally {
  clearTimeout(timeout);
}

Set timeouts based on your SLAs. If you promise p99 latency of 200ms, your downstream timeouts need to be well under that.

Bulkheads

Bulkheads9 isolate failures to prevent them from spreading. Named after ship compartments that contain flooding.

Implementation approaches:

  • Thread pool isolation - Each dependency gets its own thread pool. If one fills up, others continue working.
  • Connection pool isolation - Separate connection pools per service.
  • Semaphore isolation - Limit concurrent requests to each dependency.
class Bulkhead {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.current = 0;
  }

  async execute(fn) {
    // Semaphore-style isolation: reject immediately rather than letting callers pile up.
    if (this.current >= this.maxConcurrent) {
      throw new Error('Bulkhead full');
    }

    this.current++;
    try {
      return await fn();
    } finally {
      this.current--;
    }
  }
}

If your payment service is slow, it shouldn't take down your product catalog.
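
For example, give each dependency its own bulkhead so they cannot starve one another (the sizes and client names below are illustrative):

const paymentBulkhead = new Bulkhead(10);  // payment calls may use at most 10 concurrent slots
const catalogBulkhead = new Bulkhead(50);  // catalog traffic is isolated in its own pool

// A slow payment service can only ever tie up its own 10 slots;
// catalog requests keep flowing through their separate bulkhead.
const receipt = await paymentBulkhead.execute(() => paymentClient.charge(order));
const products = await catalogBulkhead.execute(() => catalogClient.search(query));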

Caching

Caching is inevitable. I have worked on systems with so much caching wired in between the layers that it became impossible to find out who was holding on to what. Caches at the CDN, the gateway, the service, the ORM, the database. Each layer trying to be "helpful." The result is stale data, mysterious inconsistencies, and debugging nightmares.

The pragmatic approach: Provide sensible caching at service boundaries that can be controlled by standard HTTP headers10. This satisfies consumers who want performance while keeping the system of record authoritative.

Cache-Control Headers

Use standard HTTP caching headers rather than inventing your own:

Cache-Control: max-age=3600, must-revalidate
ETag: "abc123"
Last-Modified: Wed, 21 Oct 2024 07:28:00 GMT
Vary: Accept-Encoding, Authorization

Key directives:

| Directive | Purpose |
| --- | --- |
| max-age | How long (seconds) the response can be cached |
| no-cache | Must revalidate with origin before using cached copy |
| no-store | Never cache this response (sensitive data) |
| private | Only the browser can cache, not shared caches (CDNs) |
| public | Any cache can store this response |
| must-revalidate | Once stale, must check origin before using |

What to Cache

Good candidates for caching:

  • Reference data that changes infrequently (country lists, currency codes)
  • Public content (product catalogs, published articles)
  • Computed results that are expensive to generate
  • Static assets (images, documents)

Never cache:

  • User-specific data without proper Vary headers
  • Sensitive information (use Cache-Control: no-store)
  • Data from the system of record that must be current (account balances, inventory counts)
  • Responses to POST, PUT, PATCH, DELETE (by HTTP specification)

Caching Layers

┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│ Browser │───▶│   CDN   │───▶│ Gateway │───▶│ Service │───▶│   DB    │
└─────────┘    └─────────┘    └─────────┘    └─────────┘    └─────────┘
   Cache 1       Cache 2       Cache 3        Cache 4       The Truth

The rule: The closer to the user, the less authoritative. The closer to the database, the more current. Design your cache strategy with this in mind.

Cache Invalidation

"There are only two hard things in Computer Science: cache invalidation and naming things." - Phil Karlton11

Strategies:

  1. Time-based expiry (TTL) - Simple but may serve stale data
  2. Event-driven invalidation - Publish events when data changes, caches subscribe
  3. Conditional requests - Use ETags12 to check if cached data is still valid
GET /products/123
If-None-Match: "abc123"

# If still valid:
HTTP/1.1 304 Not Modified

# If changed:
HTTP/1.1 200 OK
ETag: "def456"
{...new data...}

Avoid Cache Stampedes

When a popular cached item expires, many requests hit the origin simultaneously. Strategies to prevent this:

  • Stale-while-revalidate13 - Serve stale data while refreshing in background
  • Lock-based refresh - Only one request refreshes, others wait
  • Probabilistic early refresh - Randomly refresh before expiry
Cache-Control: max-age=3600, stale-while-revalidate=60

This allows the cache to serve stale data for up to 60 seconds while fetching a fresh copy.

Redis as a Cache

For service-level caching, Redis14 is the common choice:

async function getCachedProduct(id) {
  const cached = await redis.get(`product:${id}`);
  if (cached) {
    return JSON.parse(cached);
  }

  const product = await db.products.findById(id);
  if (product) {
    // Cache for an hour; a TTL on every key stops the cache growing unbounded.
    await redis.setex(`product:${id}`, 3600, JSON.stringify(product));
  }
  return product;
}
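
To combine this with the stampede protection described earlier, a lock-based refresh might look like the following sketch, assuming an ioredis-style client (the same style as the example above), where SET with NX and EX is an atomic "set if absent, with expiry":

async function getCachedProductSafe(id) {
  const key = `product:${id}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  // Only the request that wins the lock hits the database; the lock expires on its own,
  // so a crashed refresher cannot wedge the key.
  const gotLock = await redis.set(`${key}:lock`, '1', 'EX', 10, 'NX');
  if (gotLock !== 'OK') {
    await new Promise(r => setTimeout(r, 100));   // someone else is refreshing: brief wait, re-check
    const retry = await redis.get(key);
    if (retry) return JSON.parse(retry);          // otherwise fall through and load it ourselves
  }

  const product = await db.products.findById(id);
  if (product) await redis.setex(key, 3600, JSON.stringify(product));
  if (gotLock === 'OK') await redis.del(`${key}:lock`);
  return product;
}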

Guidelines:

  • Set TTLs on everything (don't let caches grow unbounded)
  • Use consistent key naming ({entity}:{id})
  • Consider using hash structures for related data
  • Monitor cache hit rates; low rates suggest wrong strategy

Debugging Caching Issues

When you can't find "who's holding onto what":

  1. Add cache headers to responses - Include X-Cache: HIT or X-Cache: MISS
  2. Log cache interactions - Know when data is served from cache vs origin
  3. Use unique request IDs - Trace a request through all layers
  4. Document your cache topology - Everyone should know where caches exist

The goal is transparency: anyone debugging the system should be able to determine where data came from and how fresh it is.
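
A sketch of point 1, assuming an Express-style handler and the Redis cache from above:

app.get('/products/:id', async (req, res) => {
  const cached = await redis.get(`product:${req.params.id}`);
  if (cached) {
    res.set('X-Cache', 'HIT');                  // data came from the service-level cache
    return res.json(JSON.parse(cached));
  }

  const product = await db.products.findById(req.params.id);
  res.set('X-Cache', 'MISS');                   // data came from the system of record
  res.json(product);
});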

Observability and Metrics

Observability is the ability to understand what's happening inside a system by examining its outputs. Plan for this at design time — bolting it on after launch is expensive and incomplete.

The Three Pillars

| Pillar | What It Tells You | API Design Implications |
| --- | --- | --- |
| Metrics | How the system is performing (quantitative) | Define what to measure, expose a /metrics endpoint |
| Logs | What happened (events, errors, decisions) | Structured logging format, correlation IDs |
| Traces | How a request flowed through the system | Distributed tracing headers, span context propagation |

SLIs, SLOs, and Error Budgets

Define these at design time, not after launch.

  • SLI (Service Level Indicator) — a quantitative measure of service health. Examples: request latency (p99), error rate, availability, throughput.
  • SLO (Service Level Objective) — a target value for an SLI. Example: "p99 latency < 200ms" or "error rate < 0.1%".
  • SLA (Service Level Agreement) — a contractual promise to consumers based on SLOs. Breaking an SLA has business consequences.
  • Error Budget — the amount of unreliability you can tolerate (100% minus SLO). If your SLO is 99.9% availability, your error budget is 0.1% downtime (~8.7 hours per year). When the budget is spent, freeze deployments and focus on reliability.

Define SLOs for at minimum:

  • Availability — percentage of successful requests (non-5xx)
  • Latency — p50, p95, p99 response times
  • Error rate — percentage of requests returning errors
  • Throughput — requests per second

Align SLOs with downstream timeout budgets. If your SLO is p99 < 200ms but a downstream call has a 5-second timeout, something is misconfigured.

See the Google SRE Book15 for a thorough treatment of SLIs, SLOs, and error budgets.

RED Metrics

For every endpoint, plan to capture RED metrics16:

  • Rate — requests per second
  • Errors — error count/rate by status code
  • Duration — latency distribution (p50, p95, p99)

Also track saturation metrics:

  • Connection pool utilisation
  • Thread/goroutine count
  • Memory usage
  • Queue depth

A note on cardinality: Do not use high-cardinality values (user IDs, request IDs, full URL paths with resource IDs) as metric labels. High-cardinality labels explode storage costs and make queries unusably slow. Use traces for high-cardinality data, not metric labels.

Expose a /metrics endpoint in Prometheus exposition format17 or equivalent for your observability stack.
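
A minimal sketch using Express and the prom-client library (both are assumptions; any equivalent stack works):

const express = require('express');
const client = require('prom-client');

const register = new client.Registry();
client.collectDefaultMetrics({ register });      // process-level saturation metrics

// RED: rate and errors come from the labels, duration from the histogram buckets.
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request latency',
  labelNames: ['method', 'route', 'status'],     // low-cardinality labels only: route template, not full path
  buckets: [0.05, 0.1, 0.2, 0.5, 1, 2, 5],
  registers: [register],
});

const app = express();

app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    end({ method: req.method, route: req.route?.path || req.path, status: res.statusCode });
  });
  next();
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});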

Structured Logging

Log in structured JSON rather than unstructured text. Every log entry should include:

{
  "timestamp": "2024-01-15T09:30:00.123Z",
  "level": "error",
  "service": "orders-api",
  "correlationId": "req-abc123",
  "requestId": "7f3d9c",
  "message": "Payment service timeout",
  "durationMs": 5001,
  "downstream": "payment-service"
}
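
For example, a structured logger such as pino (an assumption; any JSON logger works) would emit an entry like the one above from:

const pino = require('pino');

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  base: { service: 'orders-api' },              // stamped onto every entry
});

// Structured fields go in the first argument, the human-readable message in the second.
logger.error(
  { correlationId: 'req-abc123', durationMs: 5001, downstream: 'payment-service' },
  'Payment service timeout'
);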

Use log levels consistently:

  • ERROR — failures requiring action, service-affecting problems
  • WARN — degradation, unusual conditions, approaching limits
  • INFO — business events, state transitions, successful operations
  • DEBUG — development detail, removed or sampled out in production

Never log sensitive data: passwords, tokens, full credit card numbers, or PII. Log the request ID, not the request body. Use a centralised log aggregation system (ELK, Loki, CloudWatch, Datadog) — individual service logs are insufficient for debugging distributed systems or detecting coordinated attacks.

Distributed Tracing

Distributed tracing gives you a complete picture of how a request flows across services. Without it, debugging a latency issue in a system with 10 services is nearly impossible.

Use OpenTelemetry18 as the instrumentation standard. Propagate W3C Trace Context19 headers across all service boundaries:

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: vendor-specific-data

Create spans for key operations:

  • HTTP handler (top-level span)
  • Database queries
  • External API calls
  • Queue publish/consume operations
  • Cache lookups

Configure trace sampling: 100% tracing at production scale is expensive. Sample 1–10% in production, 100% for errors and slow requests.
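
A sketch of wrapping a downstream call in a span with the OpenTelemetry JavaScript API (assuming @opentelemetry/api is installed, an SDK/tracer provider is configured elsewhere, and paymentClient stands in for any downstream client):

const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('orders-api');

async function chargePayment(order) {
  return tracer.startActiveSpan('payment-service.charge', async (span) => {
    span.setAttribute('order.id', order.id);     // span attributes can hold high-cardinality data
    try {
      return await paymentClient.charge(order);  // hypothetical downstream client
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}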

Alerting

Alerts should fire on user-facing impact, not just infrastructure metrics.

Alert on SLO burn rate, i.e. how fast you are consuming the error budget, not just on whether the error rate is above a threshold right now. A brief spike that burns the budget at 10× the sustainable rate is more urgent than a sustained burn at 2×, even if the absolute numbers are similar.
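
As a rough illustration of the arithmetic (not any particular tool's formula):

// Burn rate = how many times faster than "sustainable" you are spending the error budget.
// A burn rate of 1 means the budget lasts exactly the SLO window; 14 means it is gone in 1/14th of it.
function burnRate(observedErrorRate, slo) {
  const errorBudget = 1 - slo;            // SLO 0.999 -> budget 0.001 (0.1% of requests may fail)
  return observedErrorRate / errorBudget;
}

burnRate(0.014, 0.999);  // 14: a 30-day budget would be exhausted in roughly 2 days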

Every alert needs:

  • A clear description of the user-facing impact
  • A runbook or at least a first diagnostic step
  • An escalation path (who gets paged, in what order)
  • A way to silence it while you're actively working the issue

Avoid alert fatigue. An on-call engineer who receives 50 alerts per night will start ignoring them. Tune alert thresholds so every page is actionable.




Written by Philip A Senger | LinkedIn | GitHub

This work is licensed under a Creative Commons Attribution 4.0 International License.

Previous: Design Principles | Next: Payloads and Errors

Footnotes

  1. Nottingham, M. and Fielding, R. (2012). "Additional HTTP Status Codes." RFC 6585, IETF. https://datatracker.ietf.org/doc/html/rfc6585

  2. AWS. "Exponential Backoff and Jitter." AWS Architecture Blog. https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/

  3. Wikipedia. "Thundering herd problem." https://en.wikipedia.org/wiki/Thundering_herd_problem

  4. Fowler, Martin. (2014). "CircuitBreaker." https://martinfowler.com/bliki/CircuitBreaker.html

  5. Nodeshift. "Opossum - Circuit Breaker for Node.js." https://github.com/nodeshift/opossum

  6. Resilience4j. "Fault tolerance library for Java." https://github.com/resilience4j/resilience4j

  7. App-vNext. "Polly - .NET resilience and transient-fault-handling library." https://github.com/App-vNext/Polly

  8. Netflix. "Making the Netflix API More Resilient." Netflix Tech Blog. https://netflixtechblog.com/making-the-netflix-api-more-resilient-a8ec62f4b4f7

  9. Microsoft. "Bulkhead pattern." Azure Architecture Patterns. https://learn.microsoft.com/en-us/azure/architecture/patterns/bulkhead

  10. Fielding, R. et al. (2014). "Hypertext Transfer Protocol (HTTP/1.1): Caching." RFC 7234, IETF. https://datatracker.ietf.org/doc/html/rfc7234

  11. Attributed to Phil Karlton. https://martinfowler.com/bliki/TwoHardThings.html

  12. Fielding, R. and Reschke, J. (2014). "Hypertext Transfer Protocol (HTTP/1.1): Conditional Requests." RFC 7232, IETF. https://datatracker.ietf.org/doc/html/rfc7232

  13. Nottingham, M. (2010). "HTTP Cache-Control Extensions for Stale Content." RFC 5861, IETF. https://datatracker.ietf.org/doc/html/rfc5861

  14. Redis. "In-memory data structure store." https://redis.io/

  15. Google. "Site Reliability Engineering — Service Level Objectives." https://sre.google/sre-book/service-level-objectives/

  16. Wilkie, Tom. (2018). "The RED Method: how to instrument your services." Grafana Blog. https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services/

  17. Prometheus. "Exposition Formats." https://prometheus.io/docs/instrumenting/exposition_formats/

  18. OpenTelemetry. "Vendor-agnostic observability framework." https://opentelemetry.io/

  19. W3C. "Trace Context." https://www.w3.org/TR/trace-context/