Building Resilient Systems: A Practical Guide to Designing APIs, Microservices, and Infrastructure for High Availability

16 min read
SyntheBrain Team
Resilient Systems · High Availability · Circuit Breaker · Kubernetes · Observability
[Figure: Resilient system architecture showing the API Gateway, microservices, and a multi-AZ deployment]

In today's always-on digital landscape, system failures are inevitable. What separates good systems from great ones is resilience—the ability to recover gracefully from failures and continue operating. Whether you're dealing with an unresponsive external API, a message queue backlog, or a full-blown regional outage, your system should be designed to bounce back.

This article delves into building resilient systems, focusing on API Gateways, microservices, messaging systems, databases, and external services. We'll also explore how to architect for multi-AZ and multi-region failures, and how mechanisms like HPA and VPA, together with SLAs and SLOs, play a crucial role.

The Core Components of a Resilient System

To design a resilient distributed system, focus on the following core components:

API Gateway

Entry point to your services. Should handle rate limiting, caching, authentication, and graceful error handling.

Microservices

Independent, loosely coupled services that can fail without bringing down the whole system.

Messaging (Event Bus)

Asynchronous communication between services using queues or topics. Enables decoupling and backpressure handling.

Databases

The backbone for persistence. Must support high availability (HA), replication, and backup strategies.

External Services

Third-party APIs are unpredictable dependencies; plan for slowness, errors, and unavailability.

Resilient System Architecture

Multi-Layer Defense Strategy

Comprehensive resilience across all system layers with failover capabilities

[Diagram: a global load balancer routes traffic to Region A and Region B; within each region, a multi-AZ API Gateway (AZ-1, AZ-2, AZ-3) fronts microservices protected by circuit breakers (User, Order, Payment, and Notification services), backed by a data layer with HA and backups: a primary DB, a replica DB, and a message queue.]
Auto-Recovery

Circuit breakers, health checks, and automatic failover mechanisms

Multi-Layer Defense

Redundancy at every layer: Load balancers, API gateways, services, and data

Observability

Comprehensive monitoring, tracing, and alerting for proactive issue detection

Common Failure Scenarios & Detailed Solutions

External Service Is Slow or Unavailable

Problem: Third-party APIs may experience downtime, slow responses, or rate limiting.

Recommendations:

Circuit Breaker Pattern

Utilise libraries such as Resilience4j (Netflix Hystrix is now in maintenance mode). Set failure thresholds and fallback behaviours; this prevents retry storms. A combined sketch follows this list.

Implementation: Open the circuit for 30 seconds when the failure rate exceeds 50% over a 20-request window
Timeouts

Always define timeouts (e.g., 2-3s max). Never rely on default indefinite waits.

Implementation: Set connection timeout: 1s, read timeout: 3s
Retries with Exponential Backoff

Retry 2-3 times with increasing delay (e.g., 100ms, 300ms, 900ms). Add jitter to avoid a thundering herd.

Implementation: Base delay: 100ms, multiplier: 3, max delay: 5s, jitter: ±20%
Fallbacks

Return cached or default data. For example, show 'temporarily unavailable' instead of blocking the UI.

Implementation: Cache last successful response for 5 minutes, serve stale data during outages
Caching

Cache infrequently changing data (e.g., currency exchange rates) using Redis or CDN.

Implementation: TTL: 1 hour for reference data, 5 minutes for dynamic data
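
To make the circuit breaker, retry, and fallback recommendations above concrete, here is a minimal Resilience4j sketch; the external API call, the cache, and the "fx-api" name are illustrative assumptions rather than a prescribed implementation:

    import io.github.resilience4j.circuitbreaker.CircuitBreaker;
    import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
    import io.github.resilience4j.core.IntervalFunction;
    import io.github.resilience4j.retry.Retry;
    import io.github.resilience4j.retry.RetryConfig;
    import java.time.Duration;
    import java.util.function.Supplier;

    public class ExchangeRateClient {

        // Open the circuit at a 50% failure rate over a 20-call window; stay open for 30 seconds.
        private final CircuitBreaker circuitBreaker = CircuitBreaker.of("fx-api",
                CircuitBreakerConfig.custom()
                        .failureRateThreshold(50)
                        .slidingWindowSize(20)
                        .waitDurationInOpenState(Duration.ofSeconds(30))
                        .build());

        // Retry up to 3 attempts with exponential backoff: 100ms base, x3 multiplier, +/-20% jitter.
        private final Retry retry = Retry.of("fx-api",
                RetryConfig.custom()
                        .maxAttempts(3)
                        .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(100, 3.0, 0.2))
                        .build());

        public double getRate(String currency) {
            Supplier<Double> guarded = Retry.decorateSupplier(retry,
                    CircuitBreaker.decorateSupplier(circuitBreaker, () -> callExternalApi(currency)));
            try {
                return guarded.get();
            } catch (Exception e) {
                // Fallback: serve the last cached value instead of failing the whole request.
                return cachedRate(currency);
            }
        }

        private double callExternalApi(String currency) { /* HTTP call with bounded timeouts */ return 1.0; }

        private double cachedRate(String currency) { /* stale-but-usable value from Redis or memory */ return 1.0; }
    }

The timeouts themselves belong on the HTTP client (e.g., 1s connect, 3s read) so the breaker and retries operate on fast, bounded failures rather than indefinite waits.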

Inter-service Communication Failure or Latency

Problem: REST or gRPC calls between services fail due to timeouts, overload, or downstream issues.

Recommendations:

Service Mesh (Istio, Linkerd)

Enforce retry policies, circuit breakers, mTLS, and observability uniformly across services.

Implementation: Configure Istio with 3 retries, 25ms base interval, circuit breaker at 5 consecutive failures
Bulkheads

Allocate resources (e.g., threads, memory) per service to prevent one from starving others.

Implementation: Separate thread pools: 50 threads for critical services, 20 for non-critical
Asynchronous Communication

Use Kafka or RabbitMQ for loosely coupled communication. Add retry logic at the consumer level.

Implementation: Kafka with 3 partitions, consumer groups with dead letter topic after 5 retries
Idempotent APIs

Design APIs to safely handle retries. Use request IDs to de-duplicate operations (see the sketch after this list).

Implementation: Include UUID in request header, store processed IDs in Redis for 24 hours
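
As a rough sketch of the idempotency recommendation, the following uses the Jedis client to claim a request ID atomically; the key prefix, value, and class name are assumptions:

    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.params.SetParams;

    public class IdempotencyGuard {

        private static final int TTL_SECONDS = 24 * 60 * 60; // keep processed IDs for 24 hours

        private final Jedis redis;

        public IdempotencyGuard(Jedis redis) {
            this.redis = redis;
        }

        // Returns true if this request ID was claimed now, false if it was already processed.
        public boolean claim(String requestId) {
            // SET ... NX EX: succeeds only if the key does not already exist, so the check is atomic.
            String result = redis.set("idempotency:" + requestId, "processed",
                    SetParams.setParams().nx().ex(TTL_SECONDS));
            return "OK".equals(result);
        }
    }

A handler would call claim() with the UUID from the request header and, when it returns false, replay the previously stored response instead of re-executing the operation.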

Messaging System Backlog or Downtime

Problem: Message queues get backlogged or fail due to slow consumers or broker issues.

Recommendations:

Consumer Auto-Scaling

Use HPA or KEDA to scale consumers based on queue length or lag.

Implementation: KEDA: Scale from 2 to 50 replicas when lag > 100 messages
Dead Letter Queue (DLQ)

Redirect unprocessable messages to a DLQ for later analysis (a consumer sketch follows this list).

Implementation: Move to DLQ after 3 processing attempts, retain for 7 days
Batching & Prefetching

Process messages in batches to increase throughput.

Implementation: Batch size: 100 messages, prefetch: 200, commit every 5 seconds
Monitoring

Track message lag, consumer throughput, and DLQ volume.

Implementation: Alert when lag > 1000 messages or DLQ growth > 10 messages/minute
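
To illustrate the DLQ recommendation, here is a simplified Kafka consumer that retries a record a few times and then parks it on a dead letter topic; the topic names, the process() method, and manual offset commits (auto-commit disabled in consumerProps) are assumptions:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class OrderEventsConsumer {

        private static final int MAX_ATTEMPTS = 3;            // move to DLQ after 3 processing attempts
        private static final String DLQ_TOPIC = "orders.dlq"; // hypothetical dead letter topic

        private final KafkaConsumer<String, String> consumer;
        private final KafkaProducer<String, String> dlqProducer;

        public OrderEventsConsumer(Properties consumerProps, Properties producerProps) {
            this.consumer = new KafkaConsumer<>(consumerProps);
            this.dlqProducer = new KafkaProducer<>(producerProps);
        }

        public void run() {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    handleWithRetry(record);
                }
                consumer.commitSync(); // commit only after the batch is handled or dead-lettered
            }
        }

        private void handleWithRetry(ConsumerRecord<String, String> record) {
            for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
                try {
                    process(record.value()); // business logic, assumed to throw on failure
                    return;
                } catch (Exception e) {
                    if (attempt == MAX_ATTEMPTS) {
                        // Retries exhausted: park the message instead of blocking the partition.
                        dlqProducer.send(new ProducerRecord<>(DLQ_TOPIC, record.key(), record.value()));
                    }
                }
            }
        }

        private void process(String payload) { /* ... */ }
    }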

Database Outage or Performance Degradation

Problem: A database is often both a single point of failure and a performance bottleneck.

Recommendations:

Read Replicas

Offload reads from the primary. Use read-write splitting via proxies like ProxySQL.

Implementation: 3 read replicas across AZs, route 80% reads to replicas
Failover Mechanisms

Use managed services with auto-failover (e.g., AWS RDS Multi-AZ).

Implementation: RDS Multi-AZ with 1-2 minute automatic failover
Connection Pooling

Avoid connection exhaustion. Tune HikariCP or PgBouncer (a pool configuration sketch follows this list).

Implementation: Pool size: 20-30 connections per service instance
Data Partitioning

Use sharding for horizontal scaling. Maintain idempotent writes.

Implementation: Shard by user_id hash, 4 shards initially, plan for 16 shards
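
A minimal HikariCP configuration in line with the pool-size guidance above might look like this; the JDBC URL, credential source, and exact numbers are illustrative assumptions:

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;

    public class OrderDataSource {

        public static HikariDataSource create() {
            HikariConfig config = new HikariConfig();
            config.setJdbcUrl("jdbc:postgresql://orders-db.internal:5432/orders"); // hypothetical endpoint
            config.setUsername(System.getenv("DB_USER"));
            config.setPassword(System.getenv("DB_PASSWORD"));
            config.setMaximumPoolSize(25);          // within the 20-30 connections per instance guideline
            config.setMinimumIdle(5);               // keep a few warm connections for burst traffic
            config.setConnectionTimeout(3_000);     // fail fast (ms) instead of queueing indefinitely
            config.setMaxLifetime(30 * 60 * 1_000); // recycle connections periodically (ms)
            return new HikariDataSource(config);
        }
    }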

API Gateway Outage or Bottleneck

Problem: If the API Gateway fails, all downstream services become inaccessible.

Recommendations:

HA Deployment

Deploy multiple replicas across AZs.

Implementation: 3 replicas minimum, spread across 3 AZs
Global Load Balancer

Use Route 53 + CloudFront for geo-aware routing and failover.

Implementation: Health checks every 30 seconds, failover within 60 seconds
Rate Limiting & Quotas

Prevent abuse and resource starvation.

Implementation: 1000 requests/minute per API key, burst limit: 100 requests/second
Request Tracing

Correlate requests across services using an x-request-id header (a filter sketch follows this list).

Implementation: Generate UUID at gateway, propagate through all service calls
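
One simple way to implement the request tracing recommendation is an edge filter that guarantees every request carries an x-request-id; this sketch assumes a Jakarta Servlet-based gateway and is only illustrative:

    import jakarta.servlet.Filter;
    import jakarta.servlet.FilterChain;
    import jakarta.servlet.ServletException;
    import jakarta.servlet.ServletRequest;
    import jakarta.servlet.ServletResponse;
    import jakarta.servlet.http.HttpServletRequest;
    import jakarta.servlet.http.HttpServletResponse;
    import java.io.IOException;
    import java.util.UUID;

    public class RequestIdFilter implements Filter {

        private static final String HEADER = "x-request-id";

        @Override
        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            HttpServletRequest request = (HttpServletRequest) req;
            HttpServletResponse response = (HttpServletResponse) res;

            // Reuse the caller's ID if present, otherwise mint one at the edge.
            String requestId = request.getHeader(HEADER);
            if (requestId == null || requestId.isBlank()) {
                requestId = UUID.randomUUID().toString();
            }

            // Echo the ID back to the client and expose it to downstream code and logging.
            response.setHeader(HEADER, requestId);
            req.setAttribute(HEADER, requestId);

            chain.doFilter(req, res);
        }
    }

Downstream services then forward the same header on every outbound call so one ID ties the whole request path together.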

Resilience Across Availability Zones and Regions

Multi-AZ Strategy

Distribute pods and nodes so that workloads span multiple availability zones

Replicate stateful workloads (databases and message queues) across zones with AZ-aware replication

Route traffic across AZs through load balancers

Use readiness probes to remove unhealthy pods

Multi-Region Strategy

Active-Passive Setup

Replicate data. Use DNS failover with health checks

Active-Active Setup

Use globally distributed services (e.g., DynamoDB Global Tables)

Data Replication

Async replication for performance, consider consistency trade-offs

CI/CD

Deploy to both regions using GitOps and promote when healthy

Operational Resilience: SLAs, SLOs, SLIs

SLA (Service Level Agreement): what you promise to customers. Example: 99.9% uptime, i.e. less than roughly 43.8 minutes of downtime per month.

SLO (Service Level Objective): what you target internally. Example: 99.95%, i.e. roughly 21.9 minutes per month.

SLI (Service Level Indicator): the metrics you measure. Examples: error rate, availability, latency.
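
These downtime budgets follow directly from the allowed error fraction over a 730-hour month:

    (1 - 0.999)  x 730 h x 60 min/h ≈ 43.8 minutes/month
    (1 - 0.9995) x 730 h x 60 min/h ≈ 21.9 minutes/month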

Recommendations:

Use tools like SLO Tracker, Nobl9, or custom Prometheus queries

Tie alerts to SLOs, not system metrics (avoid alert fatigue)

Kubernetes Native Tools for Resilience

Horizontal Pod Autoscaler (HPA)

What: Auto-scales based on CPU, memory, or custom metrics.
How: Define threshold (e.g., CPU > 70%), min/max pods.
Tooling: Use KEDA for queue-based scaling.

Vertical Pod Autoscaler (VPA)

What: Adjusts CPU/memory requests based on real usage.
How: Run it in recommendation-only mode (updateMode: Off) first, then move to Auto mode with safeguards.
Tip: Monitor resource utilisation patterns over time.

Pod Disruption Budgets (PDB)

What: Prevents voluntary disruptions from evicting too many pods.
How: Configure minAvailable or maxUnavailable.
Tip: Essential for rolling updates and cluster maintenance.

Liveness & Readiness Probes

What: Health checks to ensure pod reliability.
How: Liveness: Restart if unhealthy. Readiness: Hold traffic until ready.
Tip: Configure appropriate timeouts and retry intervals.

Observability & Troubleshooting

What to Monitor

API Gateway

Latency, 5xx rates

Microservices

Error rates, CPU/memory

Queues

Lag, DLQ volumes

Database

Slow queries, connection counts

Recommended Tools

Metrics

Prometheus + Grafana

Tracing

OpenTelemetry + Jaeger or Tempo

Logs

Fluent Bit, EFK, or Loki

Alerting

Alertmanager, PagerDuty, OpsGenie

Observability Best Practices

Standardise telemetry formats
Correlate metrics, logs, and traces by request ID
Create golden dashboards per service
Implement distributed tracing for complex flows (see the sketch below)
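
As a small illustration of distributed tracing, a service handler can wrap its work in an OpenTelemetry span and record the gateway-issued request ID as an attribute; the tracer name, span name, and attribute key here are assumptions:

    import io.opentelemetry.api.GlobalOpenTelemetry;
    import io.opentelemetry.api.trace.Span;
    import io.opentelemetry.api.trace.Tracer;
    import io.opentelemetry.context.Scope;

    public class CheckoutHandler {

        private final Tracer tracer = GlobalOpenTelemetry.getTracer("checkout-service");

        public void handle(String requestId) {
            Span span = tracer.spanBuilder("checkout.process").startSpan();
            // Record the gateway-issued request ID so traces, logs, and metrics can be correlated.
            span.setAttribute("request.id", requestId);
            try (Scope scope = span.makeCurrent()) {
                // ... business logic; child spans created here are linked automatically ...
            } catch (RuntimeException e) {
                span.recordException(e);
                throw e;
            } finally {
                span.end();
            }
        }
    }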

Final Thoughts

Resilience isn't a one-time setup; it's a continuous process and culture. By designing for failure, implementing layered defence strategies, and enforcing observability, you ensure your platform can thrive under stress.

Key Principles

  • Automate recovery
  • Detect failures early
  • Scale predictively
  • Document runbooks

Remember

Make resilience a first-class citizen in your architecture.

"Hope for the best, but architect for the worst."
