Building Resilient Systems: A Practical Guide to Designing APIs, Microservices, and Infrastructure for High Availability

In today's always-on digital landscape, system failures are inevitable. What separates good systems from great ones is resilience—the ability to recover gracefully from failures and continue operating. Whether you're dealing with an unresponsive external API, a message queue backlog, or a full-blown regional outage, your system should be designed to bounce back.
This article delves into building resilient systems, focusing on API Gateways, microservices, messaging systems, databases, and external services. We'll also explore how to architect for multi-AZ and multi-region failures, and how mechanisms such as HPA and VPA, together with SLAs and SLOs, play a crucial role.
The Core Components of a Resilient System
To design a resilient distributed system, focus on the following core components:
API Gateway
The entry point to your services; it should handle rate limiting, caching, authentication, and graceful error handling.
Microservices
Independent, loosely coupled services that can fail without bringing down the whole system.
Messaging (Event Bus)
Asynchronous communication between services using queues or topics. Enables decoupling and backpressure handling.
Databases
Backbone for persistence. Must support HA, replication, and backup strategies.
External Services
Third-party APIs are unpredictable dependencies; plan for slowness, errors, or outright unavailability.
Resilient System Architecture
Multi-Layer Defense Strategy
Comprehensive resilience across all system layers with failover capabilities
Circuit breakers, health checks, and automatic failover mechanisms
Redundancy at every layer: Load balancers, API gateways, services, and data
Comprehensive monitoring, tracing, and alerting for proactive issue detection
Common Failure Scenarios & Detailed Solutions
External Service Is Slow or Unavailable
Problem: Third-party APIs may experience downtime, slow responses, or rate limiting.
Recommendations:
Circuit Breaker Pattern
Utilise a library such as Resilience4j (the successor to Netflix Hystrix, which is now in maintenance mode). Set failure thresholds and fallback behaviours. This prevents retry storms and cascading failures.
Implementation: Open the circuit when the failure rate exceeds 50% over a 20-request window, and keep it open for 30 seconds before probing again
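A minimal sketch of that configuration with Resilience4j; the name "paymentApi" and the callExternalApi method are hypothetical placeholders:

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import java.time.Duration;
import java.util.function.Supplier;

public class PaymentClient {

    private final CircuitBreaker breaker = CircuitBreaker.of("paymentApi",
            CircuitBreakerConfig.custom()
                    .failureRateThreshold(50)                        // open when >= 50% of recent calls fail
                    .slidingWindowSize(20)                           // ...measured over the last 20 calls
                    .waitDurationInOpenState(Duration.ofSeconds(30)) // stay open for 30s before probing again
                    .build());

    public String charge() {
        // While the breaker is open, calls fail fast with CallNotPermittedException
        Supplier<String> guarded = CircuitBreaker.decorateSupplier(breaker, this::callExternalApi);
        return guarded.get();
    }

    private String callExternalApi() {
        return "OK"; // placeholder for the real third-party call
    }
}
```

When the breaker opens, callers get an immediate failure (which a fallback, covered below, can absorb) instead of piling up on a dependency that is already struggling.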
Timeouts
Always define timeouts (e.g., 2-3s max). Never rely on default indefinite waits.
Implementation: Set connection timeout: 1s, read timeout: 3s
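For example, with the JDK's built-in HttpClient (Java 11+) both timeouts are explicit; the URL here is illustrative:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class RatesClient {

    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(1))   // give up quickly if a connection can't be established
            .build();

    public String fetchRates() throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://api.example.com/rates"))
                .timeout(Duration.ofSeconds(3))      // overall per-request deadline, covering the response
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```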
Retries with Exponential Backoff
Retry 2-3 times with increasing delay (e.g., 100ms, 300ms, 900ms). Add jitter to avoid a thundering herd.
Implementation: Base delay: 100ms, multiplier: 3, max delay: 5s, jitter: ±20%
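A hedged sketch with Resilience4j's Retry module: IntervalFunction.ofExponentialRandomBackoff produces the 100ms → 300ms → 900ms progression with randomised jitter (with only three attempts the delays stay well under the 5s cap), and the supplier passed in stands for the real external call:

```java
import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

import java.util.function.Supplier;

public class RetryingClient {

    private final Retry retry = Retry.of("externalApi",
            RetryConfig.custom()
                    .maxAttempts(3)  // 3 total attempts, i.e. the initial call plus 2 retries
                    // base 100ms, multiplier 3, +/-20% jitter -> roughly 100ms, 300ms, 900ms between attempts
                    .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(100, 3.0, 0.2))
                    .build());

    public String call(Supplier<String> externalCall) {
        return Retry.decorateSupplier(retry, externalCall).get();
    }
}
```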
Fallbacks
Return cached or default data. For example, show 'temporarily unavailable' instead of blocking the UI.
Implementation: Cache last successful response for 5 minutes, serve stale data during outages
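In its simplest form, a fallback is just an exception handler that prefers stale data over an error. A plain-Java sketch; the provider call and cache key are hypothetical:

```java
import java.util.concurrent.ConcurrentHashMap;

public class QuoteService {

    private final ConcurrentHashMap<String, String> lastGood = new ConcurrentHashMap<>();

    public String getQuote(String symbol) {
        try {
            String fresh = fetchFromProvider(symbol);   // may time out or fail during an outage
            lastGood.put(symbol, fresh);                // remember the last successful response
            return fresh;
        } catch (Exception e) {
            // Serve stale data (or a friendly default) instead of propagating the failure to the UI
            return lastGood.getOrDefault(symbol, "temporarily unavailable");
        }
    }

    private String fetchFromProvider(String symbol) {
        throw new UnsupportedOperationException("placeholder for the real provider call");
    }
}
```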
Caching
Cache infrequently changing data (e.g., currency exchange rates) using Redis or CDN.
Implementation: TTL: 1 hour for reference data, 5 minutes for dynamic data
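A short Redis-backed sketch using the Jedis client; the host, key naming, and upstream fetch are assumptions:

```java
import redis.clients.jedis.Jedis;

public class FxRateCache {

    public String getRate(String pair) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {   // assumed local Redis instance
            String key = "fx:" + pair;
            String cached = jedis.get(key);
            if (cached != null) {
                return cached;                               // cache hit: no call to the provider
            }
            String fresh = fetchRateFromProvider(pair);      // hypothetical upstream call
            jedis.setex(key, 3600, fresh);                   // 1-hour TTL for reference data
            return fresh;
        }
    }

    private String fetchRateFromProvider(String pair) {
        return "1.0842"; // placeholder
    }
}
```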
Inter-service Communication Failure or Latency
Problem: REST or gRPC calls between services fail due to timeouts, overload, or downstream issues.
Recommendations:
Service Mesh (Istio, Linkerd)
Enforce retry policies, circuit breakers, mTLS, and observability uniformly across services.
Implementation: Configure Istio with 3 retries, 25ms base interval, circuit breaker at 5 consecutive failures
Bulkheads
Allocate resources (e.g., threads, memory) per service to prevent one from starving others.
Implementation: Separate thread pools: 50 threads for critical services, 20 for non-critical
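Resilience4j also ships a Bulkhead module, but the idea can be shown with plain JDK thread pools: each dependency class gets its own bounded pool, so a slow non-critical call cannot exhaust the threads reserved for critical work. The two service calls below are placeholders:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Bulkheads {

    // Separate, bounded pools: critical calls can never be starved by non-critical ones
    private final ExecutorService criticalPool = Executors.newFixedThreadPool(50);
    private final ExecutorService nonCriticalPool = Executors.newFixedThreadPool(20);

    public CompletableFuture<String> placeOrder() {
        return CompletableFuture.supplyAsync(this::callOrderService, criticalPool);
    }

    public CompletableFuture<String> recommendations() {
        return CompletableFuture.supplyAsync(this::callRecommendationService, nonCriticalPool);
    }

    private String callOrderService() { return "order-accepted"; }           // placeholder
    private String callRecommendationService() { return "recommendations"; } // placeholder
}
```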
Asynchronous Communication
Use Kafka or RabbitMQ for loosely coupled communication. Add retry logic at the consumer level.
Implementation: Kafka with 3 partitions, consumer groups with dead letter topic after 5 retries
Idempotent APIs
Design APIs to safely handle retries. Use request IDs to de-dupe operations.
Implementation: Include UUID in request header, store processed IDs in Redis for 24 hours
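One way to implement the de-duplication, sketched with Jedis; the key prefix and TTL are assumptions matching the guideline above:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class IdempotencyGuard {

    /**
     * Returns true only the first time a given request ID is seen.
     * SET ... NX EX 86400 succeeds only if the key does not already exist and expires it after 24 hours.
     */
    public boolean firstTime(Jedis jedis, String requestId) {
        String result = jedis.set("req:" + requestId, "1",
                SetParams.setParams().nx().ex(86400));
        return "OK".equals(result);   // null means this request ID was already processed
    }
}
```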
Messaging System Backlog or Downtime
Problem: Message queues get backlogged or fail due to slow consumers or broker issues.
Recommendations:
Consumer Auto-Scaling
Use HPA or KEDA to scale consumers based on queue length or lag.
Implementation: KEDA: Scale from 2 to 50 replicas when lag > 100 messages
Dead Letter Queue (DLQ)
Redirect unprocessable messages to a DLQ for later analysis.
Implementation: Move to DLQ after 3 processing attempts, retain for 7 days
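A hedged sketch of the consumer-side logic with the Kafka Java client: after three failed processing attempts the record is forwarded to a dead-letter topic. The topic names and the process method are assumptions; the 7-day retention is set on the DLQ topic itself via retention.ms:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DlqHandler {

    private static final int MAX_ATTEMPTS = 3;

    public void handle(ConsumerRecord<String, String> record, KafkaProducer<String, String> producer) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                process(record);                      // hypothetical business logic
                return;                               // success: nothing more to do
            } catch (Exception e) {
                if (attempt == MAX_ATTEMPTS) {
                    // Give up and park the message for later analysis instead of blocking the partition
                    producer.send(new ProducerRecord<>("orders.DLQ", record.key(), record.value()));
                }
            }
        }
    }

    private void process(ConsumerRecord<String, String> record) {
        // placeholder for the real processing step
    }
}
```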
Batching & Prefetching
Process messages in batches to increase throughput.
Implementation: Batch size: 100 messages, prefetch: 200, commit every 5 seconds
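With the Kafka Java client, the batch and commit settings above map roughly to consumer properties like these ("prefetch" in the RabbitMQ sense corresponds to channel basicQos; in Kafka the closest knobs are the fetch.* settings). The broker address and group ID are placeholders:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.Properties;

public class BatchConsumerFactory {

    public KafkaConsumer<String, String> create() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");      // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);                // up to 100 records per poll (the batch)
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 50_000);              // let the broker accumulate data before responding
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, true);
        props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, 5_000);       // commit offsets roughly every 5 seconds
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        return new KafkaConsumer<>(props);
    }
}
```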
Monitoring
Track message lag, consumer throughput, and DLQ volume.
Implementation: Alert when lag > 1000 messages or DLQ growth > 10 messages/minute
Database Outage or Performance Degradation
Problem: Databases are often single points of failure and a common source of performance bottlenecks.
Recommendations:
Read Replicas
Offload reads from the primary. Use read-write splitting via proxies like ProxySQL.
Implementation: 3 read replicas across AZs, route 80% reads to replicas
Failover Mechanisms
Use managed services with auto-failover (e.g., AWS RDS Multi-AZ).
Implementation: RDS Multi-AZ with 1-2 minute automatic failover
Connection Pooling
Avoid connection exhaustion. Tune HikariCP or pgBouncer.
Implementation: Pool size: 20-30 connections per service instance
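A HikariCP configuration matching the 20-30 connection guideline; the JDBC URL and credentials are placeholders:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class DataSourceFactory {

    public HikariDataSource create() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://orders-db.internal:5432/orders"); // placeholder URL
        config.setUsername("app");
        config.setPassword(System.getenv("DB_PASSWORD"));
        config.setMaximumPoolSize(25);        // within the 20-30 connections per instance guideline
        config.setMinimumIdle(5);
        config.setConnectionTimeout(2_000);   // fail fast (ms) instead of queueing callers indefinitely
        config.setMaxLifetime(30 * 60_000L);  // recycle connections before server-side idle timeouts
        return new HikariDataSource(config);
    }
}
```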
Data Partitioning
Use sharding for horizontal scaling. Maintain idempotent writes.
Implementation: Shard by user_id hash, 4 shards initially, plan for 16 shards
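The routing itself can be as simple as a stable hash of the shard key; a hypothetical helper is sketched below. Note that a plain modulo forces large-scale data movement when the shard count changes, which is why consistent hashing or a lookup table is usually preferred for the 4-to-16 growth path.

```java
public final class ShardRouter {

    private ShardRouter() {}

    /** Maps a user ID to one of shardCount shards; the same user always lands on the same shard. */
    public static int shardFor(String userId, int shardCount) {
        return Math.floorMod(userId.hashCode(), shardCount);
    }

    public static void main(String[] args) {
        System.out.println(shardFor("user-42", 4));   // deterministic shard index between 0 and 3
    }
}
```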
API Gateway Outage or Bottleneck
Problem: If the API Gateway fails, all downstream services become inaccessible.
Recommendations:
HA Deployment
Deploy multiple replicas across AZs.
Implementation: 3 replicas minimum, spread across 3 AZs
Global Load Balancer
Use Route 53 + CloudFront for geo-aware routing and failover.
Implementation: Health checks every 30 seconds, failover within 60 seconds
Rate Limiting & Quotas
Prevent abuse and resource starvation.
Implementation: 1000 requests/minute per API key, burst limit: 100 requests/second
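A per-API-key limiter sketched with Resilience4j's RateLimiter, which enforces the 1000 requests/minute quota; burst smoothing on top of that usually calls for a token-bucket library such as Bucket4j. The limiter name and handler are placeholders:

```java
import io.github.resilience4j.ratelimiter.RateLimiter;
import io.github.resilience4j.ratelimiter.RateLimiterConfig;

import java.time.Duration;
import java.util.function.Supplier;

public class ApiKeyLimiter {

    private final RateLimiter limiter = RateLimiter.of("api-key-1234",   // one limiter per API key (hypothetical key)
            RateLimiterConfig.custom()
                    .limitForPeriod(1000)                       // 1000 permits...
                    .limitRefreshPeriod(Duration.ofMinutes(1))  // ...per minute
                    .timeoutDuration(Duration.ZERO)             // reject immediately rather than queueing callers
                    .build());

    public String handle(Supplier<String> handler) {
        // Throws RequestNotPermitted once the quota for the current period is exhausted
        return RateLimiter.decorateSupplier(limiter, handler).get();
    }
}
```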
Request Tracing
Correlate requests across services using x-request-id.
Implementation: Generate UUID at gateway, propagate through all service calls
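A minimal servlet filter sketch (Jakarta Servlet API assumed) that reuses an incoming x-request-id or mints a new UUID; downstream HTTP clients are expected to copy the header onto their own calls so traces can be stitched together:

```java
import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;

import java.io.IOException;
import java.util.UUID;

public class RequestIdFilter implements Filter {

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        String requestId = request.getHeader("x-request-id");     // honour an ID already set by the gateway
        if (requestId == null || requestId.isBlank()) {
            requestId = UUID.randomUUID().toString();              // otherwise mint one at the edge
        }
        ((HttpServletResponse) res).setHeader("x-request-id", requestId);
        chain.doFilter(req, res);                                  // propagate to the rest of the chain
    }
}
```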
Resilience Across Availability Zones and Regions
Multi-AZ Strategy
Distribute pods and nodes across AZs
Ensure workloads span multiple availability zones
Stateful workloads support AZ-aware replication
Databases and queues replicated across zones
Load Balancers route across AZs
Use readiness probes to keep unhealthy pods out of service traffic
Multi-Region Strategy
Active-Passive Setup
Replicate data to a standby region and use DNS failover with health checks
Active-Active Setup
Use globally distributed services (e.g., DynamoDB Global Tables)
Data Replication
Use async replication for performance, and weigh the consistency trade-offs
CI/CD
Deploy to both regions using GitOps and promote when healthy
Operational Resilience: SLAs, SLOs, SLIs
Service Level Agreement
What you promise
99.9% uptime = < 43.8 mins downtime/month
Service Level Objective
What you target
99.95% uptime = < 21.9 mins downtime/month
Service Level Indicator
Metrics you measure
error rate, availability, latency
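As a quick sanity check on those figures: downtime budget = (1 − availability) × minutes in a month. Using an average month of 30.44 days, 99.9% allows (1 − 0.999) × 30.44 × 24 × 60 ≈ 43.8 minutes of downtime, while 99.95% leaves roughly 21.9 minutes.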
Recommendations:
Use tools like SLO Tracker, Nobl9, or custom Prometheus queries
Tie alerts to SLOs, not system metrics (avoid alert fatigue)
Kubernetes Native Tools for Resilience
Horizontal Pod Autoscaler (HPA)
Scales the number of pod replicas out and in based on CPU, memory, or custom metrics.
Vertical Pod Autoscaler (VPA)
Adjusts pod CPU and memory requests to match observed usage.
Pod Disruption Budgets (PDB)
Limit how many replicas can be taken down at once during voluntary disruptions such as node drains and upgrades.
Liveness & Readiness Probes
Restart hung containers and keep not-yet-ready pods out of service traffic.
Observability & Troubleshooting
What to Monitor
API Gateway
Latency, 5xx rates
Microservices
Error rates, CPU/memory
Queues
Lag, DLQ volumes
Database
Slow queries, connection counts
Recommended Tools
Metrics
Prometheus + Grafana
Tracing
OpenTelemetry + Jaeger or Tempo
Logs
Fluent Bit, EFK, or Loki
Alerting
Alertmanager, PagerDuty, OpsGenie
Final Thoughts
Resilience isn't a one-time setup; it's a continuous process and culture. By designing for failure, implementing layered defence strategies, and building in observability, you ensure your platform can thrive under stress.
Key Principles
- Automate recovery
- Detect failures early
- Scale predictively
- Document runbooks
Remember
Make resilience a first-class citizen in your architecture.
"Hope for the best, but architect for the worst."