System Design Cheat Sheet

Your comprehensive guide to scalable system architecture

🎯 Non-Functional Requirements

Availability & Fault Handling

  • Availability: % uptime. Use load balancer, replication
  • Fault Tolerance: System keeps working despite failures
  • Resiliency: Recovery from failure (retries, circuit breaker)
  • Durability: Data persists post-crash (S3, WAL)
  • Disaster Recovery: Backups, multi-region failover
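
The uptime percentages above map to concrete downtime budgets, and replication changes the math. A minimal sketch, assuming component failures are independent (rarely true in practice); the function names are hypothetical:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year for an availability fraction such as 0.999."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial(*availabilities: float) -> float:
    """Components in a chain: all must be up, so availabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities: float) -> float:
    """Redundant replicas: the system is down only if every replica is down."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down
```

Under these assumptions, 99.9% allows about 526 minutes of downtime per year, two 99% services in a chain give about 98%, and two 99% replicas behind a load balancer give about 99.99%.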

Performance & Scale

  • Scalability: Can grow (horizontal, vertical, sharding)
  • Throughput: Requests/sec or data volume/sec
  • Latency: Response time (P99 focus)
  • Efficiency: Optimal resource usage
  • Realtime: Sub-second processing

Data & Security

  • Consistency: Data sync across nodes (strong/eventual)
  • Auditability: Trace actions (logs)
  • Security: TLS, AuthZ/AuthN, WAF
  • Privacy/Compliance: GDPR, HIPAA

🧩 Key Components & Building Blocks

Entry Points

  • DNS
  • CDN
  • Load Balancer
  • API Gateway

Processing

  • App Servers
  • Queues
  • Rate Limiters
  • Circuit Breakers

Data Layer

  • Databases (SQL/NoSQL)
  • Cache
  • Object Storage
  • Search Index

Techniques

  • Sharding
  • Replication
  • Consistent Hashing
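
Of the techniques listed, consistent hashing is the least obvious from the name alone. A minimal sketch of a hash ring with virtual nodes, assuming SHA-256 as the hash and 100 virtual nodes per server; `HashRing` and its methods are hypothetical names, not a specific library's API:

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring: each node owns many points (virtual nodes) on a ring."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._keys = []    # sorted hash positions on the ring
        self._owner = {}   # hash position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def add(self, node: str) -> None:
        for i in range(self.vnodes):
            h = self._hash(f"{node}#{i}")
            if h in self._owner:      # extremely unlikely collision: keep first owner
                continue
            self._owner[h] = node
            bisect.insort(self._keys, h)

    def remove(self, node: str) -> None:
        for i in range(self.vnodes):
            h = self._hash(f"{node}#{i}")
            if self._owner.get(h) == node:
                del self._owner[h]
                self._keys.remove(h)

    def lookup(self, key: str) -> str:
        """Return the node owning the first ring position after the key's hash."""
        if not self._keys:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._keys, self._hash(key)) % len(self._keys)
        return self._owner[self._keys[idx]]
```

The property worth stating in a design discussion: when a node is added, the only keys that move are the ones the new node takes over; all other keys stay where they were.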

🛠️ NFR to Tools Mapping

NFR              Tools/Patterns
Availability     Load balancer, replicas
Fault Tolerance  Retry, circuit breaker
Scalability      Sharding, queues, stateless
Latency          CDN, Redis
Throughput       Batch, async queue
Realtime         WebSocket, SSE
Durability       S3, WAL, replicated DB
Consistency      Leader election, quorum
Auditability     Logs, append-only stores
Security         AuthZ/AuthN, TLS, WAF
Efficiency       Caching, deduplication

📋 GANE-F Design Framework

  • G (Goal): What is the problem or use case?
  • A (Assumptions): RPS, scale, read vs write, PII?
  • N (NFRs): List the top 3-5 for the case
  • E (Explain design): Use 4-6 core components
  • F (Focus/Tradeoffs): Latency vs durability, etc.
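
The Assumptions step usually turns into a quick back-of-envelope estimate. A minimal sketch of that arithmetic; `estimate_capacity`, its parameters, and the default peak factor of 2 are illustrative assumptions, not fixed rules:

```python
def estimate_capacity(daily_active_users, requests_per_user_per_day,
                      write_fraction, bytes_per_write,
                      peak_factor=2.0, retention_days=365):
    """Rough traffic and storage numbers from a handful of stated assumptions."""
    seconds_per_day = 86_400
    total_requests = daily_active_users * requests_per_user_per_day
    avg_rps = total_requests / seconds_per_day
    writes_per_day = total_requests * write_fraction
    storage_bytes = writes_per_day * bytes_per_write * retention_days
    return {
        "avg_rps": avg_rps,
        "peak_rps": avg_rps * peak_factor,          # assumed peak-to-average ratio
        "write_rps": avg_rps * write_fraction,
        "read_rps": avg_rps * (1 - write_fraction),
        "storage_tb": storage_bytes / 1e12,         # decimal terabytes
    }
```

For example, 864,000 daily users making 100 requests each works out to an average of 1,000 RPS, which is enough to decide whether the design needs sharding or just a cache.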

❓ Clarifying Questions Framework

  • Who are the users and usage volume?
  • Real-time or async?
  • Data types and sizes?
  • Read-heavy or write-heavy?
  • Regional or global? Compliance?
  • Can it fail? What happens?

🔄 Common Request Flows

Standard Web Flow

Client → CDN → Load Balancer → API Gateway → App Server → Cache/DB

Async Processing Flow

Client → API → Queue → Worker → DB → Notification Service
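
The async flow above can be sketched in-process. This is a toy model, assuming a `queue.Queue` stands in for the message broker and plain lists stand in for the DB; `run_async_pipeline` and `handler` are hypothetical names:

```python
import queue
import threading


def run_async_pipeline(jobs, handler, num_workers=2):
    """Enqueue jobs, let worker threads drain the queue, return (results, failed)."""
    tasks = queue.Queue()
    results, failed = [], []      # stand-ins for the DB and a dead-letter store
    lock = threading.Lock()

    def worker():
        while True:
            job = tasks.get()
            try:
                if job is None:   # sentinel: no more work for this worker
                    return
                try:
                    out = handler(job)
                except Exception:
                    with lock:
                        failed.append(job)   # a real system would retry or dead-letter
                else:
                    with lock:
                        results.append(out)
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for job in jobs:              # the "API" only enqueues, so it can return quickly
        tasks.put(job)
    for _ in threads:
        tasks.put(None)
    tasks.join()
    for t in threads:
        t.join()
    return results, failed
```

Results arrive in completion order, not submission order, which is one of the trade-offs to call out when choosing this flow.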

Event-Driven/Microservices

Service A → Event Bus → Service B/C/D → Message Queue → Downstream

Data Pipeline Flow

Data Source → Stream (Kafka) → ETL → Data Lake → Analytics/ML

Mobile/IoT Flow

Device → Gateway → Queue → Stream Processor → Time Series DB

💡 Buzzwords & Smart Phrases

  • CAP Theorem
  • Consistency vs Availability
  • Stateless
  • Idempotent
  • Backpressure
  • Fan-out/Fan-in
  • CQRS
  • Fail-fast
  • Fail-safe
  • Fail-open

Quick Decisions

  • Read-heavy → Cache (Redis)
  • Write-heavy → WAL + Strong DB
  • Blob Storage → S3 + CDN
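
The read-heavy decision usually means the cache-aside pattern. A minimal sketch, assuming an in-memory dict stands in for Redis and `load_from_db` is whatever function reads the source of truth; `CacheAside` is a hypothetical name:

```python
import time


class CacheAside:
    """Read-through-on-miss cache with a TTL; the dict is a stand-in for Redis."""

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}   # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]                        # cache hit
        value = self._load(key)                    # miss or expired: read the database
        self._store[key] = (value, self._clock() + self._ttl)
        return value

    def invalidate(self, key):
        """Call after a write so the next read reloads fresh data."""
        self._store.pop(key, None)
```

The TTL bounds how stale a read can be; invalidating on write tightens that at the cost of extra coordination.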

🎨 Common Reusable Patterns

Use Case           Solution
Read-heavy         Redis cache, replicas
Write-heavy        Queues, WAL, partitions
Global users       CDN, geo-replication
Media files        S3 + CDN
PII/Compliance     Encryption, access logs
Async processing   Queue + background workers
High availability  Multi-AZ, health checks
Real-time updates  WebSocket, Server-Sent Events
Batch processing   Kafka + Spark/Flink
Search             Elasticsearch, inverted index

🗄️ Database Patterns & Data Modeling

Database Selection

  • RDBMS: ACID, complex queries, transactions
  • NoSQL Document: MongoDB, flexible schema
  • Key-Value: Redis, DynamoDB, simple lookups
  • Column-Family: Cassandra, time-series data
  • Graph: Neo4j, relationships/networks
  • Time-Series: InfluxDB, metrics/monitoring

Scaling Patterns

  • Read Replicas: Scale reads, eventual consistency
  • Sharding: Horizontal partitioning by key
  • Federation: Split DBs by feature/service
  • Denormalization: Trade space for speed
  • CQRS: Separate read/write models
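
Sharding by key comes down to a stable routing function. A minimal sketch, assuming simple hash-modulo placement; `shard_for` and `fraction_moved` are hypothetical helpers. It uses `hashlib` rather than Python's built-in `hash()`, which is randomized per process for strings:

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index that is stable across processes and restarts."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def fraction_moved(keys, old_shards: int, new_shards: int) -> float:
    """Share of keys whose shard changes when the shard count changes."""
    moved = sum(1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards))
    return moved / len(keys)
```

With modulo placement, going from 4 to 5 shards relocates roughly 80% of keys, which is the usual argument for consistent hashing or a directory-based mapping when resharding is expected.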

🌐 Distributed Systems Essentials

Core Concepts

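  • CAP Theorem: During a network partition, a system must give up either consistency or availability
  • ACID vs BASE: Strong transactional guarantees vs eventual consistency
  • Consensus: Raft, Paxos for leader election and replicated logs
  • Vector Clocks: Track causality to partially order events across nodes
  • Bloom Filters: Probabilistic set membership (false positives possible, no false negatives)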
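
A Bloom filter is small enough to sketch in full. This is a minimal version, assuming SHA-256 with a per-hash prefix as the hash family and fixed sizes; `BloomFilter` and its parameters are hypothetical, and real deployments size the bit array from the expected item count and target false-positive rate:

```python
import hashlib


class BloomFilter:
    """Set membership test: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self._bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by prefixing the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self._bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

A typical use is skipping an expensive disk or network lookup when `might_contain` returns False.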

Failure Patterns

  • Circuit Breaker: Prevent cascading failures
  • Bulkhead: Isolate critical resources
  • Timeout & Retry: Exponential backoff
  • Graceful Degradation: Reduce functionality
  • Health Checks: Liveness & readiness probes
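
Two of the failure patterns above fit in a few lines. A minimal sketch, assuming a consecutive-failure threshold and a fixed reset timeout; `CircuitBreaker`, `CircuitOpenError`, and `backoff_delays` are hypothetical names, and production retries normally add random jitter to the delays:

```python
import time


def backoff_delays(retries, base=0.1, cap=5.0):
    """Exponential backoff schedule in seconds: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * 2 ** i) for i in range(retries)]


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected without trying."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None    # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()   # open, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None                    # success closes the circuit
        return result
```

While the circuit is open, callers fail immediately instead of piling up on a dependency that is already struggling, which is what stops the cascade.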

📊 Performance & Monitoring

Metrics & SLOs

  • Golden Signals: Latency, Traffic, Errors, Saturation
  • SLI/SLO/SLA: Service Level Indicators/Objectives/Agreements
  • Error Budgets: Allowed unreliability (1 - SLO), e.g. a 99.9% SLO leaves a 0.1% budget
  • Percentiles: P50, P95, P99 for latency
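
Percentiles and error budgets are both simple arithmetic. A minimal sketch, assuming the nearest-rank percentile definition (monitoring systems often interpolate or use histograms instead); `percentile` and `error_budget` are hypothetical helper names:

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def error_budget(slo: float, total_requests: int) -> float:
    """Requests that may fail in the window while still meeting the SLO."""
    return (1.0 - slo) * total_requests
```

With a 99.9% SLO over one million requests, about 1,000 may fail before the budget is spent.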

Optimization Techniques

  • Caching Layers: Browser → CDN → App → DB
  • Connection Pooling: Reuse DB connections
  • Batch Processing: Reduce network calls
  • Compression: Gzip, Brotli for data transfer
  • Indexing: B-tree, LSM-tree strategies

🔐 Security & Compliance Patterns

Authentication & Authorization

  • OAuth 2.0/OpenID Connect: Delegated authorization, plus an identity layer for login
  • JWT: Stateless tokens
  • RBAC: Role-based access control
  • API Keys: Service-to-service auth
  • mTLS: Mutual TLS for microservices

Data Protection

  • Encryption: At rest (AES-256), in transit (TLS 1.3)
  • PII Handling: Tokenization, hashing, masking
  • Compliance: GDPR, HIPAA, SOC2
  • Rate Limiting: Prevent abuse, DDoS protection
  • WAF: Web Application Firewall
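
Rate limiting is most often explained with a token bucket. A minimal single-process sketch, assuming one bucket per client held in memory; `TokenBucket` is a hypothetical name, and a distributed limiter would keep this state in a shared store:

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)   # start full
        self._last = clock()

    def allow(self, cost=1.0) -> bool:
        now = self._clock()
        # Refill for the elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False                     # caller would typically respond with HTTP 429
```

`rate` sets the sustained request rate and `capacity` sets how large a burst is tolerated.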

🎯 Staff Engineer Rapid-Fire Answers

Scale Numbers (Memorize These)

  • L1 cache: ~1ns | RAM: ~100ns | SSD random read: 10-100µs
  • Same-datacenter round trip: ~1ms | Disk seek: ~10ms | Cross-region internet round trip: ~100ms
  • Read 1MB sequentially: Memory 250µs, SSD 1ms, Disk 20ms
  • Typical Web: 1K-10K RPS | High Scale: 100K+ RPS
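
These numbers are most useful when scaled up. A minimal sketch that extrapolates the per-1MB sequential read figures from the list above, assuming reads scale linearly with size (real hardware varies widely); the names are hypothetical:

```python
# Seconds to read 1 MB sequentially, taken from the figures listed above.
READ_1MB_SECONDS = {"memory": 250e-6, "ssd": 1e-3, "disk": 20e-3}


def sequential_read_seconds(size_mb: float, medium: str) -> float:
    """Estimated time to read size_mb megabytes sequentially from the given medium."""
    return size_mb * READ_1MB_SECONDS[medium]
```

Scanning 1 GB (1,000 MB) comes out to about 0.25 s from memory, 1 s from SSD, and 20 s from disk, which is usually enough to justify or rule out a full scan in a design.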

Quick Decision Tree

  • < 1GB data: Single DB | < 1TB: Sharding/Replicas
  • < 1K RPS: Monolith | < 10K: Microservices
  • Global: CDN + Multi-region | Real-time: WebSocket/SSE
  • Analytics: Data Lake | ML: Feature Store

Trade-off Questions

  • Consistency vs Availability: During a partition, strong consistency means rejecting some requests (lower availability)
  • Latency vs Throughput: Caching improves latency, batching improves throughput
  • Space vs Time: Denormalization trades space for query speed
  • Reliability vs Cost: More redundancy = higher cost

📚 Quick Reference & Resources

Essential Resources

  • Books: "Designing Data-Intensive Applications" (Kleppmann)
  • Papers: MapReduce, Dynamo, Bigtable, Raft
  • Blogs: High Scalability, AWS Architecture Center
  • Tools: Draw.io for diagrams, Figma for UI flows

Company Tech Stacks (Examples)

  • Netflix: Microservices, Cassandra, Kafka, AWS
  • Uber: Go/Java, Kafka, Cassandra, Real-time ML
  • Airbnb: Rails, MySQL, Redis, Kafka, Kubernetes
  • Pinterest: Python, MySQL, HBase, Kafka, Kubernetes