Non-Functional Requirements
Availability & Fault Handling
- Availability: Percentage of time the system is up (e.g., 99.9%). Use load balancers and replication
- Fault Tolerance: System keeps working despite failures
- Resiliency: Recovery from failure (retries, circuit breaker)
- Durability: Data persists post-crash (S3, WAL)
- Disaster Recovery: Backups, multi-region failover
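The "retries" part of resiliency can be sketched as exponential backoff with jitter. This is a minimal illustration, not a definitive implementation: `retry_with_backoff` and its parameters are hypothetical names, and the delays are arbitrary defaults.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep):
    """Call func(); on an exception, wait and retry with exponential backoff.

    Hypothetical helper for illustration. `sleep` is injectable so the
    behavior can be exercised without real waiting.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay cap doubles each attempt (0.1, 0.2, 0.4, ...) up to max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: a random delay in [0, cap] avoids synchronized retries.
            sleep(random.uniform(0, cap))
```

Retrying is only safe when the operation is idempotent; otherwise a retry can apply the same side effect twice.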
Performance & Scale
- Scalability: Can grow (horizontal, vertical, sharding)
- Throughput: Requests/sec or data volume/sec
- Latency: Response time (P99 focus)
- Efficiency: Optimal resource usage
- Realtime: Sub-second processing
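Why the P99 focus: a mean hides tail latency. The sketch below uses the nearest-rank percentile definition (one of several in use) on made-up sample data; `percentile` is a hypothetical helper.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in milliseconds: mostly fast, two slow outliers.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 13, 900]

p50 = percentile(latencies_ms, 50)            # 13
p99 = percentile(latencies_ms, 99)            # 900
mean = sum(latencies_ms) / len(latencies_ms)  # 125.6
```

The median says the typical request is fast, the mean is distorted by two outliers, and P99 shows what the slowest users actually experience.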
Data & Security
- Consistency: Data sync across nodes (strong/eventual)
- Auditability: Trace actions (logs)
- Security: TLS, AuthZ/AuthN, WAF
- Privacy/Compliance: GDPR, HIPAA
Key Components & Building Blocks
Entry Points
- DNS
- CDN
- Load Balancer
- API Gateway
Processing
- App Servers
- Queues
- Rate Limiters
- Circuit Breakers
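A rate limiter is often built as a token bucket. The sketch below is one common variant under simple assumptions (single process, no locking); `TokenBucket` and its parameters are hypothetical names.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock            # injectable for testing
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost=1):
        now = self.clock()
        # Refill based on elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                  # caller would typically respond with HTTP 429
```

In a multi-instance deployment the counter usually lives in a shared store (for example Redis) rather than in process memory.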
Data Layer
- Databases (SQL/NoSQL)
- Cache
- Object Storage
- Search Index
Techniques
- Sharding
- Replication
- Consistent Hashing
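Consistent hashing can be sketched as a sorted ring of hashed points with virtual nodes. This is an illustrative version only: `HashRing`, the choice of SHA-256, and 100 virtual nodes per server are assumptions, not recommendations.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring with virtual nodes (illustrative, not production-tuned)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._points = []   # sorted (hash, node) pairs
        self._hashes = []   # parallel list of hashes, used for bisect
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def _rebuild(self, points):
        self._points = sorted(points)
        self._hashes = [h for h, _ in self._points]

    def add(self, node):
        new = [(self._hash(f"{node}#{i}"), node) for i in range(self.vnodes)]
        self._rebuild(self._points + new)

    def remove(self, node):
        self._rebuild([(h, n) for h, n in self._points if n != node])

    def get(self, key):
        if not self._points:
            raise LookupError("ring is empty")
        # First point clockwise from the key's hash, wrapping around at the end.
        i = bisect.bisect(self._hashes, self._hash(key)) % len(self._points)
        return self._points[i][1]
```

The point of the ring: when a node is removed, only the keys that lived on that node move. With plain `hash(key) % N` sharding, changing `N` remaps most keys.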
NFR to Tools Mapping
| NFR | Tools/Patterns |
|---|---|
| Availability | Load balancer, replicas |
| Fault Tolerance | Retry, circuit breaker |
| Scalability | Sharding, queues, stateless |
| Latency | CDN, Redis |
| Throughput | Batch, async queue |
| Realtime | WebSocket, SSE |
| Durability | S3, WAL, replicated DB |
| Consistency | Leader-election, quorum |
| Auditability | Logs, append-only stores |
| Security | AuthZ/AuthN, TLS, WAF |
| Efficiency | Caching, deduplication |
GANE-F Design Framework
- G (Goal): What is the problem or use case?
- A (Assumptions): RPS, scale, read vs write, PII?
- N (NFRs): List the top 3-5 for the case
- E (Explain design): Use 4-6 core components
- F (Focus/Tradeoffs): Latency vs durability, etc.
Clarifying Questions Framework
- Who are the users and usage volume?
- Real-time or async?
- Data types and sizes?
- Read-heavy or write-heavy?
- Regional or global? Compliance?
- Can it fail? What happens?
Common Request Flows
Standard Web Flow
Client → CDN → Load Balancer → API Gateway → App Server → Cache/DB
Async Processing Flow
Client → API → Queue → Worker → DB → Notification Service
Event-Driven/Microservices
Service A → Event Bus → Service B/C/D → Message Queue → Downstream
Data Pipeline Flow
Data Source → Stream (Kafka) → ETL → Data Lake → Analytics/ML
Mobile/IoT Flow
Device → Gateway → Queue → Stream Processor → Time Series DB
Buzzwords & Smart Phrases
- CAP Theorem
- Consistency vs Availability
- Stateless
- Idempotent
- Backpressure
- Fan-out/Fan-in
- CQRS
- Fail-fast
- Fail-safe
- Fail-open
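"Idempotent" is easiest to explain with an idempotency key: the client sends the same key on every retry, and the server replays the stored result instead of repeating the side effect. A minimal in-memory sketch; `PaymentService` and its fields are hypothetical.

```python
class PaymentService:
    """Toy service showing idempotent writes via a client-supplied key."""

    def __init__(self):
        self._results = {}   # idempotency_key -> stored response
        self.charges = 0     # counts how many real charges were made

    def charge(self, idempotency_key, amount_cents):
        # A retried request with the same key returns the original result.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges += 1
        result = {"charge_id": self.charges, "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result
```

A real service would persist the key with an expiry and handle concurrent requests carrying the same key; both are left out here.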
Quick Decisions
- Read-heavy → Cache (Redis)
- Write-heavy → WAL + Strong DB
- Blob Storage → S3 + CDN
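For the read-heavy case, the usual pattern is cache-aside: read the cache first, fall back to the database on a miss, then populate the cache with a TTL. In this sketch a dict stands in for Redis, and `CacheAside` and `load_from_db` are hypothetical names.

```python
import time


class CacheAside:
    """Cache-aside reads with a TTL; a dict stands in for Redis."""

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db     # function that reads the source of truth
        self._ttl = ttl_seconds
        self._clock = clock           # injectable for testing
        self._cache = {}              # key -> (expires_at, value)

    def get(self, key):
        hit = self._cache.get(key)
        if hit is not None and hit[0] > self._clock():
            return hit[1]             # cache hit, still fresh
        value = self._load(key)       # miss or expired: go to the database
        self._cache[key] = (self._clock() + self._ttl, value)
        return value

    def invalidate(self, key):
        """Call after a write so the next read reloads from the database."""
        self._cache.pop(key, None)
```

The trade-off is staleness: readers may see data up to one TTL old unless writes invalidate the key.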
Common Reusable Patterns
| Use Case | Solution |
|---|---|
| Read-heavy | Redis cache, replicas |
| Write-heavy | Queues, WAL, partitions |
| Global users | CDN, geo-replication |
| Media files | S3 + CDN |
| PII/Compliance | Encryption, access logs |
| Async processing | Queue + background workers |
| High availability | Multi-AZ, health checks |
| Real-time updates | WebSocket, Server-Sent Events |
| Batch processing | Kafka + Spark/Flink |
| Search | Elasticsearch, inverted index |
Database Patterns & Data Modeling
Database Selection
- RDBMS: ACID, complex queries, transactions
- NoSQL Document: MongoDB, flexible schema
- Key-Value: Redis, DynamoDB, simple lookups
- Column-Family: Cassandra, time-series data
- Graph: Neo4j, relationships/networks
- Time-Series: InfluxDB, metrics/monitoring
Scaling Patterns
- Read Replicas: Scale reads, eventual consistency
- Sharding: Horizontal partitioning by key
- Federation: Split DBs by feature/service
- Denormalization: Trade space for speed
- CQRS: Separate read/write models
Distributed Systems Essentials
Core Concepts
- CAP Theorem: Consistency, Availability, Partition tolerance; during a partition, choose consistency or availability
- ACID vs BASE: Strong vs eventual consistency
- Consensus: Raft, Paxos for leader election
- Vector Clocks: Track causal (partial) ordering of events across nodes
- Bloom Filters: Probabilistic set membership; false positives possible, no false negatives
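A Bloom filter can be sketched in a few lines. The sizes and the salted SHA-256 hashing below are arbitrary choices for illustration; `BloomFilter` is a hypothetical name.

```python
import hashlib


class BloomFilter:
    """Set-membership sketch: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0   # a Python int used as a bit array

    def _positions(self, item):
        # Derive k bit positions by salting the hash input with the index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))
```

Typical use: check the filter before an expensive disk or network lookup, and skip the lookup when the answer is "definitely absent".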
Failure Patterns
- Circuit Breaker: Prevent cascading failures
- Bulkhead: Isolate critical resources
- Timeout & Retry: Exponential backoff
- Graceful Degradation: Reduce functionality
- Health Checks: Liveness & readiness probes
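The circuit breaker above can be sketched as a small state machine: closed (calls pass), open (calls fail fast), and half-open (one trial call after a cooldown). The thresholds and the names `CircuitBreaker` and `CircuitOpenError` are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without running."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock        # injectable for testing
        self.failures = 0
        self.opened_at = None     # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Cooldown elapsed: half-open, let this one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast protects the caller's threads and connection pool while the downstream service recovers, which is how the pattern prevents cascading failures.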
Performance & Monitoring
Metrics & SLOs
- Golden Signals: Latency, Traffic, Errors, Saturation
- SLI/SLO/SLA: Service Level Indicators/Objectives/Agreements
- Error Budgets: Acceptable failure rate
- Percentiles: P50, P95, P99 for latency
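An error budget follows directly from the SLO: the budget is the fraction of the window in which the service is allowed to fail. A small worked calculation, assuming a 30-day window; `error_budget_minutes` is a hypothetical helper.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of allowed unavailability in the window for a given availability SLO."""
    budget_fraction = 1 - slo_percent / 100
    return budget_fraction * window_days * 24 * 60


three_nines = error_budget_minutes(99.9)    # about 43.2 minutes per 30 days
four_nines = error_budget_minutes(99.99)    # about 4.32 minutes per 30 days
```

Each extra nine cuts the budget by a factor of ten, which is why the choice of SLO drives redundancy and cost.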
Optimization Techniques
- Caching Layers: Browser → CDN → App → DB
- Connection Pooling: Reuse DB connections
- Batch Processing: Reduce network calls
- Compression: Gzip, Brotli for data transfer
- Indexing: B-tree, LSM-tree strategies
Security & Compliance Patterns
Authentication & Authorization
- OAuth 2.0/OpenID Connect: Delegated authorization (OAuth 2.0) plus an identity layer for authentication (OIDC)
- JWT: Stateless tokens
- RBAC: Role-based access control
- API Keys: Service-to-service auth
- mTLS: Mutual TLS for microservices
Data Protection
- Encryption: At rest (AES-256), in transit (TLS 1.3)
- PII Handling: Tokenization, hashing, masking
- Compliance: GDPR, HIPAA, SOC2
- Rate Limiting: Prevent abuse, DDoS protection
- WAF: Web Application Firewall
Staff Engineer Rapid-Fire Answers
Scale Numbers (Memorize These)
- L1 Cache: 1ns | RAM: 100ns | SSD: 10-100μs
- Network: 1ms | Disk: 10ms | Internet: 100ms
- Read 1MB: Memory 250μs, SSD 1ms, Disk 20ms
- Typical Web: 1K-10K RPS | High Scale: 100K+ RPS
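To place a system on the RPS scale above, a back-of-envelope estimate from daily usage is usually enough. The user counts and the peak factor of 3 below are made-up assumptions for illustration.

```python
SECONDS_PER_DAY = 86_400


def average_rps(daily_active_users, requests_per_user_per_day):
    """Average requests per second, spread evenly over a day."""
    return daily_active_users * requests_per_user_per_day / SECONDS_PER_DAY


def peak_rps(avg_rps, peak_factor=3.0):
    """Rough peak estimate; the peak factor is an assumption, not a measurement."""
    return avg_rps * peak_factor


# Example: 10M daily users making 20 requests each.
avg = average_rps(10_000_000, 20)   # about 2,315 RPS
peak = peak_rps(avg)                # about 6,944 RPS
```

That example lands in the "Typical Web" range on average and near its upper end at peak, which is the kind of statement an interviewer expects from the Assumptions step.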
Quick Decision Tree
- < 1GB data: Single DB | < 1TB: Sharding/Replicas
- < 1K RPS: Monolith | < 10K: Microservices
- Global: CDN + Multi-region | Real-time: WebSocket/SSE
- Analytics: Data Lake | ML: Feature Store
Trade-off Questions
- Consistency vs Availability: During a network partition, strong consistency costs availability
- Latency vs Throughput: Caching improves latency, batching improves throughput
- Space vs Time: Denormalization trades space for query speed
- Reliability vs Cost: More redundancy = higher cost
Quick Reference & Resources
Essential Resources
- Books: "Designing Data-Intensive Applications" (Kleppmann)
- Papers: MapReduce, Dynamo, Bigtable, Raft
- Blogs: High Scalability, AWS Architecture Center
- Tools: Draw.io for diagrams, Figma for UI flows
Company Tech Stacks (Examples)
- Netflix: Microservices, Cassandra, Kafka, AWS
- Uber: Go/Java, Kafka, Cassandra, Real-time ML
- Airbnb: Rails, MySQL, Redis, Kafka, Kubernetes
- Pinterest: Python, MySQL, HBase, Kafka, Kubernetes