Disclaimer

The scenarios on this page are generalized system design case studies inspired by publicly discussed problems and common industry patterns. They are not tied to any specific company, interview process, or proprietary material. The reasoning presented represents one possible way to think about the problem; multiple valid designs may exist depending on constraints.

🚀 System Design Sample Reasonings

Complete Study Guide - 25 Question Variants Across 15 Topics

🎵 Music Applications
Variant 1: Multi-Server Architecture
We have a music application up and running. What are the advantages and disadvantages of running the system on multiple servers?

🤔 Clarifying Questions:

  • What's the current user base and expected growth?
  • Are we talking about horizontal scaling or service decomposition?
  • What's the current bottleneck - CPU, storage, or network?
  • Do we need global distribution or single region?
  • What's the read/write ratio for music streaming vs uploads?

🎯 Main Issues & Analysis:

Current State: Single-server architecture hitting scalability limits

Core Challenge: Balancing scalability benefits with operational complexity

✅ Multi-Server Advantages:

  • Horizontal Scalability: Handle more concurrent users by adding servers
  • High Availability: Eliminate single point of failure
  • Geographic Distribution: Reduce latency with CDN and edge servers
  • Load Distribution: Separate read-heavy (streaming) from write-heavy (uploads) workloads
  • Resource Optimization: Dedicated servers for different functions (streaming, metadata, user management)

โŒ Multi-Server Disadvantages:

  • Data Consistency: Eventual consistency challenges across replicas
  • Network Complexity: Inter-server communication overhead
  • Operational Overhead: Monitoring, deployment, and maintenance complexity
  • Cost: Higher infrastructure and operational costs
  • Distributed System Challenges: Network partitions, CAP theorem trade-offs

📋 Requirements Analysis:

Functional Requirements:

  • Stream music with low latency
  • Upload and store music files
  • User authentication and playlists
  • Search and discovery

Non-Functional Requirements:

  • Availability: 99.9% uptime
  • Scalability: Handle 10M+ concurrent users
  • Latency: <100ms for metadata, <2s for streaming start
  • Consistency: Strong for user data, eventual for music catalog

🛠️ Recommended Architecture:

Load Balancer: NGINX/HAProxy for traffic distribution

Application Servers: Multiple instances behind load balancer

Database: Master-slave MySQL for metadata, sharded by user_id

File Storage: S3/GCS for music files with CDN

Cache: Redis for session management and popular content

⚖️ Final Recommendation:

Go with multi-server architecture because:

  • Music streaming is inherently read-heavy and benefits from horizontal scaling
  • Global user base requires geographic distribution
  • The availability benefits outweigh complexity costs
  • Start with microservices: User Service, Music Service, Streaming Service
Variant 2: Consistent Hashing Load Distribution
Given a music streaming and uploading service: the system is highly scalable and works on consistent hashing, and load is distributed equally across all servers. Do you see any concerns with this architecture?

🤔 Clarifying Questions:

  • What's the hashing key - user_id, song_id, or something else?
  • Are we talking about data distribution or request routing?
  • What's the replication factor?
  • How do we handle hot users/songs?
  • What happens during server failures?

🚨 Concerns with Pure Consistent Hashing:

Hot Data Problem: Popular songs/artists create uneven load despite equal distribution

Geographic Latency: Users may hit servers far from their location

Cache Efficiency: Related data scattered across servers

✅ Sample Reasonings & Improvements:

  • Virtual Nodes: Each physical server handles multiple virtual nodes for better distribution
  • Load-Aware Routing: Monitor server load and route requests to less loaded servers
  • Hierarchical Hashing: Geographic clustering + consistent hashing within regions
  • Hot Data Replication: Replicate popular content to multiple servers
  • Adaptive Caching: Cache popular content at edge servers

📋 System Requirements:

Functional: Consistent data access, fault tolerance

Non-Functional: Low latency, high availability, even load distribution

This is a read-heavy system with occasional writes (uploads)

🛠️ Enhanced Architecture:

Consistent Hashing Ring: With virtual nodes (150-200 per server)
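A minimal sketch of such a ring with virtual nodes, assuming illustrative server names and 150 virtual nodes per physical server (the class and method names are not from any specific library):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hash ring with virtual nodes for smoother load distribution."""

    def __init__(self, servers, vnodes=150):
        self.vnodes = vnodes
        self.ring = []          # sorted list of (hash, server) points on the ring
        for server in servers:
            self.add_server(server)

    def _hash(self, key: str) -> int:
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def add_server(self, server: str) -> None:
        # Each physical server owns `vnodes` points on the ring.
        for i in range(self.vnodes):
            self.ring.append((self._hash(f"{server}#{i}"), server))
        self.ring.sort()

    def remove_server(self, server: str) -> None:
        self.ring = [(h, s) for h, s in self.ring if s != server]

    def get_server(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        h = self._hash(key)
        idx = bisect.bisect_left(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["stream-1", "stream-2", "stream-3"])
print(ring.get_server("song:12345"))   # routes this song ID to one of the servers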

Load Balancer: Weighted round-robin with health checks

CDN: CloudFront/CloudFlare for static music files

Monitoring: Real-time load monitoring and alerting

⚖️ Recommendation:

Use enhanced consistent hashing with:

  • Virtual nodes for better distribution
  • Geographic awareness for latency
  • Hot data detection and replication
  • Load monitoring and adaptive routing
Variant 3: Distributed Song Storage
A music streaming service has the songs distributed across servers. What are the potential problems?

🤔 Clarifying Questions:

  • How are songs distributed - by artist, genre, or random?
  • What's the replication strategy?
  • How do we handle metadata vs audio files?
  • What's the failover mechanism?
  • How do we discover which server has a specific song?

🚨 Potential Problems:

  • Single Point of Failure: If server with unique song goes down, song becomes unavailable
  • Metadata Inconsistency: Song metadata and audio files may be out of sync
  • Discovery Overhead: Need efficient way to locate songs across servers
  • Hot Partitions: Servers with popular songs get overloaded
  • Network Latency: Cross-server requests for playlists spanning multiple servers
  • Backup Complexity: Ensuring all songs are properly backed up

✅ Sample Reasonings:

  • Replication Strategy: Minimum 3 replicas per song across different servers
  • Metadata Service: Centralized metadata store with song location mapping
  • Load Balancing: Intelligent routing to least loaded replica
  • Caching Strategy: Cache popular songs at edge servers
  • Health Monitoring: Real-time server health checks and automatic failover

📋 System Design Considerations:

Availability: 99.95% - music must always be accessible

Consistency: Eventual consistency for song catalog, strong for user preferences

Partition Tolerance: System must work despite network failures

Read-Heavy Workload: Optimize for fast reads, streaming

🛠️ Recommended Architecture:

Metadata Store: MongoDB cluster with song-to-server mapping

File Storage: Distributed file system (HDFS) or object storage (S3)

Service Discovery: Consul/Eureka for server registry

Load Balancer: HAProxy with health checks

CDN: Multi-tier caching strategy

⚖️ Final Recommendation:

Implement distributed storage with:

  • 3-way replication for high availability
  • Separate metadata and audio file storage
  • Intelligent load balancing and failover
  • CDN for popular content
📁 File Storage & Deduplication
Variant 1: File Size & Byte Comparison
Users upload similar files to the server. To save space, we check the file size; if two files have the same size, we compare them byte by byte and then create a symlink from the new file to the old one. What are the problems with this system, and is there a more optimized approach?

🤔 Clarifying Questions:

  • What's the average file size and upload frequency?
  • Are files shared between users or private?
  • What's the storage budget and deduplication target?
  • Do we need to maintain separate file metadata per user?
  • What's the acceptable latency for uploads?

🚨 Problems with Current Approach:

  • Performance Bottleneck: Byte-by-byte comparison is O(n) and CPU intensive
  • Size Collisions: Many distinct files share the same size, so every collision triggers an expensive full byte-by-byte comparison
  • Scalability Issues: Doesn't scale with large files or high upload frequency
  • Symlink Fragility: Symlinks break if original file is deleted
  • Metadata Confusion: Different users' files may have same content but different metadata

✅ Optimized Approach:

  • Content Hashing: Use a SHA-256 hash as the primary deduplication key (avoid MD5, which is collision-prone)
  • Chunked Hashing: For large files, use rolling hash (similar to rsync)
  • Reference Counting: Track how many users reference each unique file
  • Metadata Separation: Store file content once, metadata per user
  • Lazy Deletion: Only delete files when reference count reaches zero

📋 System Requirements:

Functional: Efficient deduplication, data integrity, user isolation

Non-Functional: Fast uploads, space efficiency, data consistency

Write-Heavy System: Optimize for quick file processing

🛠️ Enhanced Architecture:

Hash Database: Redis for fast hash lookups

File Storage: Content-addressed storage (hash-based paths)

Metadata DB: PostgreSQL for user file metadata

Queue System: Async processing for hash computation

Key Insight: Move from comparing files to comparing hashes - the file is still read once to compute its hash, but the dedup check becomes an O(1) index lookup instead of O(n) pairwise byte comparisons
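A minimal sketch of that upload path, assuming an in-memory dict stands in for the hash index (Redis in the architecture above) and that the storage-copy step is only hinted at in a comment:

```python
import hashlib

hash_index = {}   # content hash -> {"path": storage path, "refs": reference count}

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large uploads never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def store_upload(tmp_path: str) -> str:
    """Return the content hash; store the bytes only if they have never been seen."""
    content_hash = sha256_of_file(tmp_path)
    entry = hash_index.get(content_hash)      # O(1) lookup instead of byte comparison
    if entry:
        entry["refs"] += 1                    # existing content: just add a reference
    else:
        storage_path = f"objects/{content_hash[:2]}/{content_hash}"
        # copy tmp_path into content-addressed storage at storage_path here
        hash_index[content_hash] = {"path": storage_path, "refs": 1}
    return content_hash
```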

⚖️ Implementation Strategy:

  1. Hash Generation: Compute SHA-256 during upload
  2. Hash Lookup: Check if hash exists in database
  3. Reference Management: Increment reference count or store new file
  4. Metadata Storage: Always store per-user metadata separately
Variant 2: General File Deduplication
File system: users upload files, and sometimes they upload duplicate files. Find a way to avoid storing duplicate files.

🤔 Clarifying Questions:

  • What's the scale - number of users and files?
  • Are duplicates within same user or across users?
  • What file types and sizes are we dealing with?
  • Do we need to preserve user-specific metadata?
  • What's the privacy requirement - can users access others' files?

🎯 Core Challenge:

Duplicate Detection: Efficiently identify identical files across large dataset

Privacy Concerns: Users shouldn't access each other's files even if identical

Metadata Handling: Same content, different names/permissions per user

✅ Comprehensive Sample Reasoning:

  • Content-Addressable Storage: Store files by content hash
  • Multi-Level Deduplication:
    • File-level: Complete file deduplication
    • Block-level: Chunk-based deduplication for large files
  • Virtual File System: User sees their own file tree, backend stores deduplicated content
  • Reference Tracking: Maintain reference count per content hash

📋 System Requirements:

Scalability: Handle millions of files efficiently

Privacy: Strong user isolation

Consistency: Strong consistency for user data

Space Efficiency: Maximize storage savings

🛠️ Architecture Components:

Content Store: S3/GCS with content-addressable paths

Metadata DB: PostgreSQL with user file mappings

Hash Index: Redis for fast content hash lookups

Processing Queue: Kafka for async deduplication

Block Storage: For chunk-level deduplication

Architecture Pattern: Separate content storage from user namespace - one content store, multiple user views
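A minimal sketch of the "one content store, multiple user views" idea, using plain dicts for the user namespace and reference counts (names and the placeholder hash are illustrative, not a real storage API):

```python
content_refs = {}                 # content hash -> reference count
user_files = {}                   # (user_id, filename) -> content hash

def add_file(user_id: str, filename: str, content_hash: str) -> None:
    # The user's namespace points at shared content; the bytes exist only once.
    user_files[(user_id, filename)] = content_hash
    content_refs[content_hash] = content_refs.get(content_hash, 0) + 1

def delete_file(user_id: str, filename: str) -> None:
    content_hash = user_files.pop((user_id, filename))
    content_refs[content_hash] -= 1
    if content_refs[content_hash] == 0:
        del content_refs[content_hash]
        # garbage-collect the stored object here (e.g. delete objects/<hash>)

add_file("alice", "song.mp3", "ab12...")
add_file("bob", "track.mp3", "ab12...")   # same content, different name: no extra copy
delete_file("alice", "song.mp3")          # bob's reference keeps the bytes alive
```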

⚖️ Implementation Approach:

  1. Upload Process: Hash → Check existence → Store/Reference
  2. User Interface: Virtual file system with user-specific metadata
  3. Garbage Collection: Periodic cleanup of unreferenced content
  4. Monitoring: Track deduplication ratio and storage savings
🔒 Password Validation
Variant 1: English Word Restriction
Password validation: the password must be an English word of 8-16 characters with at least 1 uppercase letter and 1 symbol. What is the problem with this approach?

🤔 Clarifying Questions:

  • Are we allowing dictionary words or requiring non-dictionary words?
  • What's the target user base - global or English-speaking only?
  • What's the security threat model we're protecting against?
  • Are there regulatory compliance requirements?
  • What's the acceptable user experience for password creation?

🚨 Major Problems:

  • Dictionary Attack Vulnerability: English words are easily guessable
  • Reduced Entropy: Limited to ~170,000 English words vs random combinations
  • Predictable Patterns: Users will add symbols/caps in predictable ways
  • Cultural Bias: Excludes non-English speakers
  • Brute Force Weakness: Much smaller search space for attackers

✅ Improved Approach:

  • Entropy-Based Validation: Minimum 50+ bits of entropy
  • Blacklist Common Passwords: Check against known breach databases
  • Passphrase Support: Allow multiple words with spaces
  • Multi-Factor Authentication: Reduce password burden with 2FA
  • Password Strength Meter: Real-time feedback to users

📋 Security Requirements:

Functional: Strong authentication, user-friendly validation

Non-Functional: High security, good UX, regulatory compliance

Threat Model: Protect against credential stuffing, brute force, social engineering

🛠️ Security Architecture:

Password Hashing: bcrypt/scrypt/Argon2 with salt

Breach Database: HaveIBeenPwned API integration

Rate Limiting: Prevent brute force attempts

Audit Logging: Track authentication attempts

Key Principle: Security through entropy, not obscurity. Random characters > predictable patterns
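A minimal sketch of entropy-based validation, assuming a rough character-class entropy estimate and a small local set standing in for a breach database (a real system would query something like the HaveIBeenPwned k-anonymity API instead):

```python
import math
import string

BREACHED = {"P@ssword1", "Summer2024!", "Welcome1!"}   # stand-in for a breach database

def estimated_entropy_bits(password: str) -> float:
    """Rough upper bound: size of the character classes used, raised to the length."""
    pool = 0
    if any(c in string.ascii_lowercase for c in password): pool += 26
    if any(c in string.ascii_uppercase for c in password): pool += 26
    if any(c in string.digits for c in password):          pool += 10
    if any(c in string.punctuation for c in password):     pool += len(string.punctuation)
    return len(password) * math.log2(pool) if pool else 0.0

def is_acceptable(password: str) -> bool:
    if password in BREACHED:            # known-breached passwords are rejected outright
        return False
    return len(password) >= 12 or estimated_entropy_bits(password) >= 50

print(is_acceptable("P@ssword1"))                      # False: found in the breach list
print(is_acceptable("correct horse battery staple"))   # True: long passphrase
```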

⚖️ Recommended Policy:

  • Minimum 12 characters OR high entropy score
  • No dictionary word restrictions - check against breach databases instead
  • Support passphrases - multiple words are stronger than complex single words
  • Mandatory 2FA for high-value accounts
Variant 2: Password Management System
You are designing a password management system. The rules are: the password must not contain any English words, must be between 8 and 16 characters, and must contain at least one uppercase letter, one lowercase letter, and one symbol. Do you see any issues with this? What would you change to improve it?

🤔 Clarifying Questions:

  • Are we storing passwords for users or generating them?
  • What's the target security level - enterprise or consumer?
  • Do we need to support password sharing between users?
  • What's the master password policy?
  • Are there integration requirements with other systems?

🚨 Issues with Current Rules:

  • Length Limitation: 16 char max is too restrictive for strong passwords
  • Composition Rules: Rigid requirements reduce actual entropy
  • English-Only Focus: Doesn't consider other languages
  • User Experience: Difficult to create compliant passwords
  • False Security: Complex rules don't guarantee strong passwords

✅ Enhanced Password Management System:

  • Password Generation: Automatic generation of high-entropy passwords
  • Flexible Length: Support passwords up to 256 characters
  • Entropy Measurement: Real-time entropy calculation
  • Zero-Knowledge Architecture: Client-side encryption/decryption
  • Secure Storage: End-to-end encryption with user master key

📋 System Requirements:

Functional: Store, generate, and auto-fill passwords

Non-Functional: Zero-knowledge security, high availability, cross-platform

Security Model: Even the service provider cannot access user passwords

🛠️ Architecture Components:

Client Apps: Browser extensions, mobile apps with local crypto

Encryption: AES-256 with PBKDF2/scrypt for key derivation

Backend: Store encrypted blobs, no access to plaintext

Sync: Secure multi-device synchronization

Backup: Encrypted backup with recovery keys

Core Principle: Generate strong passwords automatically, don't force users to create weak ones manually
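A minimal sketch of automatic generation with Python's standard `secrets` module; the 20-character default mirrors the policy below, and the symbol set is an assumption:

```python
import secrets
import string

def generate_password(length: int = 20) -> str:
    """Generate a high-entropy random password from letters, digits, and symbols."""
    alphabet = string.ascii_letters + string.digits + "!@#$%^&*()-_=+"
    return "".join(secrets.choice(alphabet) for _ in range(length))

print(generate_password())   # ~6.2 bits of entropy per character with this 76-symbol alphabet
```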

⚖️ Improved Password Policy:

  • Auto-Generate: Default to 20+ character random passwords
  • Flexible Rules: Allow sites to specify their own requirements
  • Entropy-Based: Measure actual password strength
  • User Choice: Support both generated and custom passwords
  • Secure by Default: Strongest settings as default
💾 Data Storage & Cost Estimation
Variant 1: Logging System Storage
We have a logging system up and running. How would you estimate the data storage cost for the next year?

🤔 Clarifying Questions:

  • What's the current daily/monthly log volume?
  • What types of logs - application, access, error, audit?
  • What's the retention policy and compliance requirements?
  • Are logs structured (JSON) or unstructured (text)?
  • What's the expected business growth rate?
  • Current storage infrastructure - cloud or on-premise?

🎯 Estimation Approach:

Data Collection: Establish baseline metrics

Growth Modeling: Account for business and technical growth

Storage Tiers: Consider different storage classes

✅ Estimation Framework:

  • Current Metrics:
    • Average log entry size (e.g., 200 bytes)
    • Logs per second per server (e.g., 50/sec)
    • Number of servers (e.g., 100)
    • Daily volume = 100 servers × 50 logs/sec × 86,400 sec × 200 bytes ≈ 86.4 GB/day
  • Growth Factors:
    • Business growth: 50% more users/year
    • Infrastructure growth: 30% more servers
    • Feature growth: 20% more logging
  • Storage Tiers:
    • Hot (0-7 days): Frequent access, SSD storage
    • Warm (7-90 days): Occasional access, standard storage
    • Cold (90+ days): Archive, glacier storage

📋 Storage Requirements:

Functional: Reliable storage, fast search, compliance

Non-Functional: Cost-effective, scalable, durable

Write-Heavy System: Optimize for high write throughput

🛠️ Storage Architecture:

Hot Tier: Elasticsearch cluster for search

Warm Tier: S3 Standard for occasional access

Cold Tier: S3 Glacier for long-term retention

Processing: Kafka for real-time ingestion

Compression: Gzip compression (3-5x reduction)

Sample Calculation:
Current: 86.4 GB/day × 365 days ≈ 31.5 TB/year
With growth: 31.5 TB × 1.5 × 1.3 × 1.2 ≈ 73.7 TB/year
With compression: 73.7 TB ÷ 4 ≈ 18.4 TB actual storage
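The same arithmetic as a short script, so the assumptions (100 servers, 50 logs/sec, 200-byte entries, the growth factors, and 4x compression) are explicit and easy to adjust:

```python
servers, logs_per_sec, bytes_per_log = 100, 50, 200

daily_gb = servers * logs_per_sec * 86_400 * bytes_per_log / 1e9   # ~86.4 GB/day
yearly_tb = daily_gb * 365 / 1e3                                   # ~31.5 TB/year
with_growth = yearly_tb * 1.5 * 1.3 * 1.2                          # ~73.7 TB/year
stored_tb = with_growth / 4                                        # ~18.4 TB after compression

print(f"{daily_gb:.1f} GB/day, {with_growth:.1f} TB/yr raw, {stored_tb:.1f} TB stored")
```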

⚖️ Cost Estimation (AWS pricing):

  • Hot Storage (7 days): ~350 GB × $0.23/GB ≈ $81/month
  • Warm Storage (83 days): ~4.2 TB × $0.023/GB ≈ $96/month
  • Cold Storage (275 days): ~13.9 TB × $0.004/GB ≈ $55/month
  • Total Monthly: ~$232/month ≈ $2,800/year
  • With 20% buffer: ~$3,350/year
Variant 2: Photo Storage Service
We are running a simple photo storage and sharing service. People upload their photos to our servers and then give links to other users who can then view them. Instead of using a cloud service, we have our own server farms. You've been tasked with creating an estimate of the storage required over the coming year and the cost of that storage. What information would you need and what factors would you consider as you generate this estimate?

🤔 Information Needed:

  • User Metrics: Current users, growth rate, user behavior patterns
  • Photo Characteristics: Average file size, formats, resolution
  • Usage Patterns: Upload frequency, sharing ratio, retention
  • Business Model: Storage limits, premium features, monetization
  • Technical Constraints: Bandwidth, server capacity, budget

🎯 Key Factors to Consider:

  • Growth Patterns: User acquisition, seasonal variations
  • Storage Overhead: Metadata, thumbnails, redundancy
  • Hardware Lifecycle: Replacement cycles, capacity planning
  • Operational Costs: Power, cooling, maintenance

✅ Estimation Model:

  • Baseline Metrics:
    • Current users: 100K
    • Average photo size: 3MB
    • Photos per user per month: 50
    • Monthly upload volume: 100K × 50 × 3MB = 15TB
  • Growth Projections:
    • User growth: 20% monthly
    • Usage growth: 10% monthly (more engagement)
    • Photo size growth: 5% monthly (better cameras)
  • Storage Overhead:
    • Thumbnails: 50KB per photo
    • Metadata: 1KB per photo
    • Redundancy: 3x replication
    • Total multiplier: 3.5x

📋 System Requirements:

Functional: Store photos, generate thumbnails, share links

Non-Functional: High availability, fast access, cost-effective

Read-Heavy System: Optimize for fast photo retrieval

🛠️ Storage Infrastructure:

Hot Storage: SSDs for recent photos (30 days)

Warm Storage: HDDs for older photos (6 months)

Cold Storage: Tape backup for long-term retention

CDN: Edge caching for popular photos

Compression: Lossless compression for archival

Year-End Projection:
Month 1: 15TB → Month 12: 15TB × (1.2 × 1.1 × 1.05)^11 ≈ 544TB
Total yearly upload volume: ~1.9PB
With overhead: 1.9PB × 3.5 ≈ 6.7PB
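The compound-growth projection as a short script; the monthly multipliers and the 3.5x overhead factor are the assumptions stated above:

```python
monthly_growth = 1.20 * 1.10 * 1.05      # users x engagement x photo size ≈ 1.386
base_tb = 15                             # month-1 upload volume in TB

monthly = [base_tb * monthly_growth ** m for m in range(12)]
print(f"Month 12 uploads: {monthly[-1]:.0f} TB")                   # ~544 TB
print(f"Year total: {sum(monthly) / 1000:.1f} PB")                 # ~1.9 PB
print(f"With 3.5x overhead: {sum(monthly) * 3.5 / 1000:.1f} PB")   # ~6.7 PB
```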

⚖️ Cost Breakdown:

  • Hardware: 6.7PB × $50/TB ≈ $335,000
  • Infrastructure: Servers, networking = $150,000
  • Operational: Power, cooling, maintenance = $180,000/year
  • Personnel: DevOps, support = $300,000/year
  • Total: ~$965,000 first year
🎮 Arcade/Gaming Systems
Variant 1: Card Sensor System
We have a card for arcade games that users need to recharge to use, or they can link it to a credit card. A sensor on the machine reads the card and allows the user to play if the card has enough money. We will add this sensor system to 125,000 machines. What concerns do you have with this approach?

🤔 Clarifying Questions:

  • What's the connectivity model - online or offline capable?
  • How do we handle network failures and payments?
  • What's the fraud prevention and security model?
  • Are cards shared between different arcade locations?
  • What's the expected transaction volume per machine?

🚨 Major Concerns:

  • Network Reliability: 125K machines need constant connectivity
  • Payment Processing: Credit card transactions at scale
  • Fraud Prevention: Card cloning, charge-back protection
  • Offline Capability: Machines must work during network outages
  • Data Synchronization: Balance updates across all machines
  • Security: PCI compliance, encryption, tamper resistance

✅ Sample Reasonings & Mitigations:

  • Hybrid Architecture: Online primary, offline backup mode
  • Local Balance Cache: Store encrypted balance on card
  • Batch Processing: Queue transactions during outages
  • Distributed Validation: Machine-to-machine verification
  • Fraud Detection: Real-time anomaly detection
  • Secure Hardware: HSM for encryption keys

📋 System Requirements:

Functional: Process payments, track balances, prevent fraud

Non-Functional: 99.9% uptime, PCI compliance, low latency

Scale: 125K machines, millions of transactions daily

🛠️ Architecture:

Central System: Microservices for payment processing

Edge Computing: Local servers in each arcade

Card Technology: NFC with secure element

Connectivity: 4G/5G with WiFi backup

Database: Distributed database with eventual consistency

Key Insight: Design for offline-first operation - machines must work even when disconnected
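A minimal sketch of that offline-first flow, assuming a balance cached locally (on the card or machine) and a local queue that is flushed to the central system when connectivity returns; all names and amounts are illustrative:

```python
import queue
import time

pending = queue.Queue()        # transactions waiting to sync to the central system

def try_play(card_balance_cents: int, price_cents: int, online: bool) -> tuple[bool, int]:
    """Authorize a game from the locally cached balance; defer settlement if offline."""
    if card_balance_cents < price_cents:
        return False, card_balance_cents
    new_balance = card_balance_cents - price_cents
    txn = {"amount": price_cents, "ts": time.time()}
    if online:
        pass                          # send txn to the payment service immediately
    else:
        pending.put(txn)              # settle later; the machine keeps working offline
    return True, new_balance

ok, balance = try_play(card_balance_cents=500, price_cents=200, online=False)
print(ok, balance, pending.qsize())   # True 300 1 -> play allowed, sync deferred
```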

⚖️ Implementation Strategy:

  • Pilot Program: Start with 1,000 machines to validate
  • Gradual Rollout: Phase deployment to manage risk
  • Fallback Plan: Maintain token-based system as backup
  • Security First: Implement end-to-end encryption
Variant 2: Tap Card Arcade
Tap card arcade

🤔 Clarifying Questions:

  • What's the specific problem - designing the system or troubleshooting?
  • How many locations and machines are involved?
  • What's the current payment flow and technology stack?
  • Are there specific performance or reliability issues?
  • What's the budget and timeline for implementation?

🎯 System Design Considerations:

User Experience: Fast, reliable tap-to-play experience

Technical Challenges: NFC reliability, payment processing, balance management

Business Requirements: Revenue tracking, fraud prevention, customer retention

✅ Tap Card System Design:

  • NFC Cards: Contactless payment with secure element
  • Dual Storage: Balance on card + central database
  • Instant Response: <200ms transaction time
  • Offline Mode: Local balance validation
  • Real-time Sync: Background synchronization

📋 System Requirements:

Functional: Tap-to-pay, balance management, game activation

Non-Functional: Sub-second response, 99.95% availability

Transaction-Heavy: Optimize for high-frequency small transactions

🛠️ Technical Stack:

Cards: NFC-enabled smart cards with secure element

Readers: NFC readers with tamper detection

Gateway: Local payment gateway per location

Backend: Cloud-based balance management system

Analytics: Real-time transaction monitoring

Design Principle: Optimize for user experience - every tap should result in immediate game activation

⚖️ Implementation Approach:

  • Card-First: Prioritize offline capability
  • Redundancy: Multiple validation methods
  • Monitoring: Real-time alerts for system issues
  • Scalability: Design for peak usage periods
🎥 Video Processing
Variant 1: Subtitle Generation Service
We are working on a service that generates subtitles for users' videos. This process starts a new thread for every video and is processor-intensive. Currently, this service runs as a single process on a machine. We've run into a bug where if the service is processing more than 10 videos at the same time, the service crashes the server, losing all requests currently being processed and affecting other processes on the machine. It may take a long time to find and fix this bug. What workarounds could we implement to continue running the service while we do?

🤔 Clarifying Questions:

  • What's the average video processing time?
  • What's the current request volume and expected growth?
  • Are there resource constraints - CPU, memory, or disk?
  • What's the acceptable processing delay for users?
  • Can we use multiple machines or must we work with current setup?

🚨 Current Problems:

  • Resource Exhaustion: Too many concurrent threads causing crashes
  • Work Loss: All progress lost when service crashes
  • System Impact: Crashes affect other processes on same machine
  • No Graceful Degradation: Hard limit causing complete failure
  • No Persistence: No way to resume interrupted work

✅ Immediate Workarounds:

  • Concurrency Limit: Implement a thread pool with a maximum of 8 threads (see the sketch after this list)
  • Request Queue: Queue incoming requests, process FIFO
  • Process Isolation: Run service in containerized environment
  • Checkpointing: Save progress periodically to resume work
  • Circuit Breaker: Stop accepting new requests when overloaded
  • Resource Monitoring: Monitor CPU/memory and throttle accordingly
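A minimal sketch of the first two workarounds (bounded worker pool plus a request queue) using Python's standard library; `generate_subtitles` is a placeholder for the real processing call, and the queue size is an assumption:

```python
from concurrent.futures import ThreadPoolExecutor
import queue
import threading

MAX_WORKERS = 8                      # stay below the ~10-video crash threshold
jobs = queue.Queue(maxsize=100)      # back-pressure: reject work instead of crashing
pool = ThreadPoolExecutor(max_workers=MAX_WORKERS)

def generate_subtitles(video_id: str) -> None:
    ...                              # placeholder for the CPU-intensive library call

def submit(video_id: str) -> bool:
    try:
        jobs.put_nowait(video_id)    # acts as a simple circuit breaker when full
        return True
    except queue.Full:
        return False                 # caller can retry later or return 429 to the client

def dispatcher() -> None:
    while True:
        pool.submit(generate_subtitles, jobs.get())

threading.Thread(target=dispatcher, daemon=True).start()
```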

📋 System Requirements:

Functional: Generate subtitles, handle concurrent requests

Non-Functional: Fault tolerance, no work loss, system stability

CPU-Intensive: Optimize for computational efficiency

Video Processing - Subtitle Generation Service
Clarifying Questions
  • What's the average video length and processing time?
  • What's the current throughput requirement (videos per hour)?
  • Are there any SLA requirements for subtitle generation?
  • Can we process videos asynchronously or does it need to be real-time?
  • What's the server specification and resource utilization?
  • Are there any budget constraints for additional infrastructure?
Main Sample Reasoning

Immediate Workarounds:

  1. Resource Isolation with Containers: Run each video processing task in a separate Docker container with resource limits (CPU, memory). This prevents one task from crashing the entire system.
  2. Process Queue with Circuit Breaker: Implement a queue system (Redis/RabbitMQ) that limits concurrent processing to 8-9 videos (below the crash threshold). Use circuit breaker pattern to prevent system overload.
  3. Horizontal Scaling: Deploy multiple instances of the service across different machines, each handling a subset of the load.
  4. Graceful Degradation: Implement health checks and automatic service restart mechanisms to minimize downtime.

Architecture:

```
Load Balancer → Queue System → Worker Nodes (Containerized)
      ↓              ↓                    ↓
 API Gateway  →  Redis Queue  →  [Worker1, Worker2, Worker3]
                                          ↓
                                Database (Job Status)
```
NFRs & FRs

Functional Requirements:

  • Generate subtitles for uploaded videos
  • Support multiple video formats
  • Provide job status tracking
  • Handle video upload and subtitle download

Non-Functional Requirements:

  • Reliability: 99.9% uptime, fault tolerance
  • Scalability: Handle varying loads, auto-scaling
  • Performance: Process within reasonable time limits
  • Availability: Service should remain available during peak loads
  • Consistency: Eventual consistency for job status
Trade-offs & Considerations
| Approach | Pros | Cons |
|---|---|---|
| Containerization | Resource isolation, easy deployment | Overhead, complexity |
| Queue System | Controlled processing, fault tolerance | Additional infrastructure, latency |
| Horizontal Scaling | Higher throughput, redundancy | Higher costs, coordination complexity |
Vending Machine - Internet-Connected Infrastructure
You're working on infrastructure for internet-connected vending machines. The plan is to install around 188,888 of these vending machines in the coming year, in major cities around the world. These machines will connect to the internet through a cellular network. Each machine will connect to a central server at midnight to report remaining stock and any maintenance issues like coin jams or stuck items. These machine status updates will be stored in a database, and a batch job will run at 1 AM to schedule the restocking and maintenance of machines. Are there any problems with the above design? How would you solve them?
Clarifying Questions
  • What's the data size for each machine's status report?
  • Are there any real-time requirements for urgent maintenance issues?
  • What's the acceptable downtime for maintenance scheduling?
  • Are there regional compliance requirements for data storage?
  • What's the budget for cellular connectivity and server infrastructure?
  • Do machines need to support offline operation?
Problems Identified

Major Issues:

  1. Thundering Herd Problem: All 188,888 machines connecting simultaneously at midnight will overwhelm the server
  2. Single Point of Failure: Central server failure affects all machines globally
  3. Time Zone Complexity: "Midnight" varies across global locations
  4. Network Congestion: Cellular networks may struggle with simultaneous connections
  5. Database Bottleneck: Batch processing of ~189K records at once
  6. Maintenance Scheduling Delay: 1-hour gap between reporting and scheduling
Proposed Sample Reasonings

Architecture Improvements:

  1. Distributed Regional Architecture:
    • Regional data centers with local servers
    • Machines connect to nearest regional server
    • Data replication between regions
  2. Staggered Reporting Schedule:
    • Distribute connections across 2-3 hour window
    • Use a machine ID hash to determine the reporting slot (see the sketch after this list)
    • Implement exponential backoff for failed connections
  3. Event-Driven Processing:
    • Real-time processing for critical issues
    • Stream processing for continuous updates
    • Batch processing for bulk operations
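A minimal sketch of the staggered schedule from item 2: each machine hashes its ID into a reporting slot spread across a local-time window (the window length and slot size are assumptions):

```python
import hashlib

WINDOW_MINUTES = 180          # spread check-ins over a 3-hour window
SLOT_MINUTES = 5              # machines report in 5-minute buckets

def reporting_slot(machine_id: str) -> int:
    """Deterministically map a machine ID to a minute offset within the window."""
    h = int(hashlib.sha256(machine_id.encode()).hexdigest(), 16)
    return (h % (WINDOW_MINUTES // SLOT_MINUTES)) * SLOT_MINUTES

print(reporting_slot("VM-0001"))   # e.g. 85 -> report 85 minutes after local midnight
```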
```
   Machines → Regional Load Balancer → Regional Server Cluster
       ↓                ↓                        ↓
Message Queue  →  Stream Processor   →   Database Cluster
       ↓                ↓                        ↓
Priority Queue → Maintenance Scheduler → Notification Service
```
NFRs & FRs

Functional Requirements:

  • Collect machine status reports (inventory, maintenance issues)
  • Schedule restocking and maintenance
  • Handle global deployment across time zones
  • Support offline operation and data synchronization

Non-Functional Requirements:

  • Scalability: Handle 188K+ machines with linear growth
  • Availability: 99.9% uptime, regional redundancy
  • Performance: Handle concurrent connections efficiently
  • Consistency: Eventual consistency for non-critical updates
  • Reliability: Message delivery guarantees, retry mechanisms
Trade-offs & Considerations
| Sample Reasoning | Pros | Cons |
|---|---|---|
| Regional Architecture | Reduced latency, fault isolation | Higher complexity, data consistency challenges |
| Staggered Reporting | Smooth load distribution | Delayed insights, implementation complexity |
| Stream Processing | Real-time insights, better resource utilization | Higher infrastructure costs |
Mobile Game Analysis - Go Game Analysis
We are working on a mobile app for the board game Go. We'd like to add a feature where the computer will analyze a complete game. The analysis looks at each position from the game and provides suggested moves to help improve our users' play. We've found a library we can use to do this analysis. It takes an average of a minute on a modern desktop computer to analyze an entire game. An average game consists of about 200 moves. We are considering two approaches: 1) running this analysis on the phone itself, and 2) sending the game to a server farm for analysis that will be returned to the user. What are some advantages or disadvantages of each approach?
Clarifying Questions
  • What's the target device specification range (low-end to high-end)?
  • How frequently do users request game analysis?
  • Are there any real-time requirements for analysis results?
  • What's the acceptable battery drain for mobile processing?
  • Are there any privacy concerns with sending game data to servers?
  • What's the expected user base and concurrent analysis requests?
Comparative Analysis
| Aspect | Mobile Processing | Server Processing |
|---|---|---|
| Performance | Varies by device (2-10 minutes) | Consistent (30-60 seconds) |
| Battery Impact | High CPU usage, significant drain | Minimal, just network activity |
| Network Dependency | None required | Requires stable internet |
| Privacy | Complete data privacy | Data transmitted to servers |
| Cost | No ongoing costs | Server infrastructure costs |
| Scalability | Scales with user devices | Requires capacity planning |
Recommended Hybrid Approach

Adaptive Processing Strategy:

  1. Device Classification:
    • High-end devices: Mobile processing with user choice
    • Mid-range devices: Server processing by default
    • Low-end devices: Server processing only
  2. Progressive Analysis:
    • Quick analysis (key moves) on mobile
    • Detailed analysis on server
    • Cached results for common patterns
  3. Intelligent Queueing:
    • Priority queues for paying users
    • Background processing for free users
    • Load balancing across server clusters
NFRs & FRs

Functional Requirements:

  • Analyze complete Go games (200+ moves)
  • Provide move suggestions and position evaluation
  • Support both mobile and server processing
  • Handle various game formats and rule sets

Non-Functional Requirements:

  • Performance: Analysis completion within 2-5 minutes
  • Scalability: Handle thousands of concurrent analyses
  • Availability: 99.5% uptime for server processing
  • Usability: Seamless user experience across devices
  • Efficiency: Minimal battery drain on mobile devices
Architecture Considerations
```
 Mobile App  →  Load Balancer  →  Analysis Service Cluster
      ↓               ↓                     ↓
Device Check →  Queue Manager  →      [Worker Nodes]
      ↓               ↓                     ↓
Local/Remote →  Priority Queue →       Result Cache
  Decision            ↓                     ↓
                  Database     →  Push Notification Service
```

Key Components:

  • Analysis Engine: Containerized Go analysis library
  • Queue System: Redis/RabbitMQ for job management
  • Caching Layer: Redis for common game patterns
  • Notification Service: Push notifications for completion
Photo Sharing - Alphabetical Username Sharding
We are running a simple photo storage and sharing service. People upload their photos to our servers and then give links to other users who can then view them. We're trying to figure out how to split the photos and associated data evenly onto multiple machines, especially as we get new users. We've decided to shard the photos evenly alphabetically by username. For example, if we had 26 servers, all the usernames starting with 'a' would be on server 1, usernames starting with 'b' would be on server 2, and so on. We have created a scheme like this that will work for any number of servers. Are there any problems with this design? How would you solve them?
Clarifying Questions
  • What's the expected user growth rate and photo upload frequency?
  • Are there any geographic distribution requirements?
  • What's the average photo size and storage requirements?
  • Are there any requirements for data replication or backups?
  • What's the read vs write ratio for the application?
  • Are there any compliance requirements for data location?
Problems with Alphabetical Sharding

Major Issues:

  1. Uneven Distribution: Names starting with certain letters are more common (e.g., 'S', 'M', 'C') leading to hotspots
  2. Predictable Patterns: Usernames often follow patterns (company names, common words) causing skewed distribution
  3. Limited Scalability: Adding new servers requires resharding significant portions of data
  4. Cultural Bias: Distribution varies significantly across languages and cultures
  5. Gaming Vulnerability: Users could exploit the system by choosing usernames strategically
Better Sharding Strategies

Recommended Approaches:

  1. Consistent Hashing:
    • Hash username to get uniform distribution
    • Easy to add/remove servers with minimal data movement
    • Use SHA-256 or similar for even distribution
  2. Range-based Sharding with Monitoring:
    • Monitor shard sizes and rebalance when needed
    • Rebalance automatically by moving or splitting ranges between servers
    • Implement shard splitting when threshold is reached
  3. Hybrid Approach:
    • Use consistent hashing for user data
    • Separate sharding strategy for photos (by ID or date)
    • Implement cross-references for data location
```
  Username   →  Hash Function  →  Shard Selection
      ↓               ↓                  ↓
"john_doe"   →     SHA-256     →  position on hash ring
      ↓               ↓                  ↓
Load Balancer → Consistent Hash Ring → Target Server
```

(With a consistent hash ring, the hash maps to a position on the ring rather than `hash % num_shards`, so adding or removing a server moves only a small fraction of the keys.)
NFRs & FRs

Functional Requirements:

  • Store and retrieve photos for users
  • Generate shareable links for photos
  • Support user authentication and authorization
  • Handle photo metadata and organization

Non-Functional Requirements:

  • Scalability: Handle millions of users and photos
  • Availability: 99.9% uptime for photo access
  • Performance: Fast photo upload/download times
  • Consistency: Eventual consistency for non-critical metadata
  • Durability: Photos should never be lost
Architecture Design
```
 Client  →   CDN   →  Load Balancer  →  API Gateway
    ↓         ↓             ↓                ↓
 Upload  →  Cache  →  Auth Service   →  User Service
    ↓         ↓             ↓                ↓
Storage  →  Metadata DB  →  Consistent Hash Ring
    ↓         ↓             ↓                ↓
 [Photo Storage Shards]  →  [User Data Shards]
```

Key Components:

  • Consistent Hash Ring: Even distribution and easy scaling
  • Photo Storage: Separate sharding by photo ID or date
  • Metadata Database: Stores user-photo relationships
  • CDN: Global distribution for faster access
  • Replication: Multiple copies for durability
Crossword Puzzle - Hints Storage Strategy
Given a crossword puzzle gaming application that gives hints to users, what are the advantages and disadvantages of the following two approaches: fetching hints from the server, or preloading hints on the device?
Clarifying Questions
  • What's the total size of all hints data?
  • How frequently are hints updated or new puzzles added?
  • Are there different difficulty levels requiring different hint sets?
  • What's the target device storage capacity?
  • Are there offline usage requirements?
  • What's the user engagement pattern (daily vs occasional)?
Comparative Analysis
| Aspect | Server Fetching | Device Preloading |
|---|---|---|
| Storage | Minimal device storage | Significant storage required |
| Network | Requires internet connection | Only for updates |
| Performance | Network latency for each request | Instant access |
| Updates | Real-time updates possible | Requires app updates |
| Data Freshness | Always up-to-date | May be stale |
| Offline Support | Not available | Full offline capability |
| Bandwidth Usage | Continuous small requests | Large initial download |
| Server Load | High, scales with users | Low, mainly for updates |
Recommended Hybrid Approach

Intelligent Caching Strategy:

  1. Tiered Storage:
    • Core hints (most common) preloaded on device
    • Extended hints fetched on-demand
    • Personalized hints based on user behavior
  2. Predictive Caching:
    • Cache hints for puzzles likely to be played
    • Background sync during Wi-Fi connection
    • User preference-based caching
  3. Adaptive Strategy:
    • Monitor device storage and network conditions
    • Adjust caching strategy based on usage patterns
    • Implement cache expiration and cleanup
NFRs & FRs

Functional Requirements:

  • Provide hints for crossword puzzles
  • Support offline usage
  • Handle frequent updates and new puzzles
  • Optimize for various device storage capacities

Non-Functional Requirements:

  • Performance: Instant hint access for preloaded hints
  • Scalability: Handle millions of users and puzzles
  • Availability: 99.9% uptime for server fetching
  • Usability: Seamless user experience across devices
  • Efficiency: Minimal device storage usage
Architecture Design
```
  App Launch   →  Check Cache  →  Fetch Missing Hints
       ↓               ↓                  ↓
Device Storage →   Local DB    →     Server API
       ↓               ↓                  ↓
Cache Manager  →  Sync Service →  Background Updates
```
NFRs & FRs

Functional Requirements:

  • Provide hints for crossword puzzles
  • Support multiple difficulty levels
  • Handle hint categories and tags
  • Support offline gameplay

Non-Functional Requirements:

  • Performance: Instant hint delivery (< 100ms)
  • Availability: 99.5% uptime for hint service
  • Scalability: Support millions of concurrent users
  • Efficiency: Minimal bandwidth and storage usage
  • Reliability: Consistent hint quality and accuracy
Implementation Strategy

Cache Management:

  • LRU Cache: Keep the most recently used hints (see the sketch after this list)
  • Size Limits: Configurable cache size based on device
  • Background Sync: Update cache during idle time
  • Compression: Reduce storage footprint
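A minimal sketch of the LRU cache using `collections.OrderedDict`; the capacity and the `fetch_hint_from_server` fallback are illustrative placeholders:

```python
from collections import OrderedDict

class HintCache:
    """Small LRU cache: keeps the most recently used hints on the device."""

    def __init__(self, capacity: int = 500):
        self.capacity = capacity
        self.items = OrderedDict()          # puzzle/clue id -> hint text

    def get(self, hint_id: str) -> str:
        if hint_id in self.items:
            self.items.move_to_end(hint_id)         # mark as recently used
            return self.items[hint_id]
        hint = fetch_hint_from_server(hint_id)      # cache miss: go to the network
        self.put(hint_id, hint)
        return hint

    def put(self, hint_id: str, hint: str) -> None:
        self.items[hint_id] = hint
        self.items.move_to_end(hint_id)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)          # evict the least recently used entry

def fetch_hint_from_server(hint_id: str) -> str:
    return f"hint for {hint_id}"                    # placeholder for the real API call
```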

Network Optimization:

  • Batch Requests: Fetch multiple hints together
  • Delta Updates: Only sync changed hints
  • CDN Integration: Global distribution for faster access
  • Fallback Mechanisms: Graceful degradation without hints
XML Processing - Large File Processing
A huge XML file with sales data needs to be processed. It is huge enough that it cannot be loaded at once given the RAM limitation of the local system. How can we process it?
Clarifying Questions
  • What's the approximate file size and RAM limitations?
  • What type of processing is required (aggregation, transformation, filtering)?
  • Are there any real-time processing requirements?
  • Is the XML structure known and consistent?
  • Can the file be preprocessed or split beforehand?
  • Are there any accuracy requirements (100% vs approximate)?
Processing Strategies

Streaming-Based Approaches:

  1. SAX Parser (Event-driven):
    • Process XML elements as they are read
    • Memory usage remains constant
    • Suitable for sequential processing
  2. StAX Parser (Pull-based):
    • More control over parsing process
    • Can pause/resume processing
    • Better for complex logic
  3. Custom Chunking:
    • Split file into smaller chunks
    • Process each chunk independently
    • Merge results at the end
```
Large XML File →  Stream Parser  →    Processing Logic
      ↓                ↓                     ↓
 File Reader   →  Event Handler  →  Aggregator/Transformer
      ↓                ↓                     ↓
Memory Buffer  →  Partial Results →     Final Output
```
Detailed Implementation

SAX Parser Implementation:

```python
import xml.sax

class SalesDataHandler(xml.sax.ContentHandler):
    """Streams <sale> records one at a time so memory use stays constant."""

    def __init__(self, aggregator):
        super().__init__()
        self.aggregator = aggregator
        self.current = None      # fields of the <sale> element currently being read
        self.field = None

    def startElement(self, name, attrs):
        if name == "sale":
            self.current = {}
        elif self.current is not None:
            self.field = name

    def characters(self, content):
        if self.current is not None and self.field:
            self.current[self.field] = self.current.get(self.field, "") + content

    def endElement(self, name):
        if name == "sale":
            self.aggregator.process(self.current)   # hand the completed record downstream
            self.current = None                     # free memory before the next record
        else:
            self.field = None

# xml.sax.parse("sales.xml", SalesDataHandler(aggregator))  # streams the file end to end
```

Alternative Approaches:

  • MapReduce: For distributed processing across multiple machines
  • Database Streaming: Load data directly into database using bulk operations
  • External Sorting: For operations requiring sorted data
NFRs & FRs

Functional Requirements:

  • Process large XML files without loading entirely into memory
  • Support various processing operations (aggregation, filtering, transformation)
  • Handle malformed XML gracefully
  • Provide progress tracking and resumability

Non-Functional Requirements:

  • Memory Efficiency: Constant memory usage regardless of file size
  • Performance: Process files in reasonable time
  • Scalability: Handle files of any size
  • Reliability: Recover from processing errors
  • Accuracy: Maintain data integrity during processing
Performance Optimizations
| Technique | Memory Usage | Processing Speed | Complexity |
|---|---|---|---|
| SAX Parser | Very Low | Fast | Medium |
| StAX Parser | Low | Medium | High |
| File Chunking | Medium | Variable | Low |
| Parallel Processing | Medium | Very Fast | High |
URL Processing - Smart Engine Service Budget
Given a smart engine service that takes URLs from users and extracts useful data from them, you have to plan the budget for this project. What would you take into consideration? The expectation is essentially: what would you ask the client in order to determine the capacity-estimation parameters?
Clarifying Questions
  • What's the expected number of URLs processed per day/hour?
  • What's the average processing time per URL?
  • What type of data extraction is required (text, images, metadata)?
  • Are there any SLA requirements for processing time?
  • What's the expected user base and growth trajectory?
  • Are there any compliance or legal requirements?
  • What's the acceptable downtime and error rates?
  • Are there any geographic distribution requirements?
Capacity Estimation Parameters

Core Metrics to Gather:

  1. Traffic Patterns:
    • Peak requests per second (RPS)
    • Average requests per day
    • Seasonal variations and growth projections
    • Geographic distribution of users
  2. Processing Requirements:
    • Average URL response time and size
    • Processing complexity (CPU, memory, I/O intensive)
    • Storage requirements for processed data
    • Caching potential and hit rates
  3. Quality Requirements:
    • Availability SLA (99.9%, 99.99%)
    • Response time requirements
    • Error tolerance levels
    • Data retention policies
Budget Components

Infrastructure Costs:

| Component | Cost Factors | Estimation Method |
|---|---|---|
| Compute Resources | CPU, Memory, Number of instances | RPS × Processing time × Resource requirements |
| Storage | Data size, Retention period, Replication | Daily data × Retention days × Replication factor |
| Network | Bandwidth, CDN, Data transfer | Request size × RPS × Geographic distribution |
| Database | Read/Write operations, Storage | Query complexity × Transaction volume |
| Monitoring | Metrics, Logging, Alerting | 5-10% of total infrastructure cost |
Sample Calculation

Example Scenario:

```
Assumptions:
- 1M URLs per day (average)
- 10 seconds average processing time per URL
- Peak traffic: 3x average (during business hours)
- 99.9% availability requirement
- 30-day data retention

Calculations:
- Peak RPS: 1M / (24 * 3600) * 3 = ~35 RPS
- Concurrent processing: 35 * 10 = 350 concurrent jobs
- Server capacity: 350 jobs / 10 jobs per server = 35 servers
- With redundancy (2x): 70 servers
- Storage: 1M URLs * 50KB per result * 30 days = 1.5TB
- Network: 1M * 100KB average page size = 100GB/day
```

Cost Breakdown (Monthly):

  • Compute: 70 servers × $100/month = $7,000
  • Storage: 1.5TB × $50/TB = $75
  • Network: 3TB × $20/TB = $60
  • Database: $500 (managed service)
  • Monitoring: $400
  • Total: ~$8,000/month
NFRs & FRs

Functional Requirements:

  • Accept URLs from users for processing
  • Extract useful data from web pages
  • Store and provide processed results
  • Handle various content types and formats

Non-Functional Requirements:

  • Scalability: Handle varying loads efficiently
  • Performance: Process URLs within acceptable timeframes
  • Availability: Meet SLA requirements
  • Cost Efficiency: Optimize resource utilization
  • Reliability: Consistent service quality
Cost Optimization Strategies

Optimization Techniques:

  • Auto-scaling: Scale resources based on demand
  • Caching: Cache frequently requested URLs
  • Queue Management: Batch processing during off-peak hours
  • Tiered Processing: Different service levels for different users
  • Reserved Instances: Long-term commitments for cost savings
  • Spot Instances: Use cheaper compute for non-critical workloads
Social Media Scaling - International Expansion
A social media app is expanding from US to international regions. What are the things to keep in mind?
Clarifying Questions
  • Which regions are being targeted for expansion?
  • What's the current user base and infrastructure capacity?
  • Are there any specific features popular in target regions?
  • What's the budget and timeline for expansion?
  • Are there any regulatory compliance requirements?
  • What's the expected user growth in new regions?
Technical Considerations

Infrastructure Scaling:

  1. Geographic Distribution:
    • Deploy regional data centers for reduced latency
    • Implement CDN for static content delivery
    • Use edge computing for real-time features
  2. Data Management:
    • Data residency requirements (GDPR, local laws)
    • Cross-region replication strategies
    • Database sharding by geographic regions
  3. Network Optimization:
    • Optimize for different network conditions
    • Implement adaptive bitrate for media content
    • Handle varying connectivity patterns
Localization & Cultural Adaptation

Key Areas:

  1. Language Support:
    • Multi-language UI and content
    • Right-to-left language support
    • Character encoding for different scripts
  2. Cultural Considerations:
    • Local content moderation policies
    • Regional social norms and preferences
    • Local payment methods and currencies
  3. Regulatory Compliance:
    • Data privacy laws (GDPR, CCPA equivalents)
    • Content regulation requirements
    • Local business registration and taxes
Architecture Considerations
```
Global Load Balancer  →  Regional Clusters
        ↓                       ↓
   DNS Routing    →   [US, EU, APAC, LATAM]
        ↓                       ↓
   CDN Network    →    Regional Services
        ↓                       ↓
Content Delivery  →  User Data (Localized)
        ↓                       ↓
  Media Storage   →    Compliance Layer
```


Implementation Strategy

Phased Rollout:

  1. Phase 1: Infrastructure setup and basic localization
  2. Phase 2: Regional beta testing and compliance
  3. Phase 3: Full launch with marketing and support
  4. Phase 4: Feature optimization and local partnerships

📱 Social Media Scaling
Variant 1: College App International Expansion
Given a social media app for college students that is running successfully in the US, how would you scale it for a worldwide release?

🤔 Clarifying Questions:

  • What's the current user base and DAU in the US?
  • Which regions are we targeting first - Europe, Asia, or global?
  • Are there region-specific features needed (language, cultural)?
  • What's the current architecture - monolith or microservices?
  • Do we need to comply with data residency laws (GDPR, etc.)?
  • What's the expected user growth rate per region?

🎯 Main Challenges:

  • Latency: Single US data center creates high latency for global users
  • Compliance: GDPR, data residency laws in different countries
  • Cultural Adaptation: Different social norms and college systems
  • Infrastructure: Need multi-region deployment
  • Content Moderation: Different languages and cultural contexts

✅ Scaling Sample Reasonings:

  • Geographic Distribution: Deploy to multiple AWS/GCP regions (US-East, EU-West, Asia-Pacific)
  • CDN Implementation: CloudFront/CloudFlare for static content delivery
  • Database Sharding: Shard by region/university to reduce cross-region queries
  • Microservices Architecture: Break down monolith for independent scaling
  • Localization Service: Separate service for multi-language support
  • Regional Load Balancers: Route users to nearest data center

📋 System Requirements:

Functional Requirements:

  • User registration and authentication
  • Post creation, sharing, and commenting
  • Real-time messaging and notifications
  • University-specific groups and events
  • Multi-language support

Non-Functional Requirements:

  • Availability: 99.9% uptime globally
  • Latency: <200ms for feed loading, <500ms for posting
  • Scalability: Support 10M+ users across regions
  • Consistency: Eventual consistency for feeds, strong for user data

🛠️ Recommended Architecture:

API Gateway: Amazon API Gateway with regional endpoints

Application Layer: Node.js/Python microservices in containers

Database: PostgreSQL with read replicas per region

Cache: Redis Cluster for session management

Message Queue: Apache Kafka for real-time features

File Storage: S3 with cross-region replication

⚖️ Implementation Strategy:

Phase 1: Deploy read replicas in target regions

Phase 2: Implement CDN and regional load balancing

Phase 3: Add localization and compliance features

Phase 4: Regional data residency and full autonomy

Variant 2: American University App International Release
Some students at an American university built an app and deployed it in the United States, and now they want to release it internationally. What concerns should they address before doing so, and what actions should they take?

🤔 Clarifying Questions:

  • What type of app - social, academic, or utility?
  • What's the current user base and server capacity?
  • Is it monetized? Revenue model?
  • What countries are targeted first?
  • What's the budget for international expansion?
  • Any partnerships with international universities?

🚨 Pre-Launch Concerns:

  • Legal Compliance: GDPR, CCPA, data protection laws
  • Infrastructure Readiness: Single US server won't scale globally
  • Cultural Adaptation: Different education systems and social norms
  • Monetization: Different payment methods and currencies
  • Support & Maintenance: 24/7 support across timezones
  • Security: Enhanced security for global threats

✅ Action Plan:

  • Legal Research: Hire legal counsel for each target country
  • Infrastructure Planning: Multi-region cloud deployment strategy
  • Localization: Translate app and adapt to local education systems
  • Payment Integration: Local payment gateways (Alipay, PayPal, etc.)
  • Security Audit: Penetration testing and compliance audits
  • Beta Testing: Pilot launch in 2-3 countries

📋 Technical Requirements:

Functional:

  • Multi-language support (UI/UX)
  • Local payment processing
  • Timezone-aware features
  • Regional content moderation

Non-Functional:

  • Performance: <3s app load time globally
  • Availability: 99.5% uptime minimum
  • Scalability: Handle 10x current traffic
  • Security: SOC 2 Type II compliance

🛠️ Technical Implementation:

Cloud Provider: AWS/GCP multi-region deployment

CDN: CloudFlare for global content delivery

Database: Regional read replicas with master-slave setup

Monitoring: DataDog/New Relic for global monitoring

Security: WAF, DDoS protection, encrypted communications

⚖️ Risk Mitigation:

Start Small: Launch in English-speaking countries first

Gradual Rollout: Country-by-country expansion

Legal First: Ensure compliance before technical deployment

Monitor & Adapt: Continuous feedback and iteration

Variant 3: US to International Expansion
A social media app is expanding from US to international regions. What are the things to keep in mind?

🤔 Clarifying Questions:

  • What's the current scale - users, requests/second, data volume?
  • Which regions are prioritized - developed or emerging markets?
  • What's the content type - text, images, video, or all?
  • Are there existing partnerships or local presence?
  • What's the expansion timeline and budget?

🌍 Key Considerations:

  • Infrastructure: Latency, bandwidth, server placement
  • Regulatory: Data protection, content laws, censorship
  • Cultural: Social norms, communication styles, features
  • Technical: Device types, network conditions, app store policies
  • Business: Monetization, local competition, partnerships

✅ Comprehensive Strategy:

  • Infrastructure Scaling:
    • Deploy edge servers in target regions
    • Implement global CDN strategy
    • Regional database placement
  • Compliance & Legal:
    • GDPR compliance for EU
    • Data residency requirements
    • Content moderation policies
  • Localization:
    • Multi-language support
    • Cultural adaptation of features
    • Local payment methods

📋 System Requirements:

This is a read-heavy system with high availability needs

Functional Requirements:

  • User feeds and timelines
  • Real-time messaging
  • Content sharing and discovery
  • Notifications and alerts

Non-Functional Requirements:

  • Latency: <200ms for feed loading globally
  • Availability: 99.99% uptime
  • Consistency: Eventual consistency for feeds
  • Scalability: Support 100M+ global users

🛠️ Global Architecture:

Load Balancing: GeoDNS routing to regional data centers

Application: Microservices with auto-scaling

Database: Sharded PostgreSQL with regional replicas

Cache: Multi-tier caching (Redis, CDN)

Message Queue: Apache Kafka for real-time features

Storage: S3 with cross-region replication
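
The routing layer above can be illustrated with a small sketch. The region names and endpoints below are hypothetical; a real deployment would resolve them through GeoDNS or service discovery rather than a hard-coded map. Writes go to the primary region while reads prefer the nearest regional replica.

```python
# Hypothetical regional endpoints; production systems would resolve these
# via GeoDNS or service discovery instead of a hard-coded map.
READ_REPLICAS = {
    "us": "db-read.us-east.example.com",
    "eu": "db-read.eu-west.example.com",
    "apac": "db-read.ap-south.example.com",
}
PRIMARY = "db-primary.us-east.example.com"
DEFAULT_REGION = "us"

def pick_endpoint(user_region, operation):
    """Send writes to the primary region; serve reads from the nearest replica."""
    if operation == "write":
        return PRIMARY
    return READ_REPLICAS.get(user_region, READ_REPLICAS[DEFAULT_REGION])

# Example usage
print(pick_endpoint("eu", "read"))   # db-read.eu-west.example.com
print(pick_endpoint("eu", "write"))  # db-primary.us-east.example.com
```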

⚖️ Implementation Phases:

Phase 1: CDN and edge caching deployment

Phase 2: Regional read replicas

Phase 3: Full regional data centers

Phase 4: Local compliance and partnerships

๐Ÿ† Leaderboard & Search
Top 100 Contestants Leaderboard
Efficient way to find top 100 contestants leaderboard and search contestant's progress by name.

🤔 Clarifying Questions:

  • How many total contestants are there?
  • How frequently do scores update?
  • Are there different categories or one global leaderboard?
  • What's the read vs write ratio?
  • Do we need real-time updates or near real-time is acceptable?
  • Are there historical leaderboards (daily, weekly, monthly)?

🎯 Core Challenges:

  • Performance: Fast retrieval of top 100 from millions of users
  • Consistency: Ensuring leaderboard accuracy with frequent updates
  • Search Efficiency: Quick name-based search across large dataset
  • Scalability: Handle increasing number of contestants
  • Real-time Updates: Show live score changes

✅ Efficient Sample Reasonings:

  • Leaderboard Storage:
    • Redis Sorted Sets for O(log N) operations
    • Maintain top 100 in memory cache
    • Use ZADD for score updates, ZREVRANGE for the top 100 in descending order (see the sketch after this list)
  • Search Implementation:
    • Elasticsearch for full-text name search
    • Trie data structure for autocomplete
    • Hash index on contestant names
  • Data Pipeline:
    • Apache Kafka for real-time score updates
    • Stream processing for leaderboard updates
    • Batch processing for historical data
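
A minimal sketch of the sorted-set approach above, assuming a locally reachable Redis instance and the redis-py client; the key name and contestant names are illustrative.

```python
import redis

# Assumes a locally reachable Redis instance; key and names are illustrative.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)
LEADERBOARD = "leaderboard:global"

def record_score(contestant, score):
    # ZADD inserts or updates the contestant's score in O(log N).
    r.zadd(LEADERBOARD, {contestant: score})

def top_100():
    # ZREVRANGE returns members ordered by score, highest first.
    return r.zrevrange(LEADERBOARD, 0, 99, withscores=True)

def rank_of(contestant):
    # ZREVRANK gives the 0-based position in descending score order.
    rank = r.zrevrank(LEADERBOARD, contestant)
    return None if rank is None else rank + 1

# Example usage
record_score("alice", 1520)
record_score("bob", 1490)
print(top_100()[:2])   # [('alice', 1520.0), ('bob', 1490.0)]
print(rank_of("bob"))  # 2
```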

📋 System Requirements:

This is a read-heavy system with occasional writes

Functional Requirements:

  • Display top 100 contestants
  • Search contestants by name
  • Show individual contestant progress
  • Real-time score updates

Non-Functional Requirements:

  • Latency: <50ms for leaderboard, <100ms for search
  • Throughput: 10K reads/sec, 1K writes/sec
  • Availability: 99.9% uptime
  • Consistency: Strong consistency for scores

🛠️ Technical Architecture:

Cache Layer: Redis Cluster with Sorted Sets

Database: PostgreSQL for persistent storage

Search Engine: Elasticsearch for name search

Message Queue: Apache Kafka for real-time updates

API Gateway: Rate limiting and caching

Monitoring: Prometheus for metrics

⚖️ Design Decisions:

Redis Sorted Sets: Chosen for O(log N) complexity and atomic operations

Elasticsearch: Superior search capabilities vs SQL LIKE queries

Kafka: Ensures reliable message delivery for score updates

Caching Strategy: Cache top 100 for 1 minute, invalidate on updates
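
To complement Elasticsearch for full name search, the autocomplete path mentioned under Search Implementation can be served from an in-memory trie. Below is a dependency-free sketch of that idea; the class and contestant names are illustrative.

```python
from collections import defaultdict

class _TrieNode:
    def __init__(self):
        self.children = defaultdict(_TrieNode)
        self.names = set()  # contestant names sharing this prefix

class NameIndex:
    """Illustrative in-memory prefix index for contestant-name autocomplete."""

    def __init__(self):
        self.root = _TrieNode()

    def add(self, name):
        node = self.root
        for ch in name.lower():
            node = node.children[ch]
            node.names.add(name)  # trades memory for O(len(prefix)) lookups

    def suggest(self, prefix, limit=10):
        node = self.root
        for ch in prefix.lower():
            if ch not in node.children:
                return []
            node = node.children[ch]
        return sorted(node.names)[:limit]

# Example usage
index = NameIndex()
for name in ("Alice", "Alicia", "Bob"):
    index.add(name)
print(index.suggest("ali"))  # ['Alice', 'Alicia']
```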

📱 Mobile App Media Content
Variant 1: Puzzle App Media Trade-offs
A mobile application for playing puzzles has some media content with it - audio, video and images. What are the trade-offs for fetching these media online or storing them offline in the app?

🤔 Clarifying Questions:

  • What's the total size of media content (MB/GB)?
  • How frequently is media content updated?
  • What's the target device storage capacity?
  • Are users primarily on WiFi or cellular?
  • Is the app used offline frequently?
  • What's the user retention rate?
  • Are there different puzzle categories with different media?

🎯 Main Issues & Analysis:

Core Challenge: Balancing user experience (fast loading, offline capability) with technical constraints (storage, bandwidth, app size)

Key Considerations: App store size limits, device storage, network reliability, content freshness

📱 Offline Storage (Bundled) - Pros:

  • Instant Loading: No network latency, immediate media access
  • Offline Capability: App works without internet connection
  • Consistent Experience: No loading delays or failed downloads
  • Reduced Server Load: No CDN costs for media delivery
  • Battery Efficiency: No network calls for media

📱 Offline Storage - Cons:

  • Large App Size: App store download barriers, storage requirements
  • Update Complexity: Entire app update needed for media changes
  • Device Storage: Consumes user device storage permanently
  • Initial Download Time: Longer first-time installation
  • Version Management: Difficult to A/B test media content

🌐 Online Fetching - Pros:

  • Smaller App Size: Faster downloads, lower storage requirements
  • Dynamic Content: Easy to update, personalize, and A/B test
  • Scalable Storage: Unlimited content without app size constraints
  • Analytics: Track media usage patterns and preferences
  • Personalization: Serve relevant media based on user behavior

🌐 Online Fetching - Cons:

  • Network Dependency: Requires internet connection
  • Loading Delays: Network latency affects user experience
  • Bandwidth Costs: Data usage for users, CDN costs for company
  • Inconsistent Performance: Depends on network quality
  • Battery Drain: Network operations consume more battery

📋 Requirements Analysis:

Functional Requirements:

  • Display high-quality images, videos, and audio
  • Support offline puzzle gameplay
  • Handle media updates and new content
  • Maintain consistent user experience

Non-Functional Requirements:

  • Performance: <2s loading time for media
  • Availability: 99.9% uptime for online content
  • Scalability: Handle 1M+ concurrent users
  • Storage: <100MB app size preferred
  • Bandwidth: Optimize for 3G/4G networks

🛠️ Hybrid Approach - Best of Both Worlds:

Core Assets Offline: Essential UI elements, basic sounds, loading screens

Dynamic Content Online: Puzzle-specific media, seasonal content

Intelligent Caching: Cache frequently accessed media locally

Progressive Loading: Load media as needed, not all at once

Compression: WebP for images, compressed audio formats

⚖️ Recommended Hybrid Strategy:

Implementation Plan:

  • Critical Assets Offline: Core UI, essential sounds (<20MB)
  • Lazy Loading: Download puzzle media when accessed
  • Smart Caching: Cache user's favorite puzzle categories
  • Offline Graceful Degradation: Basic functionality without media
  • CDN with Edge Caching: CloudFront for fast global delivery
  • Progressive Enhancement: Better experience with good connectivity

This approach provides fast initial loading while maintaining flexibility for content updates.
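
A minimal, platform-agnostic sketch of the lazy-loading cache at the heart of this strategy, written in Python for brevity; a real app would use the platform's native networking and storage APIs, and the cache directory here is an assumption.

```python
import hashlib
import pathlib
import urllib.request

CACHE_DIR = pathlib.Path("media_cache")  # illustrative on-device cache location
CACHE_DIR.mkdir(exist_ok=True)

def get_media(url):
    """Return media bytes, downloading from the CDN only on a cache miss."""
    cached = CACHE_DIR / hashlib.sha256(url.encode()).hexdigest()
    if cached.exists():
        return cached.read_bytes()             # cache hit: instant, works offline
    with urllib.request.urlopen(url) as resp:  # cache miss: lazy download
        data = resp.read()
    cached.write_bytes(data)                   # persist for later offline use
    return data
```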

🤖 ML Service Scaling
Variant 1: Sports News ML Service
An ML-based service exists for a sports news app. What should be kept in mind when evaluating the service's scaling needs for the next year?

🤔 Clarifying Questions:

  • What ML tasks does the service perform (recommendation, classification, NLP)?
  • What's the current traffic pattern - real-time or batch processing?
  • How computationally intensive are the ML models?
  • What's the expected user growth rate?
  • Are there seasonal patterns (sports seasons, major events)?
  • What's the current infrastructure (CPU, GPU, cloud)?
  • How frequently are models retrained?
  • What's the latency requirement for predictions?

🎯 Main Issues & Analysis:

Core Challenge: ML workloads are computationally expensive and have unpredictable scaling patterns

Key Considerations: Model inference latency, training scalability, seasonal traffic spikes, cost optimization

📊 Scaling Factors to Consider:

🚀 Traffic & User Growth:

  • User Base Expansion: Estimate 2-5x growth in active users
  • Geographic Expansion: New markets mean different sports, languages
  • Feature Expansion: New ML features increase compute requirements
  • Real-time vs Batch: Shift towards real-time personalization

⚡ Performance Requirements:

  • Latency: <100ms for real-time recommendations
  • Throughput: Handle 10K+ predictions per second
  • Availability: 99.95% uptime during peak sports events
  • Model Accuracy: Maintain quality while scaling

📈 Seasonal & Event-Based Scaling:

  • Sports Seasons: 10x traffic during major tournaments
  • Breaking News: Sudden spikes during major events
  • Time Zones: Global events create rolling traffic waves
  • Weekend Patterns: Higher engagement during games

📋 ML Service Requirements:

Functional Requirements:

  • Personalized sports news recommendations
  • Real-time content classification and tagging
  • Sentiment analysis for articles and comments
  • Trending topic detection
  • User engagement prediction

Non-Functional Requirements:

  • Scalability: Auto-scale from 1K to 100K+ requests/sec
  • Latency: <100ms P95 for inference
  • Availability: 99.95% uptime
  • Cost Efficiency: Optimize compute costs per prediction
  • Model Freshness: Daily model updates

🛠️ Scalable ML Architecture:

Model Serving: Kubernetes with HPA, TensorFlow Serving

Load Balancing: NGINX with health checks and circuit breakers

Caching: Redis for frequently requested predictions

GPU Infrastructure: AWS EC2 P4/G4 instances for training

Auto-scaling: KEDA for event-driven scaling

Monitoring: Prometheus + Grafana for model performance

Feature Store: Feast for consistent feature serving
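
As one concrete example of the serving layer above, a client can call a model hosted behind TensorFlow Serving's REST API; the model name (sports_recommender), host, and feature layout here are assumptions, while the default REST port 8501 and the {"instances": ...} payload shape come from TensorFlow Serving's documented API.

```python
import requests

# Hypothetical model name and feature layout; TensorFlow Serving's REST API
# listens on port 8501 by default and expects {"instances": [...]}.
PREDICT_URL = "http://model-serving:8501/v1/models/sports_recommender:predict"

def predict(user_features):
    payload = {"instances": [user_features]}
    # A tight timeout keeps the call inside the <100ms latency budget.
    resp = requests.post(PREDICT_URL, json=payload, timeout=0.1)
    resp.raise_for_status()
    return resp.json()["predictions"]
```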

๐Ÿ—๏ธ Scaling Strategies:

1. Horizontal Scaling:

  • Model Replicas: Deploy multiple model instances
  • Load Balancing: Distribute requests across replicas
  • Auto-scaling: Scale based on CPU, memory, and queue depth

2. Model Optimization:

  • Model Quantization: Reduce model size and inference time
  • Model Pruning: Remove unnecessary parameters
  • Ensemble Optimization: Use lighter models for real-time, heavy for batch

3. Caching & Preprocessing:

  • Result Caching: Cache common predictions
  • Feature Caching: Precompute expensive features
  • Batch Processing: Process similar requests together
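
A minimal sketch of the result-caching idea in strategy 3, assuming Redis as the cache, a hypothetical model_predict callable, and a 5-minute freshness window.

```python
import json
import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
TTL_SECONDS = 300  # assumption: 5-minute freshness is acceptable for recommendations

def cached_predict(user_id, model_predict):
    """Serve repeat requests from Redis; fall through to the model on a miss."""
    key = f"pred:{user_id}"
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    result = model_predict(user_id)                     # expensive inference call
    cache.set(key, json.dumps(result), ex=TTL_SECONDS)  # cache with expiry
    return result
```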

⚖️ Scaling Plan for Next Year:

Phase 1 (0-3 months):

  • Implement horizontal auto-scaling with Kubernetes
  • Add Redis caching for frequent predictions
  • Set up comprehensive monitoring and alerting

Phase 2 (3-6 months):

  • Model optimization and quantization
  • Implement feature store for consistency
  • Add GPU-based training pipeline

Phase 3 (6-12 months):

  • Multi-region deployment for global scaling
  • Advanced caching strategies
  • Cost optimization and right-sizing

Budget 3-5x current infrastructure costs to handle projected growth.

🖥️ Server Capacity Planning
Variant 1: Storage Size & CPU Load Calculation
Some inputs/numbers are given. Based on storage size and CPU/query load, calculate how many servers we need in each case.

🤔 Clarifying Questions:

  • What are the specific input numbers provided?
  • What's the expected query rate (QPS)?
  • What's the total storage requirement?
  • What's the per-server storage capacity?
  • What's the CPU utilization per query?
  • What's the target CPU utilization threshold?
  • Do we need replication for high availability?
  • Are there read vs write workload differences?

🎯 Main Issues & Analysis:

Core Challenge: Right-sizing infrastructure to handle both storage and compute requirements efficiently

Key Considerations: Storage vs compute bottlenecks, replication overhead, peak vs average load

📊 Capacity Planning Framework:

💾 Storage-Based Calculation:

Formula: Number of servers = (Total Storage Required × Replication Factor) / (Usable Storage per Server)

Example Calculation:

  • Total Storage Required: 10TB
  • Replication Factor: 3 (for high availability)
  • Server Storage: 2TB SSD (80% usable = 1.6TB)
  • Servers Needed: (10TB × 3) / 1.6TB = 18.75, rounded up to 19 servers

⚡ CPU/Query Load Calculation:

Formula: Number of servers = (Peak QPS × CPU per Query) / (CPU Cores × Target Utilization)

Example Calculation:

  • Peak QPS: 100,000 queries/second
  • CPU per Query: 10ms of CPU time
  • Server Specs: 16 cores, 2.5GHz
  • Target Utilization: 70%
  • Servers Needed: (100,000 × 0.01s) / (16 × 0.7) = 89.3, rounded up to 90 servers

📋 System Requirements:

Functional Requirements:

  • Handle specified query load with acceptable latency
  • Store required data with durability guarantees
  • Provide high availability and fault tolerance
  • Support expected growth patterns

Non-Functional Requirements:

  • Performance: <100ms P95 latency
  • Availability: 99.9% uptime
  • Scalability: Handle 2x growth without redesign
  • Cost: Optimize for cost per query/GB

🛠️ Capacity Planning Tools & Approach:

Monitoring: Prometheus + Grafana for resource utilization

Load Testing: JMeter/K6 for performance benchmarking

Capacity Modeling: Excel/Python for growth projections

Auto-scaling: AWS Auto Scaling Groups, Kubernetes HPA

Cost Optimization: Reserved instances, spot instances for batch jobs

๐Ÿ—๏ธ Detailed Calculation Steps:

Step 1: Gather Requirements

  • Peak QPS and average QPS
  • Storage requirements (current and projected)
  • Latency requirements (P95, P99)
  • Availability requirements
  • Budget constraints

Step 2: Server Specifications

  • CPU: Cores, clock speed, architecture
  • Memory: RAM size, type (DDR4/DDR5)
  • Storage: SSD/HDD, capacity, IOPS
  • Network: Bandwidth, latency

Step 3: Workload Analysis

  • CPU utilization per query type
  • Memory usage patterns
  • I/O patterns (read/write ratio)
  • Network bandwidth requirements

Step 4: Safety Margins

  • Peak load multiplier: 2-3x average
  • Growth buffer: 50-100% for next year
  • Failure tolerance: N+1 or N+2 redundancy
  • Maintenance windows: 10-15% overhead

⚖️ Final Recommendation:

Choose the Higher Number: Take the maximum of storage-based and CPU-based calculations

Example Final Calculation:

  • Storage-based: 19 servers
  • CPU-based: 90 servers
  • Choose: 90 servers (CPU is the bottleneck)
  • Add growth buffer: 90 × 1.5 = 135 servers
  • Add redundancy: 135 × 1.1 = 148.5, rounded up to 149 servers (recomputed in the sketch below)
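
A small sketch that reproduces the arithmetic above so the numbers can be recomputed as inputs change; all values are the example figures from this section, and the 80% usable-storage fraction matches the storage example.

```python
import math

def storage_servers(total_tb, replication, per_server_tb, usable_fraction=0.8):
    """Servers needed to hold the replicated data set."""
    usable = per_server_tb * usable_fraction
    return math.ceil(total_tb * replication / usable)

def cpu_servers(peak_qps, cpu_seconds_per_query, cores, target_utilization):
    """Servers needed to absorb peak CPU load at the target utilization."""
    return math.ceil(peak_qps * cpu_seconds_per_query / (cores * target_utilization))

# Example figures from this section
storage_based = storage_servers(total_tb=10, replication=3, per_server_tb=2)   # 19
cpu_based = cpu_servers(peak_qps=100_000, cpu_seconds_per_query=0.01,
                        cores=16, target_utilization=0.7)                      # 90
baseline = max(storage_based, cpu_based)        # take the higher number (CPU bottleneck)
with_growth = math.ceil(baseline * 1.5)         # 50% growth buffer -> 135
final_count = math.ceil(with_growth * 1.1)      # ~10% redundancy/overhead -> 149
print(storage_based, cpu_based, with_growth, final_count)
```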