Disclaimer

The scenarios on this page are generalized system design case studies inspired by publicly discussed problems and common industry patterns. They are not tied to any specific company, interview process, or proprietary material. The reasoning presented represents one possible way to think about the problem; multiple valid designs may exist depending on constraints.

🚀 System Design Sample Reasonings

Complete Study Guide - 25 Question Variants Across 15 Topics

🎵 Music Applications
Variant 1: Multi-Server Architecture
We have a music application up and running. What are the advantages and disadvantages of running the system on multiple servers?

🤔 Clarifying Questions:

  • What's the current user base and expected growth?
  • Are we talking about horizontal scaling or service decomposition?
  • What's the current bottleneck - CPU, storage, or network?
  • Do we need global distribution or single region?
  • What's the read/write ratio for music streaming vs uploads?

🎯 Main Issues & Analysis:

Current State: Single-server architecture hitting scalability limits

Core Challenge: Balancing scalability benefits with operational complexity

✅ Multi-Server Advantages:

  • Horizontal Scalability: Handle more concurrent users by adding servers
  • High Availability: Eliminate single point of failure
  • Geographic Distribution: Reduce latency with CDN and edge servers
  • Load Distribution: Separate read-heavy (streaming) from write-heavy (uploads) workloads
  • Resource Optimization: Dedicated servers for different functions (streaming, metadata, user management)

โŒ Multi-Server Disadvantages:

  • Data Consistency: Eventual consistency challenges across replicas
  • Network Complexity: Inter-server communication overhead
  • Operational Overhead: Monitoring, deployment, and maintenance complexity
  • Cost: Higher infrastructure and operational costs
  • Distributed System Challenges: Network partitions, CAP theorem trade-offs

📋 Requirements Analysis:

Functional Requirements:

  • Stream music with low latency
  • Upload and store music files
  • User authentication and playlists
  • Search and discovery

Non-Functional Requirements:

  • Availability: 99.9% uptime
  • Scalability: Handle 10M+ concurrent users
  • Latency: <100ms for metadata, <2s for streaming start
  • Consistency: Strong for user data, eventual for music catalog

🛠️ Recommended Architecture:

Load Balancer: NGINX/HAProxy for traffic distribution

Application Servers: Multiple instances behind load balancer

Database: Master-slave MySQL for metadata, sharded by user_id

File Storage: S3/GCS for music files with CDN

Cache: Redis for session management and popular content

⚖️ Final Recommendation:

Go with multi-server architecture because:

  • Music streaming is inherently read-heavy and benefits from horizontal scaling
  • Global user base requires geographic distribution
  • The availability benefits outweigh complexity costs
  • Start with microservices: User Service, Music Service, Streaming Service
Variant 2: Consistent Hashing Load Distribution
Given a music streaming and uploading service: the system is highly scalable and works on consistent hashing, and load is distributed equally across all servers. Do you see any concerns with this architecture?

🤔 Clarifying Questions:

  • What's the hashing key - user_id, song_id, or something else?
  • Are we talking about data distribution or request routing?
  • What's the replication factor?
  • How do we handle hot users/songs?
  • What happens during server failures?

🚨 Concerns with Pure Consistent Hashing:

Hot Data Problem: Popular songs/artists create uneven load despite equal distribution

Geographic Latency: Users may hit servers far from their location

Cache Efficiency: Related data scattered across servers

✅ Sample Reasonings & Improvements:

  • Virtual Nodes: Each physical server handles multiple virtual nodes for better distribution
  • Load-Aware Routing: Monitor server load and route requests to less loaded servers
  • Hierarchical Hashing: Geographic clustering + consistent hashing within regions
  • Hot Data Replication: Replicate popular content to multiple servers
  • Adaptive Caching: Cache popular content at edge servers

📋 System Requirements:

Functional: Consistent data access, fault tolerance

Non-Functional: Low latency, high availability, even load distribution

This is a read-heavy system with occasional writes (uploads)

🛠️ Enhanced Architecture:

Consistent Hashing Ring: With virtual nodes (150-200 per server)
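A minimal sketch of such a ring with virtual nodes, assuming illustrative server names and 150 virtual nodes per physical server (the class and method names are not from any specific library):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hash ring with virtual nodes for smoother load distribution."""

    def __init__(self, servers, vnodes=150):
        self.vnodes = vnodes
        self.ring = []          # sorted list of (hash, server) points on the ring
        for server in servers:
            self.add_server(server)

    def _hash(self, key: str) -> int:
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def add_server(self, server: str) -> None:
        # Each physical server owns `vnodes` points on the ring.
        for i in range(self.vnodes):
            self.ring.append((self._hash(f"{server}#{i}"), server))
        self.ring.sort()

    def remove_server(self, server: str) -> None:
        self.ring = [(h, s) for h, s in self.ring if s != server]

    def get_server(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        h = self._hash(key)
        idx = bisect.bisect_left(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["stream-1", "stream-2", "stream-3"])
print(ring.get_server("song:12345"))   # routes this song ID to one of the servers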

Load Balancer: Weighted round-robin with health checks

CDN: CloudFront/CloudFlare for static music files

Monitoring: Real-time load monitoring and alerting

⚖️ Recommendation:

Use enhanced consistent hashing with:

  • Virtual nodes for better distribution
  • Geographic awareness for latency
  • Hot data detection and replication
  • Load monitoring and adaptive routing
Variant 3: Distributed Song Storage
A music streaming service has the songs distributed across servers. What are the potential problems?

🤔 Clarifying Questions:

  • How are songs distributed - by artist, genre, or random?
  • What's the replication strategy?
  • How do we handle metadata vs audio files?
  • What's the failover mechanism?
  • How do we discover which server has a specific song?

🚨 Potential Problems:

  • Single Point of Failure: If server with unique song goes down, song becomes unavailable
  • Metadata Inconsistency: Song metadata and audio files may be out of sync
  • Discovery Overhead: Need efficient way to locate songs across servers
  • Hot Partitions: Servers with popular songs get overloaded
  • Network Latency: Cross-server requests for playlists spanning multiple servers
  • Backup Complexity: Ensuring all songs are properly backed up

✅ Sample Reasonings:

  • Replication Strategy: Minimum 3 replicas per song across different servers
  • Metadata Service: Centralized metadata store with song location mapping
  • Load Balancing: Intelligent routing to least loaded replica
  • Caching Strategy: Cache popular songs at edge servers
  • Health Monitoring: Real-time server health checks and automatic failover

📋 System Design Considerations:

Availability: 99.95% - music must always be accessible

Consistency: Eventual consistency for song catalog, strong for user preferences

Partition Tolerance: System must work despite network failures

Read-Heavy Workload: Optimize for fast reads, streaming

🛠️ Recommended Architecture:

Metadata Store: MongoDB cluster with song-to-server mapping

File Storage: Distributed file system (HDFS) or object storage (S3)

Service Discovery: Consul/Eureka for server registry

Load Balancer: HAProxy with health checks

CDN: Multi-tier caching strategy

⚖️ Final Recommendation:

Implement distributed storage with:

  • 3-way replication for high availability
  • Separate metadata and audio file storage
  • Intelligent load balancing and failover
  • CDN for popular content
📁 File Storage & Deduplication
Variant 1: File Size & Byte Comparison
Users upload similar files to the server. To save space, we check the file size; if two files have the same size, we compare them byte by byte and then create a symlink from the new file to the old one. What are the problems with this system, and is there a more optimized approach?

🤔 Clarifying Questions:

  • What's the average file size and upload frequency?
  • Are files shared between users or private?
  • What's the storage budget and deduplication target?
  • Do we need to maintain separate file metadata per user?
  • What's the acceptable latency for uploads?

🚨 Problems with Current Approach:

  • Performance Bottleneck: Byte-by-byte comparison is O(n) and CPU intensive
  • Size Collisions: Many distinct files share the same size, so every collision triggers an expensive full byte-by-byte comparison
  • Scalability Issues: Doesn't scale with large files or high upload frequency
  • Symlink Fragility: Symlinks break if original file is deleted
  • Metadata Confusion: Different users' files may have same content but different metadata

✅ Optimized Approach:

  • Content Hashing: Use a SHA-256 hash as the primary deduplication key (avoid MD5, which is collision-prone)
  • Chunked Hashing: For large files, use rolling hash (similar to rsync)
  • Reference Counting: Track how many users reference each unique file
  • Metadata Separation: Store file content once, metadata per user
  • Lazy Deletion: Only delete files when reference count reaches zero

📋 System Requirements:

Functional: Efficient deduplication, data integrity, user isolation

Non-Functional: Fast uploads, space efficiency, data consistency

Write-Heavy System: Optimize for quick file processing

🛠️ Enhanced Architecture:

Hash Database: Redis for fast hash lookups

File Storage: Content-addressed storage (hash-based paths)

Metadata DB: PostgreSQL for user file metadata

Queue System: Async processing for hash computation

Key Insight: Move from comparing files to comparing hashes - the file is still read once to compute its hash, but the dedup check becomes an O(1) index lookup instead of O(n) pairwise byte comparisons
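A minimal sketch of that upload path, assuming an in-memory dict stands in for the hash index (Redis in the architecture above) and that the storage-copy step is only hinted at in a comment:

```python
import hashlib

hash_index = {}   # content hash -> {"path": storage path, "refs": reference count}

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large uploads never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def store_upload(tmp_path: str) -> str:
    """Return the content hash; store the bytes only if they have never been seen."""
    content_hash = sha256_of_file(tmp_path)
    entry = hash_index.get(content_hash)      # O(1) lookup instead of byte comparison
    if entry:
        entry["refs"] += 1                    # existing content: just add a reference
    else:
        storage_path = f"objects/{content_hash[:2]}/{content_hash}"
        # copy tmp_path into content-addressed storage at storage_path here
        hash_index[content_hash] = {"path": storage_path, "refs": 1}
    return content_hash
```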

⚖️ Implementation Strategy:

  1. Hash Generation: Compute SHA-256 during upload
  2. Hash Lookup: Check if hash exists in database
  3. Reference Management: Increment reference count or store new file
  4. Metadata Storage: Always store per-user metadata separately
Variant 2: General File Deduplication
File system: users upload files, and sometimes they upload duplicate files. Find a way to avoid storing duplicate files.

🤔 Clarifying Questions:

  • What's the scale - number of users and files?
  • Are duplicates within same user or across users?
  • What file types and sizes are we dealing with?
  • Do we need to preserve user-specific metadata?
  • What's the privacy requirement - can users access others' files?

🎯 Core Challenge:

Duplicate Detection: Efficiently identify identical files across large dataset

Privacy Concerns: Users shouldn't access each other's files even if identical

Metadata Handling: Same content, different names/permissions per user

✅ Comprehensive Sample Reasoning:

  • Content-Addressable Storage: Store files by content hash
  • Multi-Level Deduplication:
    • File-level: Complete file deduplication
    • Block-level: Chunk-based deduplication for large files
  • Virtual File System: User sees their own file tree, backend stores deduplicated content
  • Reference Tracking: Maintain reference count per content hash

📋 System Requirements:

Scalability: Handle millions of files efficiently

Privacy: Strong user isolation

Consistency: Strong consistency for user data

Space Efficiency: Maximize storage savings

🛠️ Architecture Components:

Content Store: S3/GCS with content-addressable paths

Metadata DB: PostgreSQL with user file mappings

Hash Index: Redis for fast content hash lookups

Processing Queue: Kafka for async deduplication

Block Storage: For chunk-level deduplication

Architecture Pattern: Separate content storage from user namespace - one content store, multiple user views
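A minimal sketch of the "one content store, multiple user views" idea, using plain dicts for the user namespace and reference counts (names and the placeholder hash are illustrative, not a real storage API):

```python
content_refs = {}                 # content hash -> reference count
user_files = {}                   # (user_id, filename) -> content hash

def add_file(user_id: str, filename: str, content_hash: str) -> None:
    # The user's namespace points at shared content; the bytes exist only once.
    user_files[(user_id, filename)] = content_hash
    content_refs[content_hash] = content_refs.get(content_hash, 0) + 1

def delete_file(user_id: str, filename: str) -> None:
    content_hash = user_files.pop((user_id, filename))
    content_refs[content_hash] -= 1
    if content_refs[content_hash] == 0:
        del content_refs[content_hash]
        # garbage-collect the stored object here (e.g. delete objects/<hash>)

add_file("alice", "song.mp3", "ab12...")
add_file("bob", "track.mp3", "ab12...")   # same content, different name: no extra copy
delete_file("alice", "song.mp3")          # bob's reference keeps the bytes alive
```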

⚖️ Implementation Approach:

  1. Upload Process: Hash → Check existence → Store/Reference
  2. User Interface: Virtual file system with user-specific metadata
  3. Garbage Collection: Periodic cleanup of unreferenced content
  4. Monitoring: Track deduplication ratio and storage savings
🔒 Password Validation
Variant 1: English Word Restriction
Password validation: the password must be an English word of 8-16 characters with at least 1 uppercase letter and 1 symbol. What is the problem with this approach?

🤔 Clarifying Questions:

  • Are we allowing dictionary words or requiring non-dictionary words?
  • What's the target user base - global or English-speaking only?
  • What's the security threat model we're protecting against?
  • Are there regulatory compliance requirements?
  • What's the acceptable user experience for password creation?

🚨 Major Problems:

  • Dictionary Attack Vulnerability: English words are easily guessable
  • Reduced Entropy: Limited to ~170,000 English words vs random combinations
  • Predictable Patterns: Users will add symbols/caps in predictable ways
  • Cultural Bias: Excludes non-English speakers
  • Brute Force Weakness: Much smaller search space for attackers

✅ Improved Approach:

  • Entropy-Based Validation: Minimum 50+ bits of entropy
  • Blacklist Common Passwords: Check against known breach databases
  • Passphrase Support: Allow multiple words with spaces
  • Multi-Factor Authentication: Reduce password burden with 2FA
  • Password Strength Meter: Real-time feedback to users

📋 Security Requirements:

Functional: Strong authentication, user-friendly validation

Non-Functional: High security, good UX, regulatory compliance

Threat Model: Protect against credential stuffing, brute force, social engineering

🛠️ Security Architecture:

Password Hashing: bcrypt/scrypt/Argon2 with salt

Breach Database: HaveIBeenPwned API integration

Rate Limiting: Prevent brute force attempts

Audit Logging: Track authentication attempts

Key Principle: Security through entropy, not obscurity. Random characters > predictable patterns
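A minimal sketch of entropy-based validation, assuming a rough character-class entropy estimate and a small local set standing in for a breach database (a real system would query something like the HaveIBeenPwned k-anonymity API instead):

```python
import math
import string

BREACHED = {"P@ssword1", "Summer2024!", "Welcome1!"}   # stand-in for a breach database

def estimated_entropy_bits(password: str) -> float:
    """Rough upper bound: size of the character classes used, raised to the length."""
    pool = 0
    if any(c in string.ascii_lowercase for c in password): pool += 26
    if any(c in string.ascii_uppercase for c in password): pool += 26
    if any(c in string.digits for c in password):          pool += 10
    if any(c in string.punctuation for c in password):     pool += len(string.punctuation)
    return len(password) * math.log2(pool) if pool else 0.0

def is_acceptable(password: str) -> bool:
    if password in BREACHED:            # known-breached passwords are rejected outright
        return False
    return len(password) >= 12 or estimated_entropy_bits(password) >= 50

print(is_acceptable("P@ssword1"))                      # False: found in the breach list
print(is_acceptable("correct horse battery staple"))   # True: long passphrase
```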

⚖️ Recommended Policy:

  • Minimum 12 characters OR high entropy score
  • No dictionary word restrictions - check against breach databases instead
  • Support passphrases - multiple words are stronger than complex single words
  • Mandatory 2FA for high-value accounts
Variant 2: Password Management System
You are designing a password management system. The rules are: the password must not contain any English words, must be between 8 and 16 characters, and must contain at least one uppercase letter, one lowercase letter, and one symbol. Do you see any issues with this? What would you change to improve it?

🤔 Clarifying Questions:

  • Are we storing passwords for users or generating them?
  • What's the target security level - enterprise or consumer?
  • Do we need to support password sharing between users?
  • What's the master password policy?
  • Are there integration requirements with other systems?

🚨 Issues with Current Rules:

  • Length Limitation: 16 char max is too restrictive for strong passwords
  • Composition Rules: Rigid requirements reduce actual entropy
  • English-Only Focus: Doesn't consider other languages
  • User Experience: Difficult to create compliant passwords
  • False Security: Complex rules don't guarantee strong passwords

✅ Enhanced Password Management System:

  • Password Generation: Automatic generation of high-entropy passwords
  • Flexible Length: Support passwords up to 256 characters
  • Entropy Measurement: Real-time entropy calculation
  • Zero-Knowledge Architecture: Client-side encryption/decryption
  • Secure Storage: End-to-end encryption with user master key

📋 System Requirements:

Functional: Store, generate, and auto-fill passwords

Non-Functional: Zero-knowledge security, high availability, cross-platform

Security Model: Even the service provider cannot access user passwords

🛠️ Architecture Components:

Client Apps: Browser extensions, mobile apps with local crypto

Encryption: AES-256 with PBKDF2/scrypt for key derivation

Backend: Store encrypted blobs, no access to plaintext

Sync: Secure multi-device synchronization

Backup: Encrypted backup with recovery keys

Core Principle: Generate strong passwords automatically, don't force users to create weak ones manually
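A minimal sketch of automatic generation with Python's standard `secrets` module; the 20-character default mirrors the policy below, and the symbol set is an assumption:

```python
import secrets
import string

def generate_password(length: int = 20) -> str:
    """Generate a high-entropy random password from letters, digits, and symbols."""
    alphabet = string.ascii_letters + string.digits + "!@#$%^&*()-_=+"
    return "".join(secrets.choice(alphabet) for _ in range(length))

print(generate_password())   # ~6.2 bits of entropy per character with this 76-symbol alphabet
```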

⚖️ Improved Password Policy:

  • Auto-Generate: Default to 20+ character random passwords
  • Flexible Rules: Allow sites to specify their own requirements
  • Entropy-Based: Measure actual password strength
  • User Choice: Support both generated and custom passwords
  • Secure by Default: Strongest settings as default
💾 Data Storage & Cost Estimation
Variant 1: Logging System Storage
We have a logging system up and running. How would you estimate the data storage cost for the next year?

🤔 Clarifying Questions:

  • What's the current daily/monthly log volume?
  • What types of logs - application, access, error, audit?
  • What's the retention policy and compliance requirements?
  • Are logs structured (JSON) or unstructured (text)?
  • What's the expected business growth rate?
  • Current storage infrastructure - cloud or on-premise?

🎯 Estimation Approach:

Data Collection: Establish baseline metrics

Growth Modeling: Account for business and technical growth

Storage Tiers: Consider different storage classes

✅ Estimation Framework:

  • Current Metrics:
    • Average log entry size (e.g., 200 bytes)
    • Logs per second per server (e.g., 50/sec)
    • Number of servers (e.g., 100)
    • Daily volume = 100 servers × 50 logs/sec × 86,400 sec × 200 bytes ≈ 86.4 GB/day
  • Growth Factors:
    • Business growth: 50% more users/year
    • Infrastructure growth: 30% more servers
    • Feature growth: 20% more logging
  • Storage Tiers:
    • Hot (0-7 days): Frequent access, SSD storage
    • Warm (7-90 days): Occasional access, standard storage
    • Cold (90+ days): Archive, glacier storage

📋 Storage Requirements:

Functional: Reliable storage, fast search, compliance

Non-Functional: Cost-effective, scalable, durable

Write-Heavy System: Optimize for high write throughput

🛠️ Storage Architecture:

Hot Tier: Elasticsearch cluster for search

Warm Tier: S3 Standard for occasional access

Cold Tier: S3 Glacier for long-term retention

Processing: Kafka for real-time ingestion

Compression: Gzip compression (3-5x reduction)

Sample Calculation:
Current: 86.4 GB/day × 365 days ≈ 31.5 TB/year
With growth: 31.5 TB × 1.5 × 1.3 × 1.2 ≈ 73.7 TB/year
With compression: 73.7 TB ÷ 4 ≈ 18.4 TB actual storage
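The same arithmetic as a short script, so the assumptions (100 servers, 50 logs/sec, 200-byte entries, the growth factors, and 4x compression) are explicit and easy to adjust:

```python
servers, logs_per_sec, bytes_per_log = 100, 50, 200

daily_gb = servers * logs_per_sec * 86_400 * bytes_per_log / 1e9   # ~86.4 GB/day
yearly_tb = daily_gb * 365 / 1e3                                   # ~31.5 TB/year
with_growth = yearly_tb * 1.5 * 1.3 * 1.2                          # ~73.7 TB/year
stored_tb = with_growth / 4                                        # ~18.4 TB after compression

print(f"{daily_gb:.1f} GB/day, {with_growth:.1f} TB/yr raw, {stored_tb:.1f} TB stored")
```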

⚖️ Cost Estimation (AWS pricing):

  • Hot Storage (7 days): ~350 GB × $0.23/GB ≈ $81/month
  • Warm Storage (83 days): ~4.2 TB × $0.023/GB ≈ $96/month
  • Cold Storage (275 days): ~13.9 TB × $0.004/GB ≈ $55/month
  • Total Monthly: ~$232/month ≈ $2,800/year
  • With 20% buffer: ~$3,350/year
Variant 2: Photo Storage Service
We are running a simple photo storage and sharing service. People upload their photos to our servers and then give links to other users who can then view them. Instead of using a cloud service, we have our own server farms. You've been tasked with creating an estimate of the storage required over the coming year and the cost of that storage. What information would you need and what factors would you consider as you generate this estimate?

🤔 Information Needed:

  • User Metrics: Current users, growth rate, user behavior patterns
  • Photo Characteristics: Average file size, formats, resolution
  • Usage Patterns: Upload frequency, sharing ratio, retention
  • Business Model: Storage limits, premium features, monetization
  • Technical Constraints: Bandwidth, server capacity, budget

🎯 Key Factors to Consider:

  • Growth Patterns: User acquisition, seasonal variations
  • Storage Overhead: Metadata, thumbnails, redundancy
  • Hardware Lifecycle: Replacement cycles, capacity planning
  • Operational Costs: Power, cooling, maintenance

✅ Estimation Model:

  • Baseline Metrics:
    • Current users: 100K
    • Average photo size: 3MB
    • Photos per user per month: 50
    • Monthly upload volume: 100K × 50 × 3MB = 15TB
  • Growth Projections:
    • User growth: 20% monthly
    • Usage growth: 10% monthly (more engagement)
    • Photo size growth: 5% monthly (better cameras)
  • Storage Overhead:
    • Thumbnails: 50KB per photo
    • Metadata: 1KB per photo
    • Redundancy: 3x replication
    • Total multiplier: 3.5x

📋 System Requirements:

Functional: Store photos, generate thumbnails, share links

Non-Functional: High availability, fast access, cost-effective

Read-Heavy System: Optimize for fast photo retrieval

🛠️ Storage Infrastructure:

Hot Storage: SSDs for recent photos (30 days)

Warm Storage: HDDs for older photos (6 months)

Cold Storage: Tape backup for long-term retention

CDN: Edge caching for popular photos

Compression: Lossless compression for archival

Year-End Projection:
Month 1: 15TB → Month 12: 15TB × (1.2 × 1.1 × 1.05)^11 ≈ 544TB
Total yearly upload volume: ~1.9PB
With overhead: 1.9PB × 3.5 ≈ 6.7PB
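The compound-growth projection as a short script; the monthly multipliers and the 3.5x overhead factor are the assumptions stated above:

```python
monthly_growth = 1.20 * 1.10 * 1.05      # users x engagement x photo size ≈ 1.386
base_tb = 15                             # month-1 upload volume in TB

monthly = [base_tb * monthly_growth ** m for m in range(12)]
print(f"Month 12 uploads: {monthly[-1]:.0f} TB")                   # ~544 TB
print(f"Year total: {sum(monthly) / 1000:.1f} PB")                 # ~1.9 PB
print(f"With 3.5x overhead: {sum(monthly) * 3.5 / 1000:.1f} PB")   # ~6.7 PB
```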

⚖️ Cost Breakdown:

  • Hardware: 6.7PB × $50/TB ≈ $335,000
  • Infrastructure: Servers, networking = $150,000
  • Operational: Power, cooling, maintenance = $180,000/year
  • Personnel: DevOps, support = $300,000/year
  • Total: ~$965,000 first year
🎮 Arcade/Gaming Systems
Variant 1: Card Sensor System
We have a card for arcade games that users need to recharge to use, or they can link it to a credit card. A sensor on the machine reads the card and allows the user to play if the card has enough money. We will add this sensor system to 125,000 machines. What concerns do you have with this approach?

🤔 Clarifying Questions:

  • What's the connectivity model - online or offline capable?
  • How do we handle network failures and payments?
  • What's the fraud prevention and security model?
  • Are cards shared between different arcade locations?
  • What's the expected transaction volume per machine?

🚨 Major Concerns:

  • Network Reliability: 125K machines need constant connectivity
  • Payment Processing: Credit card transactions at scale
  • Fraud Prevention: Card cloning, charge-back protection
  • Offline Capability: Machines must work during network outages
  • Data Synchronization: Balance updates across all machines
  • Security: PCI compliance, encryption, tamper resistance

✅ Sample Reasonings & Mitigations:

  • Hybrid Architecture: Online primary, offline backup mode
  • Local Balance Cache: Store encrypted balance on card
  • Batch Processing: Queue transactions during outages
  • Distributed Validation: Machine-to-machine verification
  • Fraud Detection: Real-time anomaly detection
  • Secure Hardware: HSM for encryption keys

📋 System Requirements:

Functional: Process payments, track balances, prevent fraud

Non-Functional: 99.9% uptime, PCI compliance, low latency

Scale: 125K machines, millions of transactions daily

🛠️ Architecture:

Central System: Microservices for payment processing

Edge Computing: Local servers in each arcade

Card Technology: NFC with secure element

Connectivity: 4G/5G with WiFi backup

Database: Distributed database with eventual consistency

Key Insight: Design for offline-first operation - machines must work even when disconnected
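A minimal sketch of that offline-first flow, assuming a balance cached locally (on the card or machine) and a local queue that is flushed to the central system when connectivity returns; all names and amounts are illustrative:

```python
import queue
import time

pending = queue.Queue()        # transactions waiting to sync to the central system

def try_play(card_balance_cents: int, price_cents: int, online: bool) -> tuple[bool, int]:
    """Authorize a game from the locally cached balance; defer settlement if offline."""
    if card_balance_cents < price_cents:
        return False, card_balance_cents
    new_balance = card_balance_cents - price_cents
    txn = {"amount": price_cents, "ts": time.time()}
    if online:
        pass                          # send txn to the payment service immediately
    else:
        pending.put(txn)              # settle later; the machine keeps working offline
    return True, new_balance

ok, balance = try_play(card_balance_cents=500, price_cents=200, online=False)
print(ok, balance, pending.qsize())   # True 300 1 -> play allowed, sync deferred
```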

⚖️ Implementation Strategy:

  • Pilot Program: Start with 1,000 machines to validate
  • Gradual Rollout: Phase deployment to manage risk
  • Fallback Plan: Maintain token-based system as backup
  • Security First: Implement end-to-end encryption
Variant 2: Tap Card Arcade
Tap card arcade

🤔 Clarifying Questions:

  • What's the specific problem - designing the system or troubleshooting?
  • How many locations and machines are involved?
  • What's the current payment flow and technology stack?
  • Are there specific performance or reliability issues?
  • What's the budget and timeline for implementation?

🎯 System Design Considerations:

User Experience: Fast, reliable tap-to-play experience

Technical Challenges: NFC reliability, payment processing, balance management

Business Requirements: Revenue tracking, fraud prevention, customer retention

✅ Tap Card System Design:

  • NFC Cards: Contactless payment with secure element
  • Dual Storage: Balance on card + central database
  • Instant Response: <200ms transaction time
  • Offline Mode: Local balance validation
  • Real-time Sync: Background synchronization

📋 System Requirements:

Functional: Tap-to-pay, balance management, game activation

Non-Functional: Sub-second response, 99.95% availability

Transaction-Heavy: Optimize for high-frequency small transactions

🛠️ Technical Stack:

Cards: NFC-enabled smart cards with secure element

Readers: NFC readers with tamper detection

Gateway: Local payment gateway per location

Backend: Cloud-based balance management system

Analytics: Real-time transaction monitoring

Design Principle: Optimize for user experience - every tap should result in immediate game activation

⚖️ Implementation Approach:

  • Card-First: Prioritize offline capability
  • Redundancy: Multiple validation methods
  • Monitoring: Real-time alerts for system issues
  • Scalability: Design for peak usage periods
🎥 Video Processing
Variant 1: Subtitle Generation Service
We are working on a service that generates subtitles for users' videos. This process starts a new thread for every video and is processor-intensive. Currently, this service runs as a single process on a machine. We've run into a bug where if the service is processing more than 10 videos at the same time, the service crashes the server, losing all requests currently being processed and affecting other processes on the machine. It may take a long time to find and fix this bug. What workarounds could we implement to continue running the service while we do?

🤔 Clarifying Questions:

  • What's the average video processing time?
  • What's the current request volume and expected growth?
  • Are there resource constraints - CPU, memory, or disk?
  • What's the acceptable processing delay for users?
  • Can we use multiple machines or must we work with current setup?

🚨 Current Problems:

  • Resource Exhaustion: Too many concurrent threads causing crashes
  • Work Loss: All progress lost when service crashes
  • System Impact: Crashes affect other processes on same machine
  • No Graceful Degradation: Hard limit causing complete failure
  • No Persistence: No way to resume interrupted work

✅ Immediate Workarounds:

  • Concurrency Limit: Implement a thread pool with a maximum of 8 threads (see the sketch after this list)
  • Request Queue: Queue incoming requests, process FIFO
  • Process Isolation: Run service in containerized environment
  • Checkpointing: Save progress periodically to resume work
  • Circuit Breaker: Stop accepting new requests when overloaded
  • Resource Monitoring: Monitor CPU/memory and throttle accordingly
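A minimal sketch of the first two workarounds (bounded worker pool plus a request queue) using Python's standard library; `generate_subtitles` is a placeholder for the real processing call, and the queue size is an assumption:

```python
from concurrent.futures import ThreadPoolExecutor
import queue
import threading

MAX_WORKERS = 8                      # stay below the ~10-video crash threshold
jobs = queue.Queue(maxsize=100)      # back-pressure: reject work instead of crashing
pool = ThreadPoolExecutor(max_workers=MAX_WORKERS)

def generate_subtitles(video_id: str) -> None:
    ...                              # placeholder for the CPU-intensive library call

def submit(video_id: str) -> bool:
    try:
        jobs.put_nowait(video_id)    # acts as a simple circuit breaker when full
        return True
    except queue.Full:
        return False                 # caller can retry later or return 429 to the client

def dispatcher() -> None:
    while True:
        pool.submit(generate_subtitles, jobs.get())

threading.Thread(target=dispatcher, daemon=True).start()
```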

📋 System Requirements:

Functional: Generate subtitles, handle concurrent requests

Non-Functional: Fault tolerance, no work loss, system stability

CPU-Intensive: Optimize for computational efficiency

Video Processing - Subtitle Generation Service
Clarifying Questions
  • What's the average video length and processing time?
  • What's the current throughput requirement (videos per hour)?
  • Are there any SLA requirements for subtitle generation?
  • Can we process videos asynchronously or does it need to be real-time?
  • What's the server specification and resource utilization?
  • Are there any budget constraints for additional infrastructure?
Main Sample Reasoning

Immediate Workarounds:

  1. Resource Isolation with Containers: Run each video processing task in a separate Docker container with resource limits (CPU, memory). This prevents one task from crashing the entire system.
  2. Process Queue with Circuit Breaker: Implement a queue system (Redis/RabbitMQ) that limits concurrent processing to 8-9 videos (below the crash threshold). Use circuit breaker pattern to prevent system overload.
  3. Horizontal Scaling: Deploy multiple instances of the service across different machines, each handling a subset of the load.
  4. Graceful Degradation: Implement health checks and automatic service restart mechanisms to minimize downtime.

Architecture:

```
Load Balancer → Queue System → Worker Nodes (Containerized)
      ↓              ↓                    ↓
 API Gateway  →  Redis Queue  →  [Worker1, Worker2, Worker3]
                                          ↓
                                Database (Job Status)
```
NFRs & FRs

Functional Requirements:

  • Generate subtitles for uploaded videos
  • Support multiple video formats
  • Provide job status tracking
  • Handle video upload and subtitle download

Non-Functional Requirements:

  • Reliability: 99.9% uptime, fault tolerance
  • Scalability: Handle varying loads, auto-scaling
  • Performance: Process within reasonable time limits
  • Availability: Service should remain available during peak loads
  • Consistency: Eventual consistency for job status
Trade-offs & Considerations
| Approach | Pros | Cons |
|---|---|---|
| Containerization | Resource isolation, easy deployment | Overhead, complexity |
| Queue System | Controlled processing, fault tolerance | Additional infrastructure, latency |
| Horizontal Scaling | Higher throughput, redundancy | Higher costs, coordination complexity |
Vending Machine - Internet-Connected Infrastructure
You're working on infrastructure for internet-connected vending machines. The plan is to install around 188,888 of these vending machines in the coming year, in major cities around the world. These machines will connect to the internet through a cellular network. Each machine will connect to a central server at midnight to report remaining stock and any maintenance issues like coin jams or stuck items. These machine status updates will be stored in a database, and a batch job will run at 1 AM to schedule the restocking and maintenance of machines. Are there any problems with the above design? How would you solve them?
Clarifying Questions
  • What's the data size for each machine's status report?
  • Are there any real-time requirements for urgent maintenance issues?
  • What's the acceptable downtime for maintenance scheduling?
  • Are there regional compliance requirements for data storage?
  • What's the budget for cellular connectivity and server infrastructure?
  • Do machines need to support offline operation?
Problems Identified

Major Issues:

  1. Thundering Herd Problem: All 188,888 machines connecting simultaneously at midnight will overwhelm the server
  2. Single Point of Failure: Central server failure affects all machines globally
  3. Time Zone Complexity: "Midnight" varies across global locations
  4. Network Congestion: Cellular networks may struggle with simultaneous connections
  5. Database Bottleneck: Batch processing of ~189K records at once
  6. Maintenance Scheduling Delay: 1-hour gap between reporting and scheduling
Proposed Sample Reasonings

Architecture Improvements:

  1. Distributed Regional Architecture:
    • Regional data centers with local servers
    • Machines connect to nearest regional server
    • Data replication between regions
  2. Staggered Reporting Schedule:
    • Distribute connections across 2-3 hour window
    • Use a machine ID hash to determine the reporting slot (see the sketch after this list)
    • Implement exponential backoff for failed connections
  3. Event-Driven Processing:
    • Real-time processing for critical issues
    • Stream processing for continuous updates
    • Batch processing for bulk operations
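A minimal sketch of the staggered schedule from item 2: each machine hashes its ID into a reporting slot spread across a local-time window (the window length and slot size are assumptions):

```python
import hashlib

WINDOW_MINUTES = 180          # spread check-ins over a 3-hour window
SLOT_MINUTES = 5              # machines report in 5-minute buckets

def reporting_slot(machine_id: str) -> int:
    """Deterministically map a machine ID to a minute offset within the window."""
    h = int(hashlib.sha256(machine_id.encode()).hexdigest(), 16)
    return (h % (WINDOW_MINUTES // SLOT_MINUTES)) * SLOT_MINUTES

print(reporting_slot("VM-0001"))   # e.g. 85 -> report 85 minutes after local midnight
```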
```
   Machines → Regional Load Balancer → Regional Server Cluster
       ↓                ↓                        ↓
Message Queue  →  Stream Processor   →   Database Cluster
       ↓                ↓                        ↓
Priority Queue → Maintenance Scheduler → Notification Service
```
NFRs & FRs

Functional Requirements:

  • Collect machine status reports (inventory, maintenance issues)
  • Schedule restocking and maintenance
  • Handle global deployment across time zones
  • Support offline operation and data synchronization

Non-Functional Requirements:

  • Scalability: Handle 188K+ machines with linear growth
  • Availability: 99.9% uptime, regional redundancy
  • Performance: Handle concurrent connections efficiently
  • Consistency: Eventual consistency for non-critical updates
  • Reliability: Message delivery guarantees, retry mechanisms
Trade-offs & Considerations
| Sample Reasoning | Pros | Cons |
|---|---|---|
| Regional Architecture | Reduced latency, fault isolation | Higher complexity, data consistency challenges |
| Staggered Reporting | Smooth load distribution | Delayed insights, implementation complexity |
| Stream Processing | Real-time insights, better resource utilization | Higher infrastructure costs |
Mobile Game Analysis - Go Game Analysis
We are working on a mobile app for the board game Go. We'd like to add a feature where the computer will analyze a complete game. The analysis looks at each position from the game and provides suggested moves to help improve our users' play. We've found a library we can use to do this analysis. It takes an average of a minute on a modern desktop computer to analyze an entire game. An average game consists of about 200 moves. We are considering two approaches: 1) running this analysis on the phone itself, and 2) sending the game to a server farm for analysis that will be returned to the user. What are some advantages or disadvantages of each approach?
Clarifying Questions
  • What's the target device specification range (low-end to high-end)?
  • How frequently do users request game analysis?
  • Are there any real-time requirements for analysis results?
  • What's the acceptable battery drain for mobile processing?
  • Are there any privacy concerns with sending game data to servers?
  • What's the expected user base and concurrent analysis requests?
Comparative Analysis
| Aspect | Mobile Processing | Server Processing |
|---|---|---|
| Performance | Varies by device (2-10 minutes) | Consistent (30-60 seconds) |
| Battery Impact | High CPU usage, significant drain | Minimal, just network activity |
| Network Dependency | None required | Requires stable internet |
| Privacy | Complete data privacy | Data transmitted to servers |
| Cost | No ongoing costs | Server infrastructure costs |
| Scalability | Scales with user devices | Requires capacity planning |
Recommended Hybrid Approach

Adaptive Processing Strategy:

  1. Device Classification:
    • High-end devices: Mobile processing with user choice
    • Mid-range devices: Server processing by default
    • Low-end devices: Server processing only
  2. Progressive Analysis:
    • Quick analysis (key moves) on mobile
    • Detailed analysis on server
    • Cached results for common patterns
  3. Intelligent Queueing:
    • Priority queues for paying users
    • Background processing for free users
    • Load balancing across server clusters
NFRs & FRs

Functional Requirements:

  • Analyze complete Go games (200+ moves)
  • Provide move suggestions and position evaluation
  • Support both mobile and server processing
  • Handle various game formats and rule sets

Non-Functional Requirements:

  • Performance: Analysis completion within 2-5 minutes
  • Scalability: Handle thousands of concurrent analyses
  • Availability: 99.5% uptime for server processing
  • Usability: Seamless user experience across devices
  • Efficiency: Minimal battery drain on mobile devices
Architecture Considerations
```
 Mobile App  →  Load Balancer  →  Analysis Service Cluster
      ↓               ↓                     ↓
Device Check →  Queue Manager  →      [Worker Nodes]
      ↓               ↓                     ↓
Local/Remote →  Priority Queue →       Result Cache
  Decision            ↓                     ↓
                  Database     →  Push Notification Service
```

Key Components:

  • Analysis Engine: Containerized Go analysis library
  • Queue System: Redis/RabbitMQ for job management
  • Caching Layer: Redis for common game patterns
  • Notification Service: Push notifications for completion
Photo Sharing - Alphabetical Username Sharding
We are running a simple photo storage and sharing service. People upload their photos to our servers and then give links to other users who can then view them. We're trying to figure out how to split the photos and associated data evenly onto multiple machines, especially as we get new users. We've decided to shard the photos evenly alphabetically by username. For example, if we had 26 servers, all the usernames starting with 'a' would be on server 1, usernames starting with 'b' would be on server 2, and so on. We have created a scheme like this that will work for any number of servers. Are there any problems with this design? How would you solve them?
Clarifying Questions
  • What's the expected user growth rate and photo upload frequency?
  • Are there any geographic distribution requirements?
  • What's the average photo size and storage requirements?
  • Are there any requirements for data replication or backups?
  • What's the read vs write ratio for the application?
  • Are there any compliance requirements for data location?
Problems with Alphabetical Sharding

Major Issues:

  1. Uneven Distribution: Names starting with certain letters are more common (e.g., 'S', 'M', 'C') leading to hotspots
  2. Predictable Patterns: Usernames often follow patterns (company names, common words) causing skewed distribution
  3. Limited Scalability: Adding new servers requires resharding significant portions of data
  4. Cultural Bias: Distribution varies significantly across languages and cultures
  5. Gaming Vulnerability: Users could exploit the system by choosing usernames strategically
Better Sharding Strategies

Recommended Approaches:

  1. Consistent Hashing:
    • Hash username to get uniform distribution
    • Easy to add/remove servers with minimal data movement
    • Use SHA-256 or similar for even distribution
  2. Range-based Sharding with Monitoring:
    • Monitor shard sizes and rebalance when needed
    • Rebalance automatically by moving or splitting ranges between servers
    • Implement shard splitting when threshold is reached
  3. Hybrid Approach:
    • Use consistent hashing for user data
    • Separate sharding strategy for photos (by ID or date)
    • Implement cross-references for data location
```
  Username   →  Hash Function  →  Shard Selection
      ↓               ↓                  ↓
"john_doe"   →     SHA-256     →  position on hash ring
      ↓               ↓                  ↓
Load Balancer → Consistent Hash Ring → Target Server
```

(With a consistent hash ring, the hash maps to a position on the ring rather than `hash % num_shards`, so adding or removing a server moves only a small fraction of the keys.)
NFRs & FRs

Functional Requirements:

  • Store and retrieve photos for users
  • Generate shareable links for photos
  • Support user authentication and authorization
  • Handle photo metadata and organization

Non-Functional Requirements:

  • Scalability: Handle millions of users and photos
  • Availability: 99.9% uptime for photo access
  • Performance: Fast photo upload/download times
  • Consistency: Eventual consistency for non-critical metadata
  • Durability: Photos should never be lost
Architecture Design
```
 Client  →   CDN   →  Load Balancer  →  API Gateway
    ↓         ↓             ↓                ↓
 Upload  →  Cache  →  Auth Service   →  User Service
    ↓         ↓             ↓                ↓
Storage  →  Metadata DB  →  Consistent Hash Ring
    ↓         ↓             ↓                ↓
 [Photo Storage Shards]  →  [User Data Shards]
```

Key Components:

  • Consistent Hash Ring: Even distribution and easy scaling
  • Photo Storage: Separate sharding by photo ID or date
  • Metadata Database: Stores user-photo relationships
  • CDN: Global distribution for faster access
  • Replication: Multiple copies for durability
Crossword Puzzle - Hints Storage Strategy
Given a crossword puzzle gaming application that gives hints to users, what are the advantages and disadvantages of the following two approaches: fetching hints from the server, or preloading hints on the device?
Clarifying Questions
  • What's the total size of all hints data?
  • How frequently are hints updated or new puzzles added?
  • Are there different difficulty levels requiring different hint sets?
  • What's the target device storage capacity?
  • Are there offline usage requirements?
  • What's the user engagement pattern (daily vs occasional)?
Comparative Analysis
| Aspect | Server Fetching | Device Preloading |
|---|---|---|
| Storage | Minimal device storage | Significant storage required |
| Network | Requires internet connection | Only for updates |
| Performance | Network latency for each request | Instant access |
| Updates | Real-time updates possible | Requires app updates |
| Data Freshness | Always up-to-date | May be stale |
| Offline Support | Not available | Full offline capability |
| Bandwidth Usage | Continuous small requests | Large initial download |
| Server Load | High, scales with users | Low, mainly for updates |
Recommended Hybrid Approach

Intelligent Caching Strategy:

  1. Tiered Storage:
    • Core hints (most common) preloaded on device
    • Extended hints fetched on-demand
    • Personalized hints based on user behavior
  2. Predictive Caching:
    • Cache hints for puzzles likely to be played
    • Background sync during Wi-Fi connection
    • User preference-based caching
  3. Adaptive Strategy:
    • Monitor device storage and network conditions
    • Adjust caching strategy based on usage patterns
    • Implement cache expiration and cleanup
NFRs & FRs

Functional Requirements:

  • Provide hints for crossword puzzles
  • Support offline usage
  • Handle frequent updates and new puzzles
  • Optimize for various device storage capacities

Non-Functional Requirements:

  • Performance: Instant hint access for preloaded hints
  • Scalability: Handle millions of users and puzzles
  • Availability: 99.9% uptime for server fetching
  • Usability: Seamless user experience across devices
  • Efficiency: Minimal device storage usage
Architecture Design
```
  App Launch   →  Check Cache  →  Fetch Missing Hints
       ↓               ↓                  ↓
Device Storage →   Local DB    →     Server API
       ↓               ↓                  ↓
Cache Manager  →  Sync Service →  Background Updates
```
NFRs & FRs

Functional Requirements:

  • Provide hints for crossword puzzles
  • Support multiple difficulty levels
  • Handle hint categories and tags
  • Support offline gameplay

Non-Functional Requirements:

  • Performance: Instant hint delivery (< 100ms)
  • Availability: 99.5% uptime for hint service
  • Scalability: Support millions of concurrent users
  • Efficiency: Minimal bandwidth and storage usage
  • Reliability: Consistent hint quality and accuracy
Implementation Strategy

Cache Management:

  • LRU Cache: Keep the most recently used hints (see the sketch after this list)
  • Size Limits: Configurable cache size based on device
  • Background Sync: Update cache during idle time
  • Compression: Reduce storage footprint
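A minimal sketch of the LRU cache using `collections.OrderedDict`; the capacity and the `fetch_hint_from_server` fallback are illustrative placeholders:

```python
from collections import OrderedDict

class HintCache:
    """Small LRU cache: keeps the most recently used hints on the device."""

    def __init__(self, capacity: int = 500):
        self.capacity = capacity
        self.items = OrderedDict()          # puzzle/clue id -> hint text

    def get(self, hint_id: str) -> str:
        if hint_id in self.items:
            self.items.move_to_end(hint_id)         # mark as recently used
            return self.items[hint_id]
        hint = fetch_hint_from_server(hint_id)      # cache miss: go to the network
        self.put(hint_id, hint)
        return hint

    def put(self, hint_id: str, hint: str) -> None:
        self.items[hint_id] = hint
        self.items.move_to_end(hint_id)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)          # evict the least recently used entry

def fetch_hint_from_server(hint_id: str) -> str:
    return f"hint for {hint_id}"                    # placeholder for the real API call
```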

Network Optimization:

  • Batch Requests: Fetch multiple hints together
  • Delta Updates: Only sync changed hints
  • CDN Integration: Global distribution for faster access
  • Fallback Mechanisms: Graceful degradation without hints
XML Processing - Large File Processing
A huge XML file with sales data needs to be processed. It is huge enough that it cannot be loaded at once given the RAM limitation of the local system. How can we process it?
Clarifying Questions
  • What's the approximate file size and RAM limitations?
  • What type of processing is required (aggregation, transformation, filtering)?
  • Are there any real-time processing requirements?
  • Is the XML structure known and consistent?
  • Can the file be preprocessed or split beforehand?
  • Are there any accuracy requirements (100% vs approximate)?
Processing Strategies

Streaming-Based Approaches:

  1. SAX Parser (Event-driven):
    • Process XML elements as they are read
    • Memory usage remains constant
    • Suitable for sequential processing
  2. StAX Parser (Pull-based):
    • More control over parsing process
    • Can pause/resume processing
    • Better for complex logic
  3. Custom Chunking:
    • Split file into smaller chunks
    • Process each chunk independently
    • Merge results at the end
```
Large XML File →  Stream Parser  →    Processing Logic
      ↓                ↓                     ↓
 File Reader   →  Event Handler  →  Aggregator/Transformer
      ↓                ↓                     ↓
Memory Buffer  →  Partial Results →     Final Output
```
Detailed Implementation

SAX Parser Implementation:

```python
import xml.sax

class SalesDataHandler(xml.sax.ContentHandler):
    """Streams <sale> records one at a time so memory use stays constant."""

    def __init__(self, aggregator):
        super().__init__()
        self.aggregator = aggregator
        self.current = None      # fields of the <sale> element currently being read
        self.field = None

    def startElement(self, name, attrs):
        if name == "sale":
            self.current = {}
        elif self.current is not None:
            self.field = name

    def characters(self, content):
        if self.current is not None and self.field:
            self.current[self.field] = self.current.get(self.field, "") + content

    def endElement(self, name):
        if name == "sale":
            self.aggregator.process(self.current)   # hand the completed record downstream
            self.current = None                     # free memory before the next record
        else:
            self.field = None

# xml.sax.parse("sales.xml", SalesDataHandler(aggregator))  # streams the file end to end
```

Alternative Approaches:

  • MapReduce: For distributed processing across multiple machines
  • Database Streaming: Load data directly into database using bulk operations
  • External Sorting: For operations requiring sorted data
NFRs & FRs

Functional Requirements:

  • Process large XML files without loading entirely into memory
  • Support various processing operations (aggregation, filtering, transformation)
  • Handle malformed XML gracefully
  • Provide progress tracking and resumability

Non-Functional Requirements:

  • Memory Efficiency: Constant memory usage regardless of file size
  • Performance: Process files in reasonable time
  • Scalability: Handle files of any size
  • Reliability: Recover from processing errors
  • Accuracy: Maintain data integrity during processing
Performance Optimizations
| Technique | Memory Usage | Processing Speed | Complexity |
|---|---|---|---|
| SAX Parser | Very Low | Fast | Medium |
| StAX Parser | Low | Medium | High |
| File Chunking | Medium | Variable | Low |
| Parallel Processing | Medium | Very Fast | High |
URL Processing - Smart Engine Service Budget
Given a smart engine service that takes URLs from users and extracts useful data from them, you have to plan the budget for this project. What would you take into consideration? The expectation is essentially: what would you ask the client in order to determine the capacity-estimation parameters?
Clarifying Questions
  • What's the expected number of URLs processed per day/hour?
  • What's the average processing time per URL?
  • What type of data extraction is required (text, images, metadata)?
  • Are there any SLA requirements for processing time?
  • What's the expected user base and growth trajectory?
  • Are there any compliance or legal requirements?
  • What's the acceptable downtime and error rates?
  • Are there any geographic distribution requirements?
Capacity Estimation Parameters

Core Metrics to Gather:

  1. Traffic Patterns:
    • Peak requests per second (RPS)
    • Average requests per day
    • Seasonal variations and growth projections
    • Geographic distribution of users
  2. Processing Requirements:
    • Average URL response time and size
    • Processing complexity (CPU, memory, I/O intensive)
    • Storage requirements for processed data
    • Caching potential and hit rates
  3. Quality Requirements:
    • Availability SLA (99.9%, 99.99%)
    • Response time requirements
    • Error tolerance levels
    • Data retention policies
Budget Components

Infrastructure Costs:

| Component | Cost Factors | Estimation Method |
|---|---|---|
| Compute Resources | CPU, Memory, Number of instances | RPS × Processing time × Resource requirements |
| Storage | Data size, Retention period, Replication | Daily data × Retention days × Replication factor |
| Network | Bandwidth, CDN, Data transfer | Request size × RPS × Geographic distribution |
| Database | Read/Write operations, Storage | Query complexity × Transaction volume |
| Monitoring | Metrics, Logging, Alerting | 5-10% of total infrastructure cost |
Sample Calculation

Example Scenario:

```
Assumptions:
- 1M URLs per day (average)
- 10 seconds average processing time per URL
- Peak traffic: 3x average (during business hours)
- 99.9% availability requirement
- 30-day data retention

Calculations:
- Peak RPS: 1M / (24 * 3600) * 3 = ~35 RPS
- Concurrent processing: 35 * 10 = 350 concurrent jobs
- Server capacity: 350 jobs / 10 jobs per server = 35 servers
- With redundancy (2x): 70 servers
- Storage: 1M URLs * 50KB per result * 30 days = 1.5TB
- Network: 1M * 100KB average page size = 100GB/day
```

Cost Breakdown (Monthly):

  • Compute: 70 servers × $100/month = $7,000
  • Storage: 1.5TB × $50/TB = $75
  • Network: 3TB × $20/TB = $60
  • Database: $500 (managed service)
  • Monitoring: $400
  • Total: ~$8,000/month
NFRs & FRs

Functional Requirements:

  • Accept URLs from users for processing
  • Extract useful data from web pages
  • Store and provide processed results
  • Handle various content types and formats

Non-Functional Requirements:

  • Scalability: Handle varying loads efficiently
  • Performance: Process URLs within acceptable timeframes
  • Availability: Meet SLA requirements
  • Cost Efficiency: Optimize resource utilization
  • Reliability: Consistent service quality
Cost Optimization Strategies

Optimization Techniques:

  • Auto-scaling: Scale resources based on demand
  • Caching: Cache frequently requested URLs
  • Queue Management: Batch processing during off-peak hours
  • Tiered Processing: Different service levels for different users
  • Reserved Instances: Long-term commitments for cost savings
  • Spot Instances: Use cheaper compute for non-critical workloads
Social Media Scaling - International Expansion
A social media app is expanding from US to international regions. What are the things to keep in mind?
Clarifying Questions
  • Which regions are being targeted for expansion?
  • What's the current user base and infrastructure capacity?
  • Are there any specific features popular in target regions?
  • What's the budget and timeline for expansion?
  • Are there any regulatory compliance requirements?
  • What's the expected user growth in new regions?
Technical Considerations

Infrastructure Scaling:

  1. Geographic Distribution:
    • Deploy regional data centers for reduced latency
    • Implement CDN for static content delivery
    • Use edge computing for real-time features
  2. Data Management:
    • Data residency requirements (GDPR, local laws)
    • Cross-region replication strategies
    • Database sharding by geographic regions
  3. Network Optimization:
    • Optimize for different network conditions
    • Implement adaptive bitrate for media content
    • Handle varying connectivity patterns
Localization & Cultural Adaptation

Key Areas:

  1. Language Support:
    • Multi-language UI and content
    • Right-to-left language support
    • Character encoding for different scripts
  2. Cultural Considerations:
    • Local content moderation policies
    • Regional social norms and preferences
    • Local payment methods and currencies
  3. Regulatory Compliance:
    • Data privacy laws (GDPR, CCPA equivalents)
    • Content regulation requirements
    • Local business registration and taxes
Architecture Considerations
```
Global Load Balancer  →  Regional Clusters
        ↓                       ↓
   DNS Routing    →   [US, EU, APAC, LATAM]
        ↓                       ↓
   CDN Network    →    Regional Services
        ↓                       ↓
Content Delivery  →  User Data (Localized)
        ↓                       ↓
  Media Storage   →    Compliance Layer
```


Implementation Strategy

Phased Rollout:

  1. Phase 1: Infrastructure setup and basic localization
  2. Phase 2: Regional beta testing and compliance
  3. Phase 3: Full launch with marketing and support
  4. Phase 4: Feature optimization and local partnerships

📱 Social Media Scaling
Variant 1: College App International Expansion
Given a social media app for college students that is running successfully in the US, how would you scale it for a worldwide release?

🤔 Clarifying Questions:

  • What's the current user base and DAU in the US?
  • Which regions are we targeting first - Europe, Asia, or global?
  • Are there region-specific features needed (language, cultural)?
  • What's the current architecture - monolith or microservices?
  • Do we need to comply with data residency laws (GDPR, etc.)?
  • What's the expected user growth rate per region?

🎯 Main Challenges:

  • Latency: Single US data center creates high latency for global users
  • Compliance: GDPR, data residency laws in different countries
  • Cultural Adaptation: Different social norms and college systems
  • Infrastructure: Need multi-region deployment
  • Content Moderation: Different languages and cultural contexts

✅ Scaling Sample Reasonings:

  • Geographic Distribution: Deploy to multiple AWS/GCP regions (US-East, EU-West, Asia-Pacific)
  • CDN Implementation: CloudFront/CloudFlare for static content delivery
  • Database Sharding: Shard by region/university to reduce cross-region queries
  • Microservices Architecture: Break down monolith for independent scaling
  • Localization Service: Separate service for multi-language support
  • Regional Load Balancers: Route users to nearest data center

📋 System Requirements:

Functional Requirements:

  • User registration and authentication
  • Post creation, sharing, and commenting
  • Real-time messaging and notifications
  • University-specific groups and events
  • Multi-language support

Non-Functional Requirements:

  • Availability: 99.9% uptime globally
  • Latency: <200ms for feed loading, <500ms for posting
  • Scalability: Support 10M+ users across regions
  • Consistency: Eventual consistency for feeds, strong for user data

🛠️ Recommended Architecture:

API Gateway: Amazon API Gateway with regional endpoints

Application Layer: Node.js/Python microservices in containers

Database: PostgreSQL with read replicas per region

Cache: Redis Cluster for session management

Message Queue: Apache Kafka for real-time features

File Storage: S3 with cross-region replication

⚖️ Implementation Strategy:

Phase 1: Deploy read replicas in target regions

Phase 2: Implement CDN and regional load balancing

Phase 3: Add localization and compliance features

Phase 4: Regional data residency and full autonomy

Variant 2: American University App International Release
Some students at an American university built an app and deployed it in the United States, and now they want to release it internationally. What concerns should they address before doing so, and what actions should they take?

🤔 Clarifying Questions:

  • What type of app - social, academic, or utility?
  • What's the current user base and server capacity?
  • Is it monetized? Revenue model?
  • What countries are targeted first?
  • What's the budget for international expansion?
  • Any partnerships with international universities?

🚨 Pre-Launch Concerns:

  • Legal Compliance: GDPR, CCPA, data protection laws
  • Infrastructure Readiness: Single US server won't scale globally
  • Cultural Adaptation: Different education systems and social norms
  • Monetization: Different payment methods and currencies
  • Support & Maintenance: 24/7 support across timezones
  • Security: Enhanced security for global threats

✅ Action Plan:

  • Legal Research: Hire legal counsel for each target country
  • Infrastructure Planning: Multi-region cloud deployment strategy
  • Localization: Translate app and adapt to local education systems
  • Payment Integration: Local payment gateways (Alipay, PayPal, etc.)
  • Security Audit: Penetration testing and compliance audits
  • Beta Testing: Pilot launch in 2-3 countries

📋 Technical Requirements:

Functional:

  • Multi-language support (UI/UX)
  • Local payment processing
  • Timezone-aware features
  • Regional content moderation

Non-Functional:

  • Performance: <3s app load time globally
  • Availability: 99.5% uptime minimum
  • Scalability: Handle 10x current traffic
  • Security: SOC 2 Type II compliance

🛠️ Technical Implementation:

Cloud Provider: AWS/GCP multi-region deployment

CDN: CloudFlare for global content delivery

Database: Regional read replicas with master-slave setup

Monitoring: DataDog/New Relic for global monitoring

Security: WAF, DDoS protection, encrypted communications

⚖️ Risk Mitigation:

Start Small: Launch in English-speaking countries first

Gradual Rollout: Country-by-country expansion

Legal First: Ensure compliance before technical deployment

Monitor & Adapt: Continuous feedback and iteration

Variant 3: US to International Expansion
A social media app is expanding from US to international regions. What are the things to keep in mind?

🤔 Clarifying Questions:

  • What's the current scale - users, requests/second, data volume?
  • Which regions are prioritized - developed or emerging markets?
  • What's the content type - text, images, video, or all?
  • Are there existing partnerships or local presence?
  • What's the expansion timeline and budget?

🌍 Key Considerations:

  • Infrastructure: Latency, bandwidth, server placement
  • Regulatory: Data protection, content laws, censorship
  • Cultural: Social norms, communication styles, features
  • Technical: Device types, network conditions, app store policies
  • Business: Monetization, local competition, partnerships

✅ Comprehensive Strategy:

  • Infrastructure Scaling:
    • Deploy edge servers in target regions
    • Implement global CDN strategy
    • Regional database placement
  • Compliance & Legal:
    • GDPR compliance for EU
    • Data residency requirements
    • Content moderation policies
  • Localization:
    • Multi-language support
    • Cultural adaptation of features
    • Local payment methods

📋 System Requirements:

This is a read-heavy system with high availability needs

Functional Requirements:

  • User feeds and timelines
  • Real-time messaging
  • Content sharing and discovery
  • Notifications and alerts

Non-Functional Requirements:

  • Latency: <200ms for feed loading globally
  • Availability: 99.99% uptime
  • Consistency: Eventual consistency for feeds
  • Scalability: Support 100M+ global users

🛠️ Global Architecture:

Load Balancing: GeoDNS routing to regional data centers

Application: Microservices with auto-scaling

Database: Sharded PostgreSQL with regional replicas

Cache: Multi-tier caching (Redis, CDN)

Message Queue: Apache Kafka for real-time features

Storage: S3 with cross-region replication
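
The routing layer above can be illustrated with a small sketch. The region names and endpoints below are hypothetical; a real deployment would resolve them through GeoDNS or service discovery rather than a hard-coded map. Writes go to the primary region while reads prefer the nearest regional replica.

```python
# Hypothetical regional endpoints; production systems would resolve these
# via GeoDNS or service discovery instead of a hard-coded map.
READ_REPLICAS = {
    "us": "db-read.us-east.example.com",
    "eu": "db-read.eu-west.example.com",
    "apac": "db-read.ap-south.example.com",
}
PRIMARY = "db-primary.us-east.example.com"
DEFAULT_REGION = "us"

def pick_endpoint(user_region, operation):
    """Send writes to the primary region; serve reads from the nearest replica."""
    if operation == "write":
        return PRIMARY
    return READ_REPLICAS.get(user_region, READ_REPLICAS[DEFAULT_REGION])

# Example usage
print(pick_endpoint("eu", "read"))   # db-read.eu-west.example.com
print(pick_endpoint("eu", "write"))  # db-primary.us-east.example.com
```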

⚖️ Implementation Phases:

Phase 1: CDN and edge caching deployment

Phase 2: Regional read replicas

Phase 3: Full regional data centers

Phase 4: Local compliance and partnerships

๐Ÿ† Leaderboard & Search
Top 100 Contestants Leaderboard
Efficient way to find top 100 contestants leaderboard and search contestant's progress by name.

🤔 Clarifying Questions:

  • How many total contestants are there?
  • How frequently do scores update?
  • Are there different categories or one global leaderboard?
  • What's the read vs write ratio?
  • Do we need real-time updates or near real-time is acceptable?
  • Are there historical leaderboards (daily, weekly, monthly)?

🎯 Core Challenges:

  • Performance: Fast retrieval of top 100 from millions of users
  • Consistency: Ensuring leaderboard accuracy with frequent updates
  • Search Efficiency: Quick name-based search across large dataset
  • Scalability: Handle increasing number of contestants
  • Real-time Updates: Show live score changes

✅ Efficient Sample Reasonings:

  • Leaderboard Storage:
    • Redis Sorted Sets for O(log N) operations
    • Maintain top 100 in memory cache
    • Use ZADD for score updates, ZREVRANGE for the top 100 in descending order (see the sketch after this list)
  • Search Implementation:
    • Elasticsearch for full-text name search
    • Trie data structure for autocomplete
    • Hash index on contestant names
  • Data Pipeline:
    • Apache Kafka for real-time score updates
    • Stream processing for leaderboard updates
    • Batch processing for historical data
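
A minimal sketch of the sorted-set approach above, assuming a locally reachable Redis instance and the redis-py client; the key name and contestant names are illustrative.

```python
import redis

# Assumes a locally reachable Redis instance; key and names are illustrative.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)
LEADERBOARD = "leaderboard:global"

def record_score(contestant, score):
    # ZADD inserts or updates the contestant's score in O(log N).
    r.zadd(LEADERBOARD, {contestant: score})

def top_100():
    # ZREVRANGE returns members ordered by score, highest first.
    return r.zrevrange(LEADERBOARD, 0, 99, withscores=True)

def rank_of(contestant):
    # ZREVRANK gives the 0-based position in descending score order.
    rank = r.zrevrank(LEADERBOARD, contestant)
    return None if rank is None else rank + 1

# Example usage
record_score("alice", 1520)
record_score("bob", 1490)
print(top_100()[:2])   # [('alice', 1520.0), ('bob', 1490.0)]
print(rank_of("bob"))  # 2
```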

📋 System Requirements:

This is a read-heavy system with occasional writes

Functional Requirements:

  • Display top 100 contestants
  • Search contestants by name
  • Show individual contestant progress
  • Real-time score updates

Non-Functional Requirements:

  • Latency: <50ms for leaderboard, <100ms for search
  • Throughput: 10K reads/sec, 1K writes/sec
  • Availability: 99.9% uptime
  • Consistency: Strong consistency for scores

🛠️ Technical Architecture:

Cache Layer: Redis Cluster with Sorted Sets

Database: PostgreSQL for persistent storage

Search Engine: Elasticsearch for name search

Message Queue: Apache Kafka for real-time updates

API Gateway: Rate limiting and caching

Monitoring: Prometheus for metrics

⚖️ Design Decisions:

Redis Sorted Sets: Chosen for O(log N) complexity and atomic operations

Elasticsearch: Superior search capabilities vs SQL LIKE queries

Kafka: Ensures reliable message delivery for score updates

Caching Strategy: Cache top 100 for 1 minute, invalidate on updates
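
To complement Elasticsearch for full name search, the autocomplete path mentioned under Search Implementation can be served from an in-memory trie. Below is a dependency-free sketch of that idea; the class and contestant names are illustrative.

```python
from collections import defaultdict

class _TrieNode:
    def __init__(self):
        self.children = defaultdict(_TrieNode)
        self.names = set()  # contestant names sharing this prefix

class NameIndex:
    """Illustrative in-memory prefix index for contestant-name autocomplete."""

    def __init__(self):
        self.root = _TrieNode()

    def add(self, name):
        node = self.root
        for ch in name.lower():
            node = node.children[ch]
            node.names.add(name)  # trades memory for O(len(prefix)) lookups

    def suggest(self, prefix, limit=10):
        node = self.root
        for ch in prefix.lower():
            if ch not in node.children:
                return []
            node = node.children[ch]
        return sorted(node.names)[:limit]

# Example usage
index = NameIndex()
for name in ("Alice", "Alicia", "Bob"):
    index.add(name)
print(index.suggest("ali"))  # ['Alice', 'Alicia']
```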

📱 Mobile App Media Content
Variant 1: Puzzle App Media Trade-offs
A mobile application for playing puzzles has some media content with it - audio, video and images. What are the trade-offs for fetching these media online or storing them offline in the app?

🤔 Clarifying Questions:

  • What's the total size of media content (MB/GB)?
  • How frequently is media content updated?
  • What's the target device storage capacity?
  • Are users primarily on WiFi or cellular?
  • Is the app used offline frequently?
  • What's the user retention rate?
  • Are there different puzzle categories with different media?

🎯 Main Issues & Analysis:

Core Challenge: Balancing user experience (fast loading, offline capability) with technical constraints (storage, bandwidth, app size)

Key Considerations: App store size limits, device storage, network reliability, content freshness

📱 Offline Storage (Bundled) - Pros:

  • Instant Loading: No network latency, immediate media access
  • Offline Capability: App works without internet connection
  • Consistent Experience: No loading delays or failed downloads
  • Reduced Server Load: No CDN costs for media delivery
  • Battery Efficiency: No network calls for media

📱 Offline Storage - Cons:

  • Large App Size: App store download barriers, storage requirements
  • Update Complexity: Entire app update needed for media changes
  • Device Storage: Consumes user device storage permanently
  • Initial Download Time: Longer first-time installation
  • Version Management: Difficult to A/B test media content

🌐 Online Fetching - Pros:

  • Smaller App Size: Faster downloads, lower storage requirements
  • Dynamic Content: Easy to update, personalize, and A/B test
  • Scalable Storage: Unlimited content without app size constraints
  • Analytics: Track media usage patterns and preferences
  • Personalization: Serve relevant media based on user behavior

🌐 Online Fetching - Cons:

  • Network Dependency: Requires internet connection
  • Loading Delays: Network latency affects user experience
  • Bandwidth Costs: Data usage for users, CDN costs for company
  • Inconsistent Performance: Depends on network quality
  • Battery Drain: Network operations consume more battery

📋 Requirements Analysis:

Functional Requirements:

  • Display high-quality images, videos, and audio
  • Support offline puzzle gameplay
  • Handle media updates and new content
  • Maintain consistent user experience

Non-Functional Requirements:

  • Performance: <2s loading time for media
  • Availability: 99.9% uptime for online content
  • Scalability: Handle 1M+ concurrent users
  • Storage: <100MB app size preferred
  • Bandwidth: Optimize for 3G/4G networks

🛠️ Hybrid Approach - Best of Both Worlds:

Core Assets Offline: Essential UI elements, basic sounds, loading screens

Dynamic Content Online: Puzzle-specific media, seasonal content

Intelligent Caching: Cache frequently accessed media locally

Progressive Loading: Load media as needed, not all at once

Compression: WebP for images, compressed audio formats

⚖️ Recommended Hybrid Strategy:

Implementation Plan:

  • Critical Assets Offline: Core UI, essential sounds (<20MB)
  • Lazy Loading: Download puzzle media when accessed
  • Smart Caching: Cache user's favorite puzzle categories
  • Offline Graceful Degradation: Basic functionality without media
  • CDN with Edge Caching: CloudFront for fast global delivery
  • Progressive Enhancement: Better experience with good connectivity

This approach provides fast initial loading while maintaining flexibility for content updates.
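
A minimal, platform-agnostic sketch of the lazy-loading cache at the heart of this strategy, written in Python for brevity; a real app would use the platform's native networking and storage APIs, and the cache directory here is an assumption.

```python
import hashlib
import pathlib
import urllib.request

CACHE_DIR = pathlib.Path("media_cache")  # illustrative on-device cache location
CACHE_DIR.mkdir(exist_ok=True)

def get_media(url):
    """Return media bytes, downloading from the CDN only on a cache miss."""
    cached = CACHE_DIR / hashlib.sha256(url.encode()).hexdigest()
    if cached.exists():
        return cached.read_bytes()             # cache hit: instant, works offline
    with urllib.request.urlopen(url) as resp:  # cache miss: lazy download
        data = resp.read()
    cached.write_bytes(data)                   # persist for later offline use
    return data
```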

🤖 ML Service Scaling
Variant 1: Sports News ML Service
An ML-based service exists for a sports news app. What should be kept in mind when evaluating the service's scaling needs for the next year?

🤔 Clarifying Questions:

  • What ML tasks does the service perform (recommendation, classification, NLP)?
  • What's the current traffic pattern - real-time or batch processing?
  • How computationally intensive are the ML models?
  • What's the expected user growth rate?
  • Are there seasonal patterns (sports seasons, major events)?
  • What's the current infrastructure (CPU, GPU, cloud)?
  • How frequently are models retrained?
  • What's the latency requirement for predictions?

🎯 Main Issues & Analysis:

Core Challenge: ML workloads are computationally expensive and have unpredictable scaling patterns

Key Considerations: Model inference latency, training scalability, seasonal traffic spikes, cost optimization

📊 Scaling Factors to Consider:

🚀 Traffic & User Growth:

  • User Base Expansion: Estimate 2-5x growth in active users
  • Geographic Expansion: New markets mean different sports, languages
  • Feature Expansion: New ML features increase compute requirements
  • Real-time vs Batch: Shift towards real-time personalization

⚡ Performance Requirements:

  • Latency: <100ms for real-time recommendations
  • Throughput: Handle 10K+ predictions per second
  • Availability: 99.95% uptime during peak sports events
  • Model Accuracy: Maintain quality while scaling

📈 Seasonal & Event-Based Scaling:

  • Sports Seasons: 10x traffic during major tournaments
  • Breaking News: Sudden spikes during major events
  • Time Zones: Global events create rolling traffic waves
  • Weekend Patterns: Higher engagement during games

📋 ML Service Requirements:

Functional Requirements:

  • Personalized sports news recommendations
  • Real-time content classification and tagging
  • Sentiment analysis for articles and comments
  • Trending topic detection
  • User engagement prediction

Non-Functional Requirements:

  • Scalability: Auto-scale from 1K to 100K+ requests/sec
  • Latency: <100ms P95 for inference
  • Availability: 99.95% uptime
  • Cost Efficiency: Optimize compute costs per prediction
  • Model Freshness: Daily model updates

🛠️ Scalable ML Architecture:

Model Serving: Kubernetes with HPA, TensorFlow Serving

Load Balancing: NGINX with health checks and circuit breakers

Caching: Redis for frequently requested predictions

GPU Infrastructure: AWS EC2 P4/G4 instances for training

Auto-scaling: KEDA for event-driven scaling

Monitoring: Prometheus + Grafana for model performance

Feature Store: Feast for consistent feature serving
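
As one concrete example of the serving layer above, a client can call a model hosted behind TensorFlow Serving's REST API; the model name (sports_recommender), host, and feature layout here are assumptions, while the default REST port 8501 and the {"instances": ...} payload shape come from TensorFlow Serving's documented API.

```python
import requests

# Hypothetical model name and feature layout; TensorFlow Serving's REST API
# listens on port 8501 by default and expects {"instances": [...]}.
PREDICT_URL = "http://model-serving:8501/v1/models/sports_recommender:predict"

def predict(user_features):
    payload = {"instances": [user_features]}
    # A tight timeout keeps the call inside the <100ms latency budget.
    resp = requests.post(PREDICT_URL, json=payload, timeout=0.1)
    resp.raise_for_status()
    return resp.json()["predictions"]
```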

๐Ÿ—๏ธ Scaling Strategies:

1. Horizontal Scaling:

  • Model Replicas: Deploy multiple model instances
  • Load Balancing: Distribute requests across replicas
  • Auto-scaling: Scale based on CPU, memory, and queue depth

2. Model Optimization:

  • Model Quantization: Reduce model size and inference time
  • Model Pruning: Remove unnecessary parameters
  • Ensemble Optimization: Use lighter models for real-time, heavy for batch

3. Caching & Preprocessing:

  • Result Caching: Cache common predictions
  • Feature Caching: Precompute expensive features
  • Batch Processing: Process similar requests together
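
A minimal sketch of the result-caching idea in strategy 3, assuming Redis as the cache, a hypothetical model_predict callable, and a 5-minute freshness window.

```python
import json
import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
TTL_SECONDS = 300  # assumption: 5-minute freshness is acceptable for recommendations

def cached_predict(user_id, model_predict):
    """Serve repeat requests from Redis; fall through to the model on a miss."""
    key = f"pred:{user_id}"
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    result = model_predict(user_id)                     # expensive inference call
    cache.set(key, json.dumps(result), ex=TTL_SECONDS)  # cache with expiry
    return result
```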

⚖️ Scaling Plan for Next Year:

Phase 1 (0-3 months):

  • Implement horizontal auto-scaling with Kubernetes
  • Add Redis caching for frequent predictions
  • Set up comprehensive monitoring and alerting

Phase 2 (3-6 months):

  • Model optimization and quantization
  • Implement feature store for consistency
  • Add GPU-based training pipeline

Phase 3 (6-12 months):

  • Multi-region deployment for global scaling
  • Advanced caching strategies
  • Cost optimization and right-sizing

Budget 3-5x current infrastructure costs to handle projected growth.

🖥️ Server Capacity Planning
Variant 1: Storage Size & CPU Load Calculation
Some inputs/numbers are given. Based on storage size and CPU/query load, calculate how many servers we need in each case.

🤔 Clarifying Questions:

  • What are the specific input numbers provided?
  • What's the expected query rate (QPS)?
  • What's the total storage requirement?
  • What's the per-server storage capacity?
  • What's the CPU utilization per query?
  • What's the target CPU utilization threshold?
  • Do we need replication for high availability?
  • Are there read vs write workload differences?

🎯 Main Issues & Analysis:

Core Challenge: Right-sizing infrastructure to handle both storage and compute requirements efficiently

Key Considerations: Storage vs compute bottlenecks, replication overhead, peak vs average load

📊 Capacity Planning Framework:

💾 Storage-Based Calculation:

Formula: Number of servers = (Total Storage Required × Replication Factor) / (Usable Storage per Server)

Example Calculation:

  • Total Storage Required: 10TB
  • Replication Factor: 3 (for high availability)
  • Server Storage: 2TB SSD (80% usable = 1.6TB)
  • Servers Needed: (10TB × 3) / 1.6TB = 18.75, rounded up to 19 servers

⚡ CPU/Query Load Calculation:

Formula: Number of servers = (Peak QPS × CPU per Query) / (CPU Cores × Target Utilization)

Example Calculation:

  • Peak QPS: 100,000 queries/second
  • CPU per Query: 10ms of CPU time
  • Server Specs: 16 cores, 2.5GHz
  • Target Utilization: 70%
  • Servers Needed: (100,000 × 0.01s) / (16 × 0.7) = 89.3, rounded up to 90 servers

📋 System Requirements:

Functional Requirements:

  • Handle specified query load with acceptable latency
  • Store required data with durability guarantees
  • Provide high availability and fault tolerance
  • Support expected growth patterns

Non-Functional Requirements:

  • Performance: <100ms P95 latency
  • Availability: 99.9% uptime
  • Scalability: Handle 2x growth without redesign
  • Cost: Optimize for cost per query/GB

🛠️ Capacity Planning Tools & Approach:

Monitoring: Prometheus + Grafana for resource utilization

Load Testing: JMeter/K6 for performance benchmarking

Capacity Modeling: Excel/Python for growth projections

Auto-scaling: AWS Auto Scaling Groups, Kubernetes HPA

Cost Optimization: Reserved instances, spot instances for batch jobs

๐Ÿ—๏ธ Detailed Calculation Steps:

Step 1: Gather Requirements

  • Peak QPS and average QPS
  • Storage requirements (current and projected)
  • Latency requirements (P95, P99)
  • Availability requirements
  • Budget constraints

Step 2: Server Specifications

  • CPU: Cores, clock speed, architecture
  • Memory: RAM size, type (DDR4/DDR5)
  • Storage: SSD/HDD, capacity, IOPS
  • Network: Bandwidth, latency

Step 3: Workload Analysis

  • CPU utilization per query type
  • Memory usage patterns
  • I/O patterns (read/write ratio)
  • Network bandwidth requirements

Step 4: Safety Margins

  • Peak load multiplier: 2-3x average
  • Growth buffer: 50-100% for next year
  • Failure tolerance: N+1 or N+2 redundancy
  • Maintenance windows: 10-15% overhead

⚖️ Final Recommendation:

Choose the Higher Number: Take the maximum of storage-based and CPU-based calculations

Example Final Calculation:

  • Storage-based: 19 servers
  • CPU-based: 90 servers
  • Choose: 90 servers (CPU is the bottleneck)
  • Add growth buffer: 90 × 1.5 = 135 servers
  • Add redundancy: 135 × 1.1 = 148.5, rounded up to 149 servers (recomputed in the sketch below)
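
A small sketch that reproduces the arithmetic above so the numbers can be recomputed as inputs change; all values are the example figures from this section, and the 80% usable-storage fraction matches the storage example.

```python
import math

def storage_servers(total_tb, replication, per_server_tb, usable_fraction=0.8):
    """Servers needed to hold the replicated data set."""
    usable = per_server_tb * usable_fraction
    return math.ceil(total_tb * replication / usable)

def cpu_servers(peak_qps, cpu_seconds_per_query, cores, target_utilization):
    """Servers needed to absorb peak CPU load at the target utilization."""
    return math.ceil(peak_qps * cpu_seconds_per_query / (cores * target_utilization))

# Example figures from this section
storage_based = storage_servers(total_tb=10, replication=3, per_server_tb=2)   # 19
cpu_based = cpu_servers(peak_qps=100_000, cpu_seconds_per_query=0.01,
                        cores=16, target_utilization=0.7)                      # 90
baseline = max(storage_based, cpu_based)        # take the higher number (CPU bottleneck)
with_growth = math.ceil(baseline * 1.5)         # 50% growth buffer -> 135
final_count = math.ceil(with_growth * 1.1)      # ~10% redundancy/overhead -> 149
print(storage_based, cpu_based, with_growth, final_count)
```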