Concepts

Basic Concepts for System Design

1. Scalability (Vertical vs. Horizontal)

Definition:

Scalability refers to a system’s ability to handle increased load by adding resources.

Types of Scalability:

Vertical Scaling (Scale Up):
- What it is: Adding more resources (CPU, RAM, storage) to a single machine.
- Pros:
  - Simpler to implement.
  - No need for distributed systems.
- Cons:
  - Limited by hardware constraints.
  - Single point of failure.
- Example: Upgrading a server from 16GB to 64GB of RAM.
Horizontal Scaling (Scale Out):
- What it is: Adding more machines to distribute the load.
- Pros:
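  - Not capped by the capacity of a single machine.
  - Can tolerate the failure of individual machines, provided shared components such as the load balancer are also redundant.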
- Cons:
  - More complex to implement (requires distributed systems).
  - May introduce consistency and coordination challenges.
- Example: Adding more servers to a web application cluster.
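
As a rough illustration of spreading load across machines, the sketch below assigns requests to servers in round-robin order. The server names and the `distribute` helper are hypothetical; a real deployment would normally rely on a dedicated load balancer rather than application code like this.

```python
from collections import Counter
from itertools import cycle


def distribute(requests, servers):
    """Assign each request to the next server in a repeating rotation."""
    rotation = cycle(servers)
    return {request: next(rotation) for request in requests}


# Hypothetical server names and request IDs, for illustration only.
servers = ["app-1", "app-2", "app-3"]
assignments = distribute(range(9), servers)

# Count how many requests each server received.
load_per_server = Counter(assignments.values())
print(load_per_server)
```

Adding a fourth name to `servers` lowers the share each machine handles, which is the core idea of scaling out.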

How to Present in Interviews:

Explain the trade-offs: Vertical scaling is simpler but limited, while horizontal scaling is more complex but offers better fault tolerance and scalability.
Use examples: Mention real-world systems like Netflix (horizontal scaling) vs. small-scale applications (vertical scaling).
Relate to the problem: If the system needs to handle millions of users, emphasize horizontal scaling.

2. Latency, Throughput, and Bandwidth

Definitions:

Latency:
- The time between sending a request and receiving its response, including network travel and server processing.
- Measured in milliseconds (ms).
- Example: API response time.
Throughput:
- The number of requests a system can handle per unit of time.
- Measured in requests per second (RPS).
- Example: Number of transactions processed by a database per second.
Bandwidth:
- The maximum amount of data that can be transferred over a network in a given time.
- Measured in bits per second (bps).
- Example: Network capacity of a server.
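
The three metrics can be tied together with some back-of-the-envelope arithmetic. The sketch below uses Little's Law (average requests in flight = throughput × latency) and a simple payload-over-bandwidth estimate. The function names and the sample figures are illustrative assumptions, not measurements.

```python
def required_concurrency(throughput_rps: float, latency_s: float) -> float:
    """Little's Law: average requests in flight = arrival rate * time in system."""
    return throughput_rps * latency_s


def transfer_time_s(payload_bytes: int, bandwidth_bps: float) -> float:
    """Time to push a payload through a link, ignoring latency and protocol overhead."""
    return payload_bytes * 8 / bandwidth_bps


# 10,000 requests/second at 200 ms each.
in_flight = required_concurrency(10_000, 0.2)
print(f"requests in flight: {in_flight:.0f}")

# A 1 MB payload over a 100 Mbps link.
seconds = transfer_time_s(1_000_000, 100_000_000)
print(f"transfer time: {seconds * 1000:.0f} ms")
```

Note that bandwidth is in bits per second while payloads are usually quoted in bytes, hence the factor of 8.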

How to Present in Interviews:

Explain the relationship: Lower latency improves user experience, while higher throughput ensures the system can handle more load.
Use metrics: State concrete targets (e.g., responses within 200 ms for a web app, 10,000 RPS for a high-traffic system).
Optimization strategies: Discuss techniques like caching (reduces latency), load balancing (improves throughput), and compression (saves bandwidth).
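
To make the caching point concrete, here is a minimal sketch using Python's built-in `functools.lru_cache`. The `slow_lookup` function and its 50 ms delay are stand-ins for a real database or remote call, chosen only for illustration.

```python
import time
from functools import lru_cache

backend_calls = 0  # counts how often the slow path actually runs


def slow_lookup(user_id: int) -> str:
    """Stand-in for a slow database or remote call (hypothetical)."""
    global backend_calls
    backend_calls += 1
    time.sleep(0.05)  # simulate 50 ms of backend latency
    return f"profile-{user_id}"


@lru_cache(maxsize=1024)
def cached_lookup(user_id: int) -> str:
    """Serve repeated requests from memory instead of the backend."""
    return slow_lookup(user_id)


def timed(fn, *args):
    """Return the function result and the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


_, cold = timed(cached_lookup, 42)  # cache miss: pays the backend latency
_, warm = timed(cached_lookup, 42)  # cache hit: served from memory
print(f"cold={cold * 1000:.1f} ms, warm={warm * 1000:.3f} ms")
```

An in-process cache like this only helps a single server; a shared cache is the usual choice once several servers need the same data.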

3. Availability and Reliability

Definitions:

Availability:
- The percentage of time a system is operational and accessible.
- Measured as a percentage (e.g., 99.9% uptime).
- Example: A system with 99.9% availability is down for ~8.76 hours/year.
Reliability:
- The probability that a system will perform its intended function without failure over a specified period.
- Example: A database that consistently returns correct results.
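
The downtime figure above follows from simple arithmetic, and the same arithmetic shows how chaining components lowers availability while redundancy raises it. The sketch below assumes component failures are independent, which real systems only approximate; the function names are illustrative.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year for a given availability fraction."""
    return (1 - availability) * HOURS_PER_YEAR


def serial_availability(*components: float) -> float:
    """All components must be up (e.g., load balancer, then app, then database)."""
    return math.prod(components)


def parallel_availability(*replicas: float) -> float:
    """At least one replica must be up, assuming independent failures."""
    return 1 - math.prod(1 - a for a in replicas)


for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {downtime_hours_per_year(target):.2f} hours/year")

print(f"two 99.9% components in series: {serial_availability(0.999, 0.999):.4%}")
print(f"two 99% replicas in parallel:   {parallel_availability(0.99, 0.99):.4%}")
```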

How to Present in Interviews:

Explain the difference: Availability is about being reachable (uptime), while reliability is about running without failures and returning correct results over time.
Discuss trade-offs: High availability often requires redundancy (e.g., multiple servers), which can increase complexity.
Mention techniques: Use replication, failover mechanisms, and monitoring to improve availability and reliability.

4. Consistency Models (Strong vs. Eventual)

Definitions:

Strong Consistency:
- Every read reflects the most recent successful write.
- Example: A single-node relational database (e.g., MySQL, PostgreSQL). Reads served from asynchronous replicas can still lag behind.
Eventual Consistency:
- Reads may return stale data initially, but the system will eventually converge to consistency.
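- Example: Distributed stores such as DynamoDB and Cassandra, which can serve eventually consistent reads but also let you request stronger consistency per operation.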
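
One common way replicated stores let you tune between these two models is quorum sizing: with N replicas, a write acknowledged by W of them, and a read that consults R of them, every read overlaps the latest acknowledged write when R + W > N. The sketch below checks that rule by brute force over all possible replica subsets. It is a simplified model (no failures, no concurrent writes), and the function name is an assumption for illustration.

```python
from itertools import combinations


def read_always_overlaps_write(n: int, w: int, r: int) -> bool:
    """Check every possible write set and read set for a shared replica."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


# Compare the brute-force result with the R + W > N shortcut.
for n, w, r in [(3, 2, 2), (3, 1, 1), (5, 3, 3), (5, 1, 5), (5, 2, 3)]:
    overlap = read_always_overlaps_write(n, w, r)
    print(f"N={n} W={w} R={r}: overlap={overlap}, rule={r + w > n}")
```

Smaller W and R give faster operations but allow stale reads; larger values trade latency for fresher data.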

How to Present in Interviews:

Explain use cases: Strong consistency is critical for financial systems, while eventual consistency is acceptable for social media feeds.
Discuss trade-offs: Strong consistency can increase latency, while eventual consistency improves availability and performance.
Mention CAP theorem: Relate consistency to the CAP theorem (see below).

5. CAP Theorem

Definition:

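The CAP theorem states that a distributed system cannot guarantee all three of the following properties at once. In practice, when a network partition occurs, the system must give up either consistency or availability: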

Consistency (C): All nodes see the same data at the same time.
Availability (A): Every request to a working node receives a non-error response, though it may not reflect the latest write.
Partition Tolerance (P): The system continues to operate despite network partitions.

Implications:

CA Systems: Provide consistency and availability only while no partition occurs (e.g., a single-node relational database). Any system spread across a network has to tolerate partitions, so this is rarely a real option for distributed designs.
CP Systems: Prioritize consistency and partition tolerance (e.g., distributed databases like MongoDB). Sacrifice availability during a partition.
AP Systems: Prioritize availability and partition tolerance (e.g., Cassandra, DynamoDB). Sacrifice consistency during a partition.
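
The difference between CP and AP behavior can be shown with a toy model. In the sketch below, a replica is cut off from its peers and therefore cannot know about newer writes; a CP replica refuses to answer, while an AP replica returns the value it has. The `Replica` class, the `Unavailable` exception, and the values are all hypothetical and do not reflect how any particular database is implemented.

```python
from dataclasses import dataclass


class Unavailable(Exception):
    """Raised when a replica declines to answer rather than risk stale data."""


@dataclass
class Replica:
    mode: str  # "CP" or "AP"
    value: str = "v1"
    partitioned: bool = False

    def read(self) -> str:
        # A CP replica gives up availability when it cannot confirm the latest value.
        if self.partitioned and self.mode == "CP":
            raise Unavailable("cannot confirm the latest value during a partition")
        # An AP replica answers with whatever it has, even if it may be stale.
        return self.value


def read_during_partition(mode: str) -> str:
    """Read from an isolated replica that missed a newer write ("v2") elsewhere."""
    isolated = Replica(mode=mode, partitioned=True)
    try:
        return isolated.read()
    except Unavailable:
        return "error"


print("CP:", read_during_partition("CP"))
print("AP:", read_during_partition("AP"))
```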

How to Present in Interviews:

Explain the trade-offs: Use real-world examples (e.g., banking systems are CP, social media is AP).
Relate to the problem: If the system must handle network failures, prioritize partition tolerance and choose between consistency and availability.
Mention BASE: Briefly introduce BASE (Basically Available, Soft state, Eventual consistency) as an alternative to ACID for AP systems.

How to Present These Concepts in Interviews

Start with Definitions:
- Clearly define the concept in simple terms.
- Use analogies or examples to make it relatable.
Explain Trade-offs:
- Discuss the pros and cons of each concept.
- Relate trade-offs to real-world systems.
Use Metrics and Examples:
- Provide specific numbers (e.g., latency in ms, availability percentages).
- Mention well-known systems that use these concepts.
Relate to the Problem:
- Tailor your explanation to the system being designed.
- Explain why a particular concept is relevant to the problem.
Be Clear and Concise:
- Avoid jargon unless necessary.
- Use diagrams or whiteboard sketches to illustrate your points.

Example Interview Response

Interviewer: “How would you design a system to handle high traffic?”

You:
“To handle high traffic, I’d focus on horizontal scalability by adding more servers behind a load balancer to distribute the load. This approach improves fault tolerance and avoids the single-machine ceiling of vertical scaling.
For latency, I’d use caching (e.g., Redis) to reduce response times and a CDN to serve static content closer to users.
To ensure availability, I’d design the system with redundancy and automatic failover, aiming for at least 99.9% uptime.
Since high-traffic systems often prioritize performance, I’d consider eventual consistency for non-critical data to improve throughput.
Finally, I’d keep the CAP theorem in mind, prioritizing availability and partition tolerance for a globally distributed system.”