README.md |
The System Design Primer
Motivation
Learn how to design large scale systems from the open source community.
Understand real-world architectures.
Prep for the system design interview.
Learn how to design large scale systems
Learning how to design scalable systems will make you a better engineer.
System design is a broad topic. There is a vast amount of resources scattered throughout the web on system design principles.
This repo is an organized collection of resources to help you learn how to build systems at scale.
Topics for learning system design:
Learn from the open source community
This is an early draft of a continually updated, open source project.
Contributions are welcome!
Prep for the system design interview
In addition to coding interviews, system design is a required component of the technical interview process at many tech companies.
Practice common system design interview questions and compare your results with sample discussions, code, and diagrams.
Additional topics for interview prep:
- Study guide
- How to approach a system design interview question
- System design interview questions, with solutions
- Object-oriented design interview questions, with solutions
- Additional system design interview questions
For interviews, do I need to know everything here?
No, you don't need to know everything here to prepare for the interview.
What you are asked in an interview depends on variables such as:
- How much experience you have
- What your technical background is
- What positions you are interviewing for
- Which companies you are interviewing with
- Luck
More experienced candidates are generally expected to know more about system design. Architects or team leads might be expected to know more than individual contributors. Top tech companies are likely to have one or more design interview rounds.
Any resources to prep for coding interviews?
Check out the sister repo interactive-coding-challenges for coding interview resources.
Contributing
Learn from the community.
Feel free to submit pull requests to help:
- Fix errors
- Improve sections
- Add new sections
Content that needs some polishing is placed under development.
Review the Contributing Guidelines.
Index of system design topics
Summaries of various system design topics, including pros and cons. Everything is a trade-off.
Each section contains links to more in-depth resources.
- System design topics: start here
- Performance vs scalability
- Latency vs throughput
- Availability vs consistency
- Consistency patterns
- Availability patterns
- Domain name system
- Content delivery network
- Load balancer
- Reverse proxy (web server)
- Application layer
- Database
- Cache
- Asynchronism
- Communication
- Security
- Appendix
- Under development
- Credits
- Contact info
- License
Study guide
Suggested topics to review based on your interview timeline (short, medium, long).
Start broad and go deeper in a few areas. It helps to know a little about various key system design topics. Adjust the following guide based on your experience, what positions you are interviewing for, and which companies you are interviewing with.
- Short - Aim for breadth with system design topics. Practice by solving some interview questions.
- Medium - Aim for breadth and some depth with system design topics. Practice by solving a many interview questions.
- Long - Aim for breadth and more depth with system design topics. Practice by solving a most interview questions.
Short | Medium | Long | |
---|---|---|---|
Read through the System design topics to get a broad understanding of how systems work | 👍 | 👍 | 👍 |
Read through a few articles in the Company engineering blogs for the companies you are interviewing with | 👍 | 👍 | 👍 |
Read through a few Real world architectures | 👍 | 👍 | 👍 |
Review How to approach a system design interview question | 👍 | 👍 | 👍 |
Work through System design interview questions with solutions | Some | Many | Most |
Work through Object-oriented design interview questions with solutions | Some | Many | Most |
Review Additional system design interview questions | Some | Many | Most |
How to approach a system design interview question
How to tackle a system design interview question.
The system design interview is an open-ended conversation. You are expected to lead it.
You can use the following steps to guide the discussion. To help solidify this process, work through the System design interview questions with solutions section using the following steps.
Step 1: Outline use cases, constraints, and assumptions
Gather requirements and scope the problem. Ask questions to clarify use cases and constraints. Discuss assumptions.
- Who is going to use it?
- How are they going to use it?
- How many users are there?
- What does the system do?
- What are the inputs and outputs of the system?
- How much data do we expect to handle?
- How many requests per second do we expect?
- What is the expected read to write ratio?
Step 2: Create a high level design
Outline a high level design with all important components.
- Sketch the main components and connections
- Justify your ideas
Step 3: Design core components
Dive into details for each core component. For example, if you were asked to design a url shortening service, discuss:
- Generating and storing a hash of the full url
- Translating a hashed url to the full url
- Database lookup
- API and object-oriented design
Step 4: Scale the design
Identify and address bottlenecks, given the constraints. For example, do you need the following to address scalability issues?
- Load balancer
- Horizontal scaling
- Caching
- Database sharding
Discuss potential solutions and trade-offs. Everything is a trade-off. Address bottlenecks using principles of scalable system design.
Back-of-the-envelope calculations
You might be asked to do some estimates by hand. Refer to the Appendix for the following resources:
- Use back of the envelope calculations
- Powers of two table
- Latency numbers every programmer should know
Source(s) and further reading
Check out the following links to get a better idea of what to expect:
- How to ace a systems design interview
- The system design interview
- Intro to Architecture and Systems Design Interviews
System design interview questions with solutions
Common system design interview questions with sample discussions, code, and diagrams.
Solutions linked to content in the
solutions/
folder.
Question | |
---|---|
Design Pastebin.com (or Bit.ly) | Solution |
Design the Twitter timeline (or Facebook feed) Design Twitter search (or Facebook search) |
Solution |
Design a web crawler | Solution |
Design Mint.com | Solution |
Design the data structures for a social network | Solution |
Design a key-value store for a search engine | Solution |
Design Amazon's sales ranking by category feature | Solution |
Design a system that scales to millions of users on AWS | Solution |
Add a system design question | Contribute |
Design Pastebin.com (or Bit.ly)
Design the Twitter timeline and search (or Facebook feed and search)
Design a web crawler
Design Mint.com
Design the data structures for a social network
Design a key-value store for a search engine
Design Amazon's sales ranking by category feature
Design a system that scales to millions of users on AWS
Object-oriented design interview questions with solutions
Common object-oriented design interview questions with sample discussions, code, and diagrams.
Solutions linked to content in the
solutions/
folder.
Note: This section is under development
Question | |
---|---|
Design a deck of cards to be used for blackjack | Solution |
Design a call center | Solution |
Design a hash map | Solution |
Design a least recently used cache | Solution |
Design a parking lot | Solution |
Design a chat server | Solution |
Design a circular array | Contribute |
Add an object-oriented design question | Contribute |
Additional system design interview questions
Common system design interview questions, with links to resources on how to solve each.
Question | Reference(s) |
---|---|
Design a file sync service like Dropbox | youtube.com |
Design a search engine like Google | queue.acm.org stackexchange.com ardendertat.com stanford.edu |
Design a scalable web crawler like Google | quora.com |
Design Google docs | code.google.com neil.fraser.name |
Design a key-value store like Redis | slideshare.net |
Design a cache system like Memcached | slideshare.net |
Design a recommendation system like Amazon's | hulu.com ijcai13.org |
Design a tinyurl system like Bitly | n00tc0d3r.blogspot.com |
Design a chat app like WhatsApp | highscalability.com |
Design a picture sharing system like Instagram | highscalability.com highscalability.com |
Design the Facebook news feed function | quora.com quora.com slideshare.net |
Design the Facebook timeline function | facebook.com highscalability.com |
Design the Facebook chat function | erlang-factory.com facebook.com |
Design a graph search function like Facebook's | facebook.com facebook.com facebook.com |
Design a content delivery network like CloudFlare | cmu.edu |
Design a trending topic system like Twitter's | michael-noll.com snikolov .wordpress.com |
Design a random ID generation system | blog.twitter.com github.com |
Return the top k requests during a time interval | ucsb.edu wpi.edu |
Design a system that serves data from multiple data centers | highscalability.com |
Design an online multiplayer card game | indieflashblog.com buildnewgames.com |
Design a garbage collection system | stuffwithstuff.com washington.edu |
Add a system design question | Contribute |
Real world architectures
Articles on how real world systems are designed.
Source: Twitter timelines at scale
Don't focus on nitty gritty details for the following articles, instead:
- Identify shared principles, common technologies, and patterns within these articles
- Study what problems are solved by each component, where it works, where it doesn't
- Review the lessons learned
Type | System | Reference(s) |
---|---|---|
Data processing | MapReduce - Distributed data processing from Google | research.google.com |
Data processing | Spark - Distributed data processing from Databricks | slideshare.net |
Data processing | Storm - Distributed data processing from Twitter | slideshare.net |
Data store | Bigtable - Distributed column-oriented database from Google | harvard.edu |
Data store | HBase - Open source implementation of Bigtable | slideshare.net |
Data store | Cassandra - Distributed column-oriented database from Facebook | slideshare.net |
Data store | DynamoDB - Document-oriented database from Amazon | harvard.edu |
Data store | MongoDB - Document-oriented database | slideshare.net |
Data store | Spanner - Globally-distributed database from Google | research.google.com |
Data store | Memcached - Distributed memory caching system | slideshare.net |
Data store | Redis - Distributed memory caching system with persistence and value types | slideshare.net |
File system | Google File System (GFS) - Distributed file system | research.google.com |
File system | Hadoop File System (HDFS) - Open source implementation of GFS | apache.org |
Misc | Chubby - Lock service for loosely-coupled distributed systems from Google | research.google.com |
Misc | Dapper - Distributed systems tracing infrastructure | research.google.com |
Misc | Kafka - Pub/sub message queue from LinkedIn | slideshare.net |
Misc | Zookeeper - Centralized infrastructure and services enabling synchronization | slideshare.net |
Add an architecture | Contribute |
Company architectures
Company engineering blogs
Architectures for companies you are interviewing with.
Questions you encounter might be from the same domain.
- Airbnb Engineering
- Atlassian Developers
- Autodesk Engineering
- AWS Blog
- Bitly Engineering Blog
- Box Blogs
- Cloudera Developer Blog
- Dropbox Tech Blog
- Engineering at Quora
- Ebay Tech Blog
- Evernote Tech Blog
- Etsy Code as Craft
- Facebook Engineering
- Flickr Code
- Foursquare Engineering Blog
- GitHub Engineering Blog
- Google Research Blog
- Groupon Engineering Blog
- Heroku Engineering Blog
- Hubspot Engineering Blog
- High Scalability
- Instagram Engineering
- Intel Software Blog
- Jane Street Tech Blog
- LinkedIn Engineering
- Microsoft Engineering
- Microsoft Python Engineering
- Netflix Tech Blog
- Paypal Developer Blog
- Pinterest Engineering Blog
- Quora Engineering
- Reddit Blog
- Salesforce Engineering Blog
- Slack Engineering Blog
- Spotify Labs
- Twilio Engineering Blog
- Twitter Engineering
- Uber Engineering Blog
- Yahoo Engineering Blog
- Yelp Engineering Blog
- Zynga Engineering Blog
Source(s) and further reading
System design topics: start here
New to system design?
First, you'll need a basic understanding of common principles, learning about what they are, how they are used, and their pros and cons.
Step 1: Review the scalability video lecture
Scalability Lecture at Harvard
- Topics covered:
- Vertical scaling
- Horizontal scaling
- Caching
- Load balancing
- Database replication
- Database partitioning
Step 2: Review the scalability article
- Topics covered:
Next steps
Next, we'll look at high-level trade-offs:
- Performance vs scalability
- Latency vs throughput
- Availability vs consistency
Keep in mind that everything is a trade-off.
Then we'll dive into more specific topics such as DNS, CDNs, and load balancers.
Performance vs scalability
A service is scalable if it results in increased performance in a manner proportional to resources added. Generally, increasing performance means serving more units of work, but it can also be to handle larger units of work, such as when datasets grow.1
Another way to look at performance vs scalability:
- If you have a performance problem, your system is slow for a single user.
- If you have a scalability problem, your system is fast for a single user but slow under heavy load.
Source(s) and further reading
Latency vs throughput
Latency is the time to perform some action or to produce some result.
Throughput is the number of such actions or results per unit of time.
Generally, you should aim for maximal throughput with acceptable latency.
Source(s) and further reading
Availability vs consistency
CAP theorem
In a distributed computer system, you can only support two of the following guarantees:
- Consistency - Every read receives the most recent write or an error
- Availability - Every request receives a response, without guarantee that it contains the most recent version of the information
- Partition Tolerance - The system continues to operate despite arbitrary partitioning due to network failures
Networks aren't reliable, so you'll need to support partition tolerance. You'll need to make a software tradeoff between consistency and availability.
CP - consistency and partition tolerance
Waiting for a response from the partitioned node might result in a timeout error. CP is a good choice if your business needs require atomic reads and writes.
AP - availability and partition tolerance
Responses return the most recent version of the data, which might not be the latest. Writes might take some time to propagate when the partition is resolved.
AP is a good choice if the business needs allow for eventual consistency or when the system needs to continue working despite external errors.
Source(s) and further reading
Consistency patterns
With multiple copies of the same data, we are faced with options on how to synchronize them so clients have a consistent view of the data. Recall the definition of consistency from the CAP theorem - Every read receives the most recent write or an error.
Weak consistency
After a write, reads may or may not see it. A best effort approach is taken.
This approach is seen in systems such as memcached. Weak consistency works well in real time use cases such as VoIP, video chat, and realtime multiplayer games. For example, if you are on a phone call and lose reception for a few seconds, when you regain connection you do not hear what was spoken during connection loss.
Eventual consistency
After a write, reads will eventually see it (typically within milliseconds). Data is replicated asynchronously.
This approach is seen in systems such as DNS and email. Eventual consistency works well in highly available systems.
Strong consistency
After a write, reads will see it. Data is replicated synchronously.
This approach is seen in file systems and RDBMSes. Strong consistency works well in systems that need transactions.
Source(s) and further reading
Availability patterns
There are two main patterns to support high availability: fail-over and replication.
Fail-over
Active-passive
With active-passive fail-over, heartbeats are sent between the active and the passive server on standby. If the heartbeat is interrupted, the passive server takes over the active's IP address and resumes service.
The length of downtime is determined by whether the passive server is already running in 'hot' standy or whether it needs to start up from 'cold' standby. Only the active server handles traffic.
Active-passive failover can also be referred to as master-slave failover.
Active-active
In active-active, both servers are managing traffic, spreading the load between them.
If the servers are public-facing, the DNS would need to know about the public IPs of both servers. If the servers are internal-facing, application logic would need to know about both servers.
Active-active failover can also be referred to as master-master failover.
Disadvantage(s): failover
- Fail-over adds more hardware and additional complexity.
- There is a potential for loss of data if the active system fails before any newly written data can be replicated to the passive.
Replication
Master-slave and master-master
This topic is further discussed in the Database section:
Domain name system
Source: DNS security presentation
A Domain Name System (DNS) translates a domain name such as www.example.com to an IP address.
DNS is hierarchical, with a few authoritative servers at the top level. Your router or ISP provides information about which DNS server(s) to contact when doing a lookup. Lower level DNS servers cache mappings, which could become stale due to DNS propagation delays. DNS results can also be cached by your browser or OS for a certain period of time, determined by the time to live (TTL).
- NS record (name server) - Specifies the DNS servers for your domain/subdomain.
- MX record (mail exchange) - Specifies the mail servers for accepting messages.
- A record (address) - Points a name to an IP address.
- CNAME (canonical) - Points a name to another name or
CNAME
(example.com to www.example.com) or to anA
record.
Services such as CloudFlare and Route 53 provide managed DNS services. Some DNS services can route traffic through various methods:
- Weighted round robin
- Prevent traffic from going to servers under maintenance
- Balance between varying cluster sizes
- A/B testing
- Latency-based
- Geolocation-based
Disadvantage(s): DNS
- Accessing a DNS server introduces a slight delay, although mitigated by caching described above.
- DNS server management could be complex, although they are generally managed by governments, ISPs, and large companies.
- DNS services have recently come under DDoS attack, preventing users from accessing websites such as Twitter without knowing Twitter's IP address(es).