Anatomy of a System Design Interview

System design interviews are less about getting lucky and more about doing the hard work of acquiring knowledge. When companies ask design questions, they want to evaluate your design skills and your experience in designing large-scale distributed systems. In the end, your performance in these interviews depends on the following two factors.

Your knowledge — gained either through studying or practical experience.
Your ability to articulate your thoughts.

Here’s a 7-step framework that I recommend for approaching each problem. To keep the examples concrete, we will use a common interview question: Design a scalable service like Twitter.

Requirement Gathering

Many candidates think that system design interviews are all about “scale”, forgetting to put required emphasis on the “system” part of the interview.

You need to have a working “system” before you can scale it.

As the first step in your interview, you should ask questions to find the exact scope of the problem. Design questions are mostly open-ended, and they don’t have ONE correct answer. That’s why clarifying ambiguities early in the interview is critical. Candidates who spend time clearly defining the end goals of the system always have a better chance of success.

Here are some questions for designing Twitter that should be answered before moving on to the next steps:

Who can post a tweet? (answer: any user)
Who can read the tweet? (answer: any user — as all tweets are public)
Will a tweet contain photos or videos? (answer: for now, just photos)
Can a user follow another user? (answer: yes)
Can a user ‘like’ a tweet? (answer: yes)
What gets included in the user feed? (answer: tweets from everyone you are following)
Is the feed a list of tweets in chronological order? (answer: for now, yes)
Can a user search for tweets? (answer: yes)
Are we designing the client/server interaction, the backend architecture, or both? (answer: we want to understand the interaction between client and server, but we will focus on how to scale the backend)
How many total users are there? (answer: we expect to reach 200 million users in the first year)
How many daily active users are there? (answer: 100 million users sign in every day)

You may notice that some of these answers do not exactly match the real Twitter, and that’s OK. It’s a hypothetical problem geared towards evaluating your approach. You are asking these questions only to scope the problem that you are going to solve today. (For example, you no longer have to worry about handling videos or generating a timeline algorithmically.)

System interface definition

Define what APIs are expected from the system. This not only establishes the exact contract expected from the system but also helps verify that you haven’t gotten any requirements wrong. Some examples for our Twitter-like service would be:

postTweet(user_id, tweet_text, image_url, user_location, timestamp, …) 

generateTimeline(user_id, current_time) 

recordUserTweetLike(user_id, tweet_id, timestamp, …)
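One way to make this contract concrete is to write the three calls as method signatures. The sketch below is a minimal in-memory stand-in, not a real implementation. The class name TwitterService, the follow helper, the limit parameter, and the newest-first ordering are all assumptions added for illustration, and user_location is left out for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Tweet:
    tweet_id: int
    user_id: int
    text: str
    timestamp: float
    image_url: Optional[str] = None


class TwitterService:
    """In-memory stand-in used only to pin down the API contract."""

    def __init__(self) -> None:
        self._tweets: Dict[int, Tweet] = {}
        self._follows: Dict[int, Set[int]] = {}  # follower -> followees
        self._likes: Dict[int, Set[int]] = {}    # tweet_id -> users who liked it
        self._next_tweet_id = 1

    def follow(self, follower_id: int, followee_id: int) -> None:
        # Hypothetical helper, needed so that timelines have content.
        self._follows.setdefault(follower_id, set()).add(followee_id)

    def post_tweet(self, user_id: int, tweet_text: str, timestamp: float,
                   image_url: Optional[str] = None) -> int:
        tweet_id = self._next_tweet_id
        self._next_tweet_id += 1
        self._tweets[tweet_id] = Tweet(tweet_id, user_id, tweet_text,
                                       timestamp, image_url)
        return tweet_id

    def generate_timeline(self, user_id: int, current_time: float,
                          limit: int = 20) -> List[Tweet]:
        followees = self._follows.get(user_id, set())
        visible = [t for t in self._tweets.values()
                   if t.user_id in followees and t.timestamp <= current_time]
        # Newest first is an assumption; the requirements only say
        # "chronological order".
        return sorted(visible, key=lambda t: t.timestamp, reverse=True)[:limit]

    def record_user_tweet_like(self, user_id: int, tweet_id: int,
                               timestamp: float) -> int:
        # The timestamp is accepted for the contract but not stored here.
        if tweet_id not in self._tweets:
            raise KeyError(f"unknown tweet {tweet_id}")
        # A set makes repeated likes from the same user idempotent.
        self._likes.setdefault(tweet_id, set()).add(user_id)
        return len(self._likes[tweet_id])
```

Writing the signatures down this way tends to surface missing requirements quickly. For example, generate_timeline cannot return anything until the follow relationship exists.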

If you have gathered the requirements and can identify the APIs exposed by the system, you are 50% done.

Back-of-the-envelope capacity estimation

It’s always a good idea to estimate the scale of the system you’re going to design. This will also help later when you focus on scaling, partitioning, load balancing and caching.

What scale is expected from the system (e.g., number of new tweets, number of tweet views, number of timeline generations per second)?
How much storage will we need? (This will depend on whether users can upload photos and videos in their tweets.)
What network bandwidth usage are we expecting? This will be crucial in deciding how we manage traffic and balance load between servers.
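As a rough illustration, these questions can be turned into a few lines of arithmetic. Only the 100 million daily active users figure comes from the requirements above. Every other input (tweets per user, bytes per tweet, photo frequency, photo size, timeline loads per user) is a made-up assumption that you would agree on with your interviewer.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate_capacity(
    daily_active_users: int = 100_000_000,      # from the requirements
    tweets_per_user_per_day: int = 2,           # assumption
    bytes_per_tweet: int = 300,                 # assumption: text + metadata
    one_photo_every_n_tweets: int = 5,          # assumption
    bytes_per_photo: int = 200_000,             # assumption: ~200 KB
    timeline_loads_per_user_per_day: int = 5,   # assumption
) -> dict:
    """Back-of-the-envelope numbers for traffic, storage and bandwidth."""
    new_tweets_per_day = daily_active_users * tweets_per_user_per_day
    timeline_loads_per_day = daily_active_users * timeline_loads_per_user_per_day

    text_bytes_per_day = new_tweets_per_day * bytes_per_tweet
    photos_per_day = new_tweets_per_day // one_photo_every_n_tweets
    photo_bytes_per_day = photos_per_day * bytes_per_photo

    return {
        "new_tweets_per_sec": new_tweets_per_day / SECONDS_PER_DAY,
        "timeline_loads_per_sec": timeline_loads_per_day / SECONDS_PER_DAY,
        "text_gb_per_day": text_bytes_per_day / 1e9,
        "photo_tb_per_day": photo_bytes_per_day / 1e12,
        "ingress_mb_per_sec":
            (text_bytes_per_day + photo_bytes_per_day) / SECONDS_PER_DAY / 1e6,
    }


if __name__ == "__main__":
    for name, value in estimate_capacity().items():
        print(f"{name}: {value:,.1f}")
```

With these inputs, photos dominate storage and bandwidth: about 8 TB of photos per day against about 60 GB of tweet text. This is the kind of observation that steers the later storage and caching discussion.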

Defining the data model

Defining the data model early will clarify how data will flow among different components of the system. Later, it will guide you towards better data partitioning and management. The candidate should be able to identify the various entities of the system, how they will interact with each other, and the different aspects of data management such as storage, transfer, and encryption. Here are some entities for our Twitter-like service:

User: UserID, Name, Email, DoB, CreationDate, LastLogin, etc.
Tweet: TweetID, Content, TweetLocation, NumberOfLikes, TimeStamp, etc.
UserFollows: UserID1, UserID2
FavoriteTweets: UserID, TweetID, TimeStamp
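The entities above could be sketched as plain record types before committing to any database. The field types, the snake_case names, the author field on Tweet, and the reading of UserID1 as the follower and UserID2 as the followee are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass(frozen=True)
class User:
    user_id: int
    name: str
    email: str
    dob: date
    creation_date: datetime
    last_login: datetime


@dataclass
class Tweet:
    tweet_id: int
    user_id: int                   # assumed: the author, not in the list above
    content: str
    tweet_location: Optional[str]  # assumed optional
    number_of_likes: int
    timestamp: datetime


@dataclass(frozen=True)
class UserFollows:
    user_id1: int  # assumed to be the follower
    user_id2: int  # assumed to be the followee


@dataclass(frozen=True)
class FavoriteTweet:
    user_id: int
    tweet_id: int
    timestamp: datetime
```

Marking the relationship records as frozen makes them hashable, so a set of UserFollows rows naturally rejects duplicate follow edges. This mirrors what a composite primary key would do in a relational table.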

Which database system should we use? Would a NoSQL store like Cassandra best fit our needs, or should we use a MySQL-like relational solution? What kind of blob storage should we use to store photos and videos?

High-level design

Draw a block diagram with 5–6 boxes representing the core components of your system. You should identify enough components to solve the actual problem end to end.

For Twitter, at a high level, we would need multiple application servers to serve all the read/write requests, with load balancers in front of them to distribute traffic. If we’re assuming that we’ll have a lot more read traffic than write traffic, we can decide to have separate servers for handling reads vs. writes. On the backend, we need an efficient database that can store all the tweets and can support a huge number of reads. We would also need a distributed file storage system for storing photos and videos, and a search index and infrastructure to enable searching of tweets.
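To make the read/write split tangible, here is a toy load balancer that keeps two server pools and hands out servers round-robin. Treating GET and HEAD as reads, the server names, and the choice of round-robin are all assumptions for this sketch. A real load balancer would also track server health and load.

```python
from itertools import cycle
from typing import Iterable


class RoundRobinPool:
    """Hands out servers from a fixed list in rotation."""

    def __init__(self, servers: Iterable[str]) -> None:
        server_list = list(servers)
        if not server_list:
            raise ValueError("a pool needs at least one server")
        self._rotation = cycle(server_list)

    def pick(self) -> str:
        return next(self._rotation)


class LoadBalancer:
    """Routes reads and writes to separate application server pools."""

    READ_METHODS = {"GET", "HEAD"}  # assumption: everything else is a write

    def __init__(self, read_servers: Iterable[str],
                 write_servers: Iterable[str]) -> None:
        self._read_pool = RoundRobinPool(read_servers)
        self._write_pool = RoundRobinPool(write_servers)

    def route(self, http_method: str) -> str:
        if http_method.upper() in self.READ_METHODS:
            return self._read_pool.pick()
        return self._write_pool.pick()


if __name__ == "__main__":
    lb = LoadBalancer(["read-1", "read-2", "read-3"], ["write-1"])
    for method in ["GET", "GET", "POST", "GET", "GET"]:
        print(method, "->", lb.route(method))
```

Sizing the two pools independently is the point of the split: a read-heavy workload can add read servers without touching the write path.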

Detailed design for selected components

Dig deeper into 2–3 components; the interviewer’s feedback should always guide you towards the parts of the system they want you to explain further. You should be able to present different approaches, their pros and cons, and why you would choose one over the others. Remember that there is no single answer; the only important thing is to consider the tradeoffs between different options while keeping system constraints in mind. For instance:

Since we’ll be storing a huge amount of data, how should we partition our data to distribute it to multiple databases? Should we try to store all the data of a user on the same database? What issues can it cause?
How would we handle high-traffic users, e.g. celebrities who have millions of followers?
Since a user’s timeline will contain the most recent (and relevant) tweets, should we try to store our data in a way that is optimized for scanning the latest tweets?
How much caching should we introduce, and at which layer, to speed things up?
What components need better load balancing?
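The partitioning question can be explored with a small experiment. The sketch below compares two hypothetical strategies, sharding by user ID and sharding by tweet ID, using a stable hash. The function names, the shard count, and the use of MD5 as the hash are assumptions chosen for illustration only.

```python
import hashlib
from collections import Counter
from typing import Callable, Iterable, Tuple


def shard_for(key: int, num_shards: int) -> int:
    """Map a key to a shard using a hash that is stable across processes."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_by_user(user_id: int, tweet_id: int, num_shards: int) -> int:
    # All tweets of one user land on the same shard.
    return shard_for(user_id, num_shards)


def shard_by_tweet(user_id: int, tweet_id: int, num_shards: int) -> int:
    # Tweets of one user are spread across shards.
    return shard_for(tweet_id, num_shards)


def load_per_shard(
    tweets: Iterable[Tuple[int, int]],
    strategy: Callable[[int, int, int], int],
    num_shards: int,
) -> Counter:
    """Count how many (user_id, tweet_id) pairs each shard receives."""
    return Counter(strategy(u, t, num_shards) for u, t in tweets)


if __name__ == "__main__":
    # A single very active user (a "celebrity") posting 1000 tweets.
    celebrity_tweets = [(42, tweet_id) for tweet_id in range(1000)]
    print("by user: ", dict(load_per_shard(celebrity_tweets, shard_by_user, 8)))
    print("by tweet:", dict(load_per_shard(celebrity_tweets, shard_by_tweet, 8)))
```

Sharding by user keeps a user’s tweets together, which makes fetching them cheap but concentrates a celebrity’s traffic on one shard. Sharding by tweet spreads the load, at the cost of querying many shards to assemble one user’s tweets. Neither is the right answer by itself; the tradeoff is what the interviewer wants to hear.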

Identify & resolve bottlenecks

Try to discuss as many bottlenecks as possible and different approaches to mitigate them.

Is there any single point of failure in our system? What are we doing to mitigate it?
Do we have enough replicas of the data so that if we lose a few servers, we can still serve our users? (High Availability)
Similarly, do we have enough copies of different services running, such that a few failures will not cause total system shutdown?
How are we monitoring the performance of our service? Do we get alerts whenever critical components fail or their performance degrades?
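The replica question can be given a rough number. The sketch below assumes that replicas fail independently and that the data stays available as long as at least one replica is up. Both are simplifying assumptions (real failures are often correlated), and the function names are hypothetical.

```python
def availability_with_replicas(replica_availability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent copies is up."""
    if not 0.0 < replica_availability <= 1.0:
        raise ValueError("replica_availability must be in (0, 1]")
    if replicas < 1:
        raise ValueError("need at least one replica")
    return 1.0 - (1.0 - replica_availability) ** replicas


def replicas_needed(replica_availability: float, target: float) -> int:
    """Smallest replica count whose combined availability reaches `target`."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    count = 1
    while availability_with_replicas(replica_availability, count) < target:
        count += 1
    return count


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "replica(s):", availability_with_replicas(0.99, n))
    print("replicas for 99.999%:", replicas_needed(0.99, 0.99999))
```

Under these assumptions, each extra replica of a 99% available server adds roughly two nines, which is why a small number of replicas is often enough for data availability while correlated failures remain the real risk.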