Twitter - System Design

First Created: 2021-01-23 23:10

A read-heavy system (reads far outnumber writes)

 

Main Functions
1. Users can post tweets (text, image, video)

2. Users can follow/unfollow other users

3. The service should display all of a user's posts, sorted by time, on their main page

4. A user can view the posts of the users they follow (their home timeline)

Non-functional requirements

Highly available

Eventual consistency is acceptable, and a latency of around 200 ms, for example, is fine.

Each tweet has a limit of 140 characters.

 

Traffic Estimates

Daily Active Users: 200 million

100 million new tweets per day → 100,000,000 / 86,400 ≈ 1,150 write QPS

 

System API

POST

post_tweet(api_dev_key, tweet_text, location, imageId, videoId)

get_tweets(api_dev_key, max_number_to_return, next_page_token)

follow(userId)

 

202 Accepted (write accepted; fanout continues asynchronously)

200 OK (read served successfully)
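As a rough illustration of the write path, here is a minimal sketch of a post_tweet handler, assuming a hypothetical in-memory tweet store and fanout queue (all names are illustrative, not from the source):

```python
import time
import uuid
from queue import Queue

TWEET_DB = {}           # tweetId -> tweet record (stand-in for the Tweet DB)
FANOUT_QUEUE = Queue()  # consumed asynchronously by the fanout service

def post_tweet(api_dev_key, tweet_text, location=None, image_id=None, video_id=None):
    """Validate, persist, and enqueue a tweet; returns (http_status, body)."""
    if not tweet_text or len(tweet_text) > 140:
        return 400, {"error": "tweet_text must be 1-140 characters"}
    tweet_id = str(uuid.uuid4())
    TWEET_DB[tweet_id] = {
        "tweetId": tweet_id,
        "text": tweet_text,
        "location": location,
        "imageId": image_id,
        "videoId": video_id,
        "timestamp": time.time(),
    }
    FANOUT_QUEUE.put(tweet_id)         # fanout to followers happens asynchronously
    return 202, {"tweetId": tweet_id}  # 202 Accepted: processing continues in the background
```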

  

Schema

User Profile DB

userId (Primary Key) | Name | Description | PhotoId | Email | DateOfBirth | CreationDate | Tweets[] | favoriteTweets[]

Tweet DB

TweetId (Primary Key) | userId | Text | ImageIds[] | VideoId[] | Latitude | Longitude | timestamp | Hashtag | (originTweetId in Re-tweet)

Follow DB (if SQL)

userId1, userId2 (composite Primary Key)

Tweet Favorite DB (if SQL)

FavoriteId (Primary Key) | TweetId | userId | timestamp

 

Image DB (the Video DB has the same structure)

ImageId (Primary Key) | image_url

Followee DB (especially for hot users)

userId (Primary Key) | userIds[] (the hot users this user follows, looked up at read time)
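As a rough sketch, the Tweet and Follow tables above could be modeled like this (field names mirror the schema; the concrete types are assumptions):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Tweet:
    tweet_id: str                          # Primary Key
    user_id: str
    text: str                              # at most 140 characters
    image_ids: List[str] = field(default_factory=list)
    video_ids: List[str] = field(default_factory=list)
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    timestamp: float = 0.0
    hashtag: Optional[str] = None
    origin_tweet_id: Optional[str] = None  # set only when this is a re-tweet

@dataclass
class Follow:
    user_id1: str  # follower
    user_id2: str  # followee; (user_id1, user_id2) forms the composite primary key
```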

 

FANOUT Service

Async

Non-celebrity (fan-out on write):

Save the tweet to the DB/cache

Fetch all followers of user A (the author)

Push the tweet into each follower's queue/in-memory timeline

Finally, every follower sees the tweet in their timeline

 

Celebrity (fan-out on read):

Pre-compute the timeline from non-celebrity tweets only

Merge in the celebrities' tweets at request time (see the sketch below)

 

Skip pre-computation for inactive users (no login for more than 15 days)
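A minimal sketch of both fanout paths, assuming hypothetical in-memory maps for follower lists and timelines (all structures and names are illustrative):

```python
from collections import defaultdict

FOLLOWERS = defaultdict(set)        # userId -> set of follower userIds
CELEB_FOLLOWEES = defaultdict(set)  # userId -> celebrities this user follows
TIMELINES = defaultdict(list)       # userId -> pre-computed list of (timestamp, tweetId)
CELEB_TWEETS = defaultdict(list)    # celebrity userId -> list of (timestamp, tweetId)
ACTIVE_USERS = set()                # users who have logged in within the last 15 days

def fanout_on_write(author_id, tweet_id, ts):
    """Non-celebrity path: push the new tweet into every active follower's timeline."""
    for follower_id in FOLLOWERS[author_id]:
        if follower_id in ACTIVE_USERS:          # skip inactive users
            TIMELINES[follower_id].append((ts, tweet_id))

def read_timeline(user_id, limit=20):
    """Read path: merge the pre-computed timeline with celebrity tweets at request time."""
    merged = list(TIMELINES[user_id])
    for celeb_id in CELEB_FOLLOWEES[user_id]:
        merged.extend(CELEB_TWEETS[celeb_id])    # fan-out on read for hot users
    merged.sort(reverse=True)                    # newest first
    return [tweet_id for _, tweet_id in merged[:limit]]
```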

 

Infra

[Clients] -> [Load Balancers] -> [Application Servers] <-> [Databases / File Storages] 

 

Sharding 

1. By userId: hot-user problem; tweets are not uniformly distributed across shards.

2. By tweetId: every query must hit all shards; the application server merges the results and returns the top tweets.

3. By creation time: the latest tweets are quick to find, but traffic concentrates on the small set of servers holding recent tweets. Not recommended.

4. Combine the timestamp into the tweetId: epoch time + an auto-incrementing sequence, packed into 64 bits (8 bytes). See the sketch below.
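One possible way to pack the creation time into a 64-bit tweetId (the exact bit layout below, 32 bits of epoch seconds plus a 32-bit sequence, is an assumption for illustration):

```python
import itertools
import time

_SEQUENCE = itertools.count()  # auto-incrementing sequence; a real system would scope this per second/shard

def make_tweet_id():
    """Pack epoch seconds into the high 32 bits and the sequence into the low 32 bits."""
    epoch_seconds = int(time.time())
    seq = next(_SEQUENCE) & 0xFFFFFFFF  # keep only the low 32 bits of the counter
    return (epoch_seconds << 32) | seq

def creation_time(tweet_id):
    """Recover the creation timestamp directly from the id, without any extra lookup."""
    return tweet_id >> 32
```

Because such ids sort by creation time, tweets sharded by tweetId can still be scanned in rough time order.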

 

Cache

- Improves read performance and reduces database pressure

- Eviction policy: least recently used (LRU)

- Cache roughly the 20% of tweets (from the past 3 days) that receive 80% of read traffic; this sizes the cache

- Because each cache server supports only a limited number of connections, the cache should be split across multiple servers

- Celebrities' timelines should stay in the cache

- Key: userId; value: the user's tweets in a doubly linked list, kept in descending time order (see the sketch below)
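A minimal LRU sketch along these lines, using Python's OrderedDict (itself backed by a doubly linked list); the capacity and value layout are assumptions:

```python
from collections import OrderedDict, deque

class TimelineCache:
    """LRU cache: key = userId, value = the user's tweetIds in descending time order."""

    def __init__(self, capacity=10000):
        self.capacity = capacity
        self._cache = OrderedDict()              # iteration order doubles as recency order

    def get(self, user_id):
        if user_id not in self._cache:
            return None                          # miss: caller falls back to the database
        self._cache.move_to_end(user_id)         # mark as most recently used
        return self._cache[user_id]

    def put(self, user_id, tweet_ids):
        self._cache[user_id] = deque(tweet_ids)  # newest tweet first
        self._cache.move_to_end(user_id)
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)      # evict the least recently used entry
```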

 

Monitor/Metrics

Number of tweets

Latency

 

Trending Topics/Top news

Derived from search queries, hashtags, and re-tweet counts (see the sketch below)
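A toy sketch of hashtag-based trending over a recent window of tweets (window selection and ranking weights are out of scope and assumed):

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")

def trending_hashtags(recent_tweet_texts, top_k=10):
    """Count hashtag occurrences in a recent window and return the top_k most frequent."""
    counts = Counter()
    for text in recent_tweet_texts:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts.most_common(top_k)

# trending_hashtags(["#SystemDesign rocks", "love #systemdesign", "#scaling"])
#   -> [("systemdesign", 2), ("scaling", 1)]
```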

 

Search

API : search_tweets(api_dev_key, key_word, user_location, sort_method, max_number_to_return, next_page_token) 

- Schema

ID, Word/Term, Document IDs (an inverted full-text index: each term maps to the IDs of the tweets containing it)

[Index Server] handles index reads and index updates

- Partition by term/word: drawbacks are (1) a popular word can map to a very large list of document (tweet) IDs, and (2) queries for a hot term hit only one server, creating a hotspot

- Partition by status (tweet) ID: every query fans out to all index servers, and the results are aggregated before returning to the user (see the sketch below)
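A minimal inverted-index sketch matching the schema above; the sharding comment reflects the partition-by-status-ID option (class and function names are illustrative):

```python
import re
from collections import defaultdict

WORD = re.compile(r"\w+")

class InvertedIndex:
    """Maps each term to the set of tweetIds whose text contains it."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, tweet_id, text):
        for term in WORD.findall(text.lower()):
            self._index[term].add(tweet_id)

    def search(self, key_word):
        return self._index.get(key_word.lower(), set())

# Partitioned by tweetId, each index server holds an InvertedIndex over its own tweets;
# a search queries every server and unions (aggregates) the returned id sets.
```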

- Fault Tolerance

Every index server has a backup (a synchronized secondary on a different rack in the same data center).

Use an [Index-Builder Server] to rebuild a dead index server.

 

