YouTube - System Design

CREATED 2021-04-19 10:25PM

 

Functional Requirements: upload, view, share, like/dislike, comment, search, change resolution...

Non-Functional Requirements: reliable (no data loss), available, real-time (CDN*)...

Future: Recommendations, Top K popular...

 

Scale of the system - data size

DAU 5 million. 10% of the DAU upload a video every day. Average video size is 300MB.

5 million * 10% * 300 MB = 150 TB of new storage per day.

upload:view ratio - 1:200
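
The estimate above can be checked with a few lines of arithmetic. This is a minimal sketch: the variable names are made up for illustration, decimal units are assumed (1 TB = 1,000,000 MB), and the 1:200 ratio is read as 200 views per uploaded video.

```python
# Back-of-envelope estimate using the figures from the notes.
DAU = 5_000_000          # daily active users
UPLOAD_PERCENT = 10      # percent of DAU that upload one video per day
AVG_VIDEO_MB = 300       # average video size in MB
VIEWS_PER_UPLOAD = 200   # upload:view ratio of 1:200 (assumed: views per upload)

uploads_per_day = DAU * UPLOAD_PERCENT // 100
# Decimal units assumed: 1 TB = 1,000,000 MB.
daily_storage_tb = uploads_per_day * AVG_VIDEO_MB / 1_000_000
views_per_day = uploads_per_day * VIEWS_PER_UPLOAD

print(uploads_per_day)   # 500000
print(daily_storage_tb)  # 150.0
print(views_per_day)     # 100000000
```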

 

APIs

uploadVideo(api_dev_key, video_title, video_description, tags[], category_id, default_language, recording_details, video_contents)

searchVideo(api_dev_key, search_word, user_location, maximum_videos_to_return, next_page_token)

streamVideo(api_dev_key, video_id, offset, codec, resolution)
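
To illustrate how maximum_videos_to_return and next_page_token might work together in searchVideo, here is a small in-memory sketch. Everything beyond the parameter names is an assumption: the CATALOG dictionary stands in for a real search index, the token is simply the next offset encoded as a string, and api_dev_key and user_location are left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SearchPage:
    video_ids: List[str]
    next_page_token: Optional[str]  # None when there are no more results


# Hypothetical catalog standing in for the real search index.
CATALOG = {
    "v1": "cat plays piano",
    "v2": "system design interview",
    "v3": "cat vs cucumber",
    "v4": "funny cat compilation",
}


def search_video(search_word: str, maximum_videos_to_return: int,
                 next_page_token: Optional[str] = None) -> SearchPage:
    """Return one page of matches; the token is the next offset as a string."""
    matches = sorted(vid for vid, title in CATALOG.items() if search_word in title)
    start = int(next_page_token) if next_page_token else 0
    end = start + maximum_videos_to_return
    token = str(end) if end < len(matches) else None
    return SearchPage(matches[start:end], token)


first = search_video("cat", 2)
second = search_video("cat", 2, first.next_page_token)
print(first.video_ids, first.next_page_token)    # ['v1', 'v3'] 2
print(second.video_ids, second.next_page_token)  # ['v4'] None
```

A production API would more likely use an opaque token so that clients cannot depend on its internal format.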

 

Schema

MySQL

1 Video metadata storage - MySQL

Video metadata can be stored in a SQL database. The following information should be stored with each video:

VideoID, Title, Timestamp, Description, Size, Thumbnail, User, Total number of likes, dislikes, Total number of views

2 User data storage - MySQL

UserID, Name, Email, Address, Age, etc.

(Billing information - MySQL)
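
The two tables above could look roughly like this. This is only a sketch: SQLite from the Python standard library stands in for MySQL, and the column names and types (for example storing the thumbnail as a URL) are assumptions rather than part of the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL in this sketch
conn.executescript("""
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    email   TEXT NOT NULL UNIQUE,
    address TEXT,
    age     INTEGER
);
CREATE TABLE videos (
    video_id      INTEGER PRIMARY KEY,
    title         TEXT NOT NULL,
    description   TEXT,
    uploaded_at   TEXT NOT NULL,     -- "Timestamp" in the notes
    size_bytes    INTEGER NOT NULL,
    thumbnail_url TEXT,              -- assumed: thumbnail stored as a URL
    user_id       INTEGER NOT NULL REFERENCES users(user_id),
    likes         INTEGER NOT NULL DEFAULT 0,
    dislikes      INTEGER NOT NULL DEFAULT 0,
    views         INTEGER NOT NULL DEFAULT 0
);
""")

conn.execute(
    "INSERT INTO users (user_id, name, email) VALUES (?, ?, ?)",
    (1, "alice", "alice@example.com"),
)
conn.execute(
    "INSERT INTO videos (video_id, title, uploaded_at, size_bytes, user_id) "
    "VALUES (?, ?, ?, ?, ?)",
    (100, "demo", "2021-04-19T22:25:00", 300_000_000, 1),
)
# Counting a view is a single-row update on the metadata table.
conn.execute("UPDATE videos SET views = views + 1 WHERE video_id = ?", (100,))
row = conn.execute(
    "SELECT title, views, likes FROM videos WHERE video_id = ?", (100,)
).fetchone()
print(row)  # ('demo', 1, 0)
```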

 

NoSQL

Cassandra. It can handle heavy write and read loads.

It stores users' viewing history.

 

The video chunks themselves are stored in S3.

 

Services

1 Get Video Service: -> Distributed Storage & Metadata Database

2 Upload Service: -> Distributed Storage & Metadata Database & Distributed Queue (tasks : Preprocessing, Video Splitter, Encoding)

3 Search Service
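
The Upload Service's use of a distributed queue can be sketched with an in-process queue. This is a toy model under stated assumptions: queue.Queue stands in for the distributed queue, the three stages run strictly in the order listed in the notes, and the function names are hypothetical.

```python
import queue

TASKS = ["preprocessing", "video_splitter", "encoding"]  # task names from the notes


def upload(video_id, task_queue):
    """Upload Service: after storing the raw file, enqueue the first task."""
    task_queue.put((video_id, 0))


def run_worker(task_queue, log):
    """Drain the queue; each finished task enqueues the next stage for that video."""
    while not task_queue.empty():
        video_id, stage = task_queue.get()
        log.append((video_id, TASKS[stage]))  # a real worker would do the work here
        if stage + 1 < len(TASKS):
            task_queue.put((video_id, stage + 1))


q = queue.Queue()
log = []
upload("v1", q)
upload("v2", q)
run_worker(q, log)
for entry in log:
    print(entry)
```

Because each stage re-enqueues the next one, work on different videos interleaves, and more workers could be added without changing the upload path.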

 

Read Heavy System

HDFS or GlusterFS (DDIA page 199)

Separate reads from writes.

Write to the primary/leader machines, then sync to the secondary/follower machines.

 

Video uploading flow & video streaming flow

Transcoding server -> Transcoded storage -> CDN.

            -> Completion events stored at completion queue. -> Completion handler updates metadata database and cache.

Videos are streamed directly from the CDN, using a popular streaming protocol such as MPEG-DASH.

 

Origin video -> Video (Tasks: inspection, transcoding, thumbnail, watermark...[Resource Manager (Task Schedulers)]) -> Assembler

                  -> Audio (Encoding)                                         -> Assembler 

                  -> Metadata

 

Segregate read traffic from write traffic.

We can use a primary-secondary configuration in which writes go to the primary first and are then applied to all the secondaries.
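
A toy version of the primary-secondary configuration is sketched below. It assumes synchronous replication and round-robin reads, neither of which the notes specify; the class and method names are hypothetical.

```python
import itertools


class ReplicatedStore:
    """Toy primary/secondary store: writes hit the primary, reads rotate over secondaries."""

    def __init__(self, num_secondaries):
        self.primary = {}
        self.secondaries = [{} for _ in range(num_secondaries)]
        self._next = itertools.cycle(range(num_secondaries))

    def write(self, key, value):
        self.primary[key] = value
        # Synchronous replication for simplicity; real systems often
        # replicate asynchronously, so reads may briefly be stale.
        for replica in self.secondaries:
            replica[key] = value

    def read(self, key):
        # Reads never touch the primary in this sketch.
        replica = self.secondaries[next(self._next)]
        return replica.get(key)


store = ReplicatedStore(num_secondaries=2)
store.write("video:42:title", "demo")
print(store.read("video:42:title"))  # demo
```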

 

Sharding (Trade off)

1. Sharding based on UserID

Hot-user problem: a popular user, or one with many videos, can overload a single shard.

Searching by video name requires querying all servers.

Unbalanced storage. Some users have a lot of videos while others have few.

2. Sharding based on VideoID

Popular videos create hot shards (need a cache).

A query is sent to all shards, each of which returns its matching videos; a centralized server then aggregates and ranks these results before returning them to the user.

3. Elasticsearch?

TODO
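
Sharding on VideoID with central aggregation can be sketched as follows. The shard count, the use of an MD5-based hash, and ranking by view count are all assumptions made for illustration; the notes do not say how results are ranked.

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count


def shard_for(video_id: str) -> int:
    """Stable hash of the VideoID modulo the shard count."""
    digest = hashlib.md5(video_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS


shards = [dict() for _ in range(NUM_SHARDS)]


def put_video(video_id, title, views):
    shards[shard_for(video_id)][video_id] = {"title": title, "views": views}


def search(word, top_k):
    """Scatter the query to every shard, then aggregate and rank centrally by views."""
    hits = []
    for shard in shards:
        hits.extend(
            (meta["views"], vid) for vid, meta in shard.items() if word in meta["title"]
        )
    return [vid for _, vid in sorted(hits, reverse=True)[:top_k]]


put_video("v1", "cat plays piano", 500)
put_video("v2", "dog learns tricks", 900)
put_video("v3", "cat vs cucumber", 1200)
print(search("cat", top_k=2))  # ['v3', 'v1']
```

A lookup by VideoID touches exactly one shard, while a search fans out to all of them, which is the trade-off listed under option 2.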

 

CDN

A CDN is a system of distributed servers that delivers web content to a user based on the user's geographic location. It replicates the content in multiple places around the world.

 

Error Handling

API server down: requests go to the next machine on the consistent-hashing ring.

DB server down: if the primary has an outage, a secondary is promoted to primary.
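
The API-server failover rule can be sketched with a minimal consistent-hash ring. This is an illustration only: the server names are hypothetical, the ring has no virtual nodes, and MD5 is used purely as a convenient stable hash.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    """Minimal consistent-hash ring without virtual nodes."""

    def __init__(self, servers):
        self._ring = sorted((_hash(s), s) for s in servers)

    def remove(self, server):
        self._ring = [(h, s) for h, s in self._ring if s != server]

    def lookup(self, key):
        """Return the first server clockwise from the key's position on the ring."""
        positions = [h for h, _ in self._ring]
        idx = bisect.bisect_right(positions, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["api-1", "api-2", "api-3"])  # hypothetical server names
owner = ring.lookup("user:1234")
ring.remove(owner)                  # simulate that server going down
fallback = ring.lookup("user:1234")
print(owner, "->", fallback)
```

Only the keys that belonged to the failed server move to its clockwise neighbour; keys owned by the other servers keep their assignment.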

posted @ 2021-04-27 13:51  YBgnAW