难度: easy
Solution
System requirements
Functional:
- The comments will be associated with a post
- The comments will be ordered by time
- Comments will contain information about the user, the time, and the comment that it is replying to, if relevant
- You should be able to reply to other comments
Non-Functional:
- System should be able to support up to 1000 nested comments
- Comments section should be retrieved with as little latency as possible
Capacity estimation
Assumptions:
- 10:1 read to write ratio
- 5 million new comments a day.
- Stores each comment for 10 years.
- Assuming a comment has an average of 10 replies each
Comment meta data
- PostID (8 bytes)
- CommentID (8 Bytes)
- UserID (8 Bytes)
- Content (1KB)
- Time (8 bytes)
- Array of replies (80 bytes)
Storage Estimations:
- 1 day -> 5 million * (8+8+8+8+1024+80) = 5.5GB a day
- 1 year = 2 TB
- 10 years = 20 TB
Bandwidth Estimations
- Read requests
- 50 million reads a day
- 2.1 million reads an hour
- 35,000 reads a minute
- 600 reads a second
- 666 KB per second
- Write
- 66.6 KB per second
Cache Estimations
- Assuming that we want to cache 20% of the most popular data
- If we are getting 50 million reads a day, we want to be able to cache 20% of 50 million -> 10 million requests
- 10 million requests = 10 million * (8+8+8+8+1024+80) = 10.6 GB
API design
GET comments(postID)
- Will be returned in a structural way where the comments are sorted in order of their time, then within each reply, it will also be sorted by the time
POST comment(postID)
- Posts your comment on the post, your time will be recorded
POST reply(commentID)
- Reply to a comment
Database design
We have two different types of database that we can use, lets see the pros and cons for both
Document DB
The first DB type is the DocumentDB, this is useful because our data (in particular the array of replies) is not structured, DocumentDB is useful for storing these types of unstructured data, it is easy to use and easy to query. However, DocumentDB is not useful when there are a lot of relations in the data, which is the case for our data.
GraphDB
The second type of DB is the GraphDB, this is useful when our data has a lot of deep relations. We can retrieve many all the replies in a post just from querying once, this is better compared to DocumentDB where you have to query the DB multiple times to get the necessary data.
The nature of our data is that it is interconnected, however, very rarely do we get super nested comments, although possible. Hence it is relational, but not too relational. In addition, we must ensure that the comments are retrieved in order of time, graphDB can do that as well. However we should minimize the quries to the DB as much as possible, hence because we can retrieve all the comments in a single query using GraphDB, we should use GraphDB
Hence, looking at the pros of GraphDB and the nature of our data, we will use GraphDB as our main database.
In terms of caching, we will use MemCache or Redis to keep the top 20% of comments. We should also keep comment section to indicate whether it has been modified (dirty bit), if so, we will have to query the DB again. We will use the Least Recently Used algorithm when evicting data from our Cache.
High-level design
Our client will point to our backend system which will point to the Cache and GraphDB
flowchart TD B[client] --> C{server} C --> D[Database]
Request flows
Scenario 1: Commenting on a post
Client calls the POST comment API to post a comment. The backend system will note the time that the request was made, in the DB, we will create a new edge from the postID to the comment
Scenario 2: Reply a comment
Client calls the POST reply API to post a reply. The backend system will note the time that the request was made, in the DB, we will create a new edge from the comment to the reply
Scenario 3: Retrieve comment section
Client calls the GET comment API to retrieve all the comments under a post by PostID. Using a single query, we can retrieve all the comments and the structure from the GraphDB. In the backend we can arrange them in a json format and return it to the client
Detailed component design
In our traffic estimates, we mentioned that we will have almost 50 million read requests a day, that is a lot of requests. Assuming that a server can handle 100 reads a second, we will need at least 6 servers to handle the demand. Assuming another 10 writes a second, we will need another server to handle this. We will also need backup servers, meaning a total of about 10 servers in our system.
Between the client and the servers, we will have a load balancer to distribute the load between the servers
In terms of DB sharding, what would be the most appropriate way to distribute the load in a DB. In our DB structure, the most appropriate way to distribute the load is in terms of the PostID in which the comment or reply is being made on.
We will have multiple DBs to ensure scalability and robustness.
Trade offs/Tech choices
The bad thing about how we shard our data is that if there are posts that are very popular and a lot of people are making comments and replies and reading from that post, then our DB will be overloaded. Hence what we can consider is instead of distributing our data in terms of PostID, we can distribute them in terms of the Root CommentID, meaning that instead of creating an edge between PostID and CommentID when we do POST comment, we will simply place the CommentID in the GraphDB, then when you reply a comment, we can create an edge between the commentID to the reply commentID.
This also means that we need a way to retrieve the root comments from a post, with this we could use a DocumentDB to keep track of the PostID and an array of root comments on a post, this documentDB can then be sharded by PostID using consistent hashing
Failure scenarios/bottlenecks
We discussed the possibility of popular posts in the previous section.
What if there is a spammer that tries to reply to their own comment multiple times? resulting in a hugely nested comment section, overloading our GraphDB and creating a bad user experience.
firstly, what we could do is to only retrieve the first 3 nested replies on a comment, this prevents the UI from displaying all the nested comments, which is not good for user experience, a load more option will be available if a user wants to see more replies.
Secondly, we could implement a RateLimiter for users, we could also limit the number of nested replies to just 1000.
Future improvements
Future improvements is to prevent the use of offensive language in the comment section, which will not be good for user experience as well. What we could do is to have an algorithm that detects offensive languages before inserting it into the DB, users who tries to comment using offensive language can subsequently be flagged by our system can banning them
得分: 9