设计一个嵌套评论系统

难度: easy

开发一个可扩展且高效的嵌套评论管理系统,使用户能够通过发表评论及以层级方式回复其他评论来参与讨论。该系统应支持实时更新,保持对话的顺序和结构,并确保在各种设备上提供流畅的用户体验。

Solution

System requirements

Functional:

  • The comments will be associated with a post
  • The comments will be ordered by time
  • Comments will contain information about the user, the time, and the comment that it is replying to, if relevant
  • You should be able to reply to other comments

Non-Functional:

  • System should be able to support up to 1000 nested comments
  • Comments section should be retrieved with as little latency as possible

Capacity estimation

Assumptions:

  • 10:1 read to write ratio
  • 5 million new comments a day.
  • Stores each comment for 10 years.
  • Assuming a comment has an average of 10 replies each

Comment meta data

  • PostID (8 bytes)
  • CommentID (8 Bytes)
  • UserID (8 Bytes)
  • Content (1KB)
  • Time (8 bytes)
  • Array of replies (80 bytes)

Storage Estimations:

  • 1 day -> 5 million * (8+8+8+8+1024+80) = 5.5GB a day
  • 1 year = 2 TB
  • 10 years = 20 TB

Bandwidth Estimations

  • Read requests
  • 50 million reads a day
  • 2.1 million reads an hour
  • 35,000 reads a minute
  • 600 reads a second
  • 666 KB per second
  • Write
  • 66.6 KB per second

Cache Estimations

  • Assuming that we want to cache 20% of the most popular data
  • If we are getting 50 million reads a day, we want to be able to cache 20% of 50 million -> 10 million requests
  • 10 million requests = 10 million * (8+8+8+8+1024+80) = 10.6 GB

API design

GET comments(postID)

  • Will be returned in a structural way where the comments are sorted in order of their time, then within each reply, it will also be sorted by the time

POST comment(postID)

  • Posts your comment on the post, your time will be recorded

POST reply(commentID)

  • Reply to a comment

Database design

We have two different types of database that we can use, lets see the pros and cons for both

Document DB

The first DB type is the DocumentDB, this is useful because our data (in particular the array of replies) is not structured, DocumentDB is useful for storing these types of unstructured data, it is easy to use and easy to query. However, DocumentDB is not useful when there are a lot of relations in the data, which is the case for our data.

GraphDB

The second type of DB is the GraphDB, this is useful when our data has a lot of deep relations. We can retrieve many all the replies in a post just from querying once, this is better compared to DocumentDB where you have to query the DB multiple times to get the necessary data.

The nature of our data is that it is interconnected, however, very rarely do we get super nested comments, although possible. Hence it is relational, but not too relational. In addition, we must ensure that the comments are retrieved in order of time, graphDB can do that as well. However we should minimize the quries to the DB as much as possible, hence because we can retrieve all the comments in a single query using GraphDB, we should use GraphDB

Hence, looking at the pros of GraphDB and the nature of our data, we will use GraphDB as our main database.

In terms of caching, we will use MemCache or Redis to keep the top 20% of comments. We should also keep comment section to indicate whether it has been modified (dirty bit), if so, we will have to query the DB again. We will use the Least Recently Used algorithm when evicting data from our Cache.

High-level design

Our client will point to our backend system which will point to the Cache and GraphDB

flowchart TD
    B[client] --> C{server}
    C --> D[Database]
  

Request flows

Scenario 1: Commenting on a post

Client calls the POST comment API to post a comment. The backend system will note the time that the request was made, in the DB, we will create a new edge from the postID to the comment

Scenario 2: Reply a comment

Client calls the POST reply API to post a reply. The backend system will note the time that the request was made, in the DB, we will create a new edge from the comment to the reply

Scenario 3: Retrieve comment section

Client calls the GET comment API to retrieve all the comments under a post by PostID. Using a single query, we can retrieve all the comments and the structure from the GraphDB. In the backend we can arrange them in a json format and return it to the client

Detailed component design

In our traffic estimates, we mentioned that we will have almost 50 million read requests a day, that is a lot of requests. Assuming that a server can handle 100 reads a second, we will need at least 6 servers to handle the demand. Assuming another 10 writes a second, we will need another server to handle this. We will also need backup servers, meaning a total of about 10 servers in our system.

Between the client and the servers, we will have a load balancer to distribute the load between the servers

In terms of DB sharding, what would be the most appropriate way to distribute the load in a DB. In our DB structure, the most appropriate way to distribute the load is in terms of the PostID in which the comment or reply is being made on.

We will have multiple DBs to ensure scalability and robustness.

Trade offs/Tech choices

The bad thing about how we shard our data is that if there are posts that are very popular and a lot of people are making comments and replies and reading from that post, then our DB will be overloaded. Hence what we can consider is instead of distributing our data in terms of PostID, we can distribute them in terms of the Root CommentID, meaning that instead of creating an edge between PostID and CommentID when we do POST comment, we will simply place the CommentID in the GraphDB, then when you reply a comment, we can create an edge between the commentID to the reply commentID.

This also means that we need a way to retrieve the root comments from a post, with this we could use a DocumentDB to keep track of the PostID and an array of root comments on a post, this documentDB can then be sharded by PostID using consistent hashing

Failure scenarios/bottlenecks

We discussed the possibility of popular posts in the previous section.

What if there is a spammer that tries to reply to their own comment multiple times? resulting in a hugely nested comment section, overloading our GraphDB and creating a bad user experience.

firstly, what we could do is to only retrieve the first 3 nested replies on a comment, this prevents the UI from displaying all the nested comments, which is not good for user experience, a load more option will be available if a user wants to see more replies.

Secondly, we could implement a RateLimiter for users, we could also limit the number of nested replies to just 1000.

Future improvements

Future improvements is to prevent the use of offensive language in the comment section, which will not be good for user experience as well. What we could do is to have an algorithm that detects offensive languages before inserting it into the DB, users who tries to comment using offensive language can subsequently be flagged by our system can banning them


得分: 9