Design an Online Pastebin

Difficulty: easy

Design a service similar to Pastebin. Pastebin is a website where users can store text online for a certain period of time.

Solution

System requirements

Functional:

  • Should be able to store texts up to 5KB in size
  • Should be able to store them for up to 10 years
  • Should be able to access stored texts through APIs
  • Should be able to set privacy settings for stored texts: if public, anyone with the correct API key can access the text; if private, only the owner can access it

Non-Functional:

  • The service should be reliable: texts must be stored for 10 years without being lost
  • The service should be available: anyone, anywhere, at any time can access the texts
  • The service should be robust: able to handle hundreds of thousands of requests per second while still returning texts with low latency

Capacity estimation

Storage estimations

  • Assuming that we receive 500,000 new texts to store per day and each text is 5KB
  • 1 day = 500,000 * 5KB = 2,500,000KB ≈ 2,440MB ≈ 2.5GB
  • 1 year = 2.5GB * 365 ≈ 910GB
  • 10 years ≈ 9,100GB ≈ 9TB

Traffic Estimations

  • Assuming a 1 : 10 write-to-read ratio
  • Read traffic:
  • 1 day = 5,000,000 requests
  • 1 hour ≈ 210,000 requests
  • 1 minute ≈ 3,500 requests
  • 1 second ≈ 60 requests ≈ 300KB/s
  • Write traffic:
  • 1 second ≈ 6 requests ≈ 30KB/s

Cache Storage Estimations

  • We will most likely want to cache the top 20% most-accessed texts in our DB over a month
  • Cache storage = 2,440MB * 30 days * 20% ≈ 14.6GB ≈ 15GB
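As a quick sanity check, the estimates above can be reproduced with a few lines of arithmetic (a sketch; the daily volume, text size, and read ratio are the assumptions stated above):

```python
KB, GB, TB = 1024, 1024**3, 1024**4

texts_per_day = 500_000                         # assumed write volume
text_size = 5 * KB                              # assumed max text size

daily_storage = texts_per_day * text_size       # ~2.4 GB/day
ten_year_storage = daily_storage * 365 * 10     # ~8.5 TB

reads_per_day = texts_per_day * 10              # 1:10 write-to-read ratio
reads_per_second = reads_per_day / (24 * 3600)  # ~58 requests/s
read_bandwidth = reads_per_second * text_size   # ~290 KB/s

cache_storage = daily_storage * 30 * 0.20       # top 20% of a month's texts

print(f"daily storage:   {daily_storage / GB:.1f} GB")
print(f"10-year storage: {ten_year_storage / TB:.1f} TB")
print(f"reads: {reads_per_second:.0f}/s ({read_bandwidth / KB:.0f} KB/s)")
print(f"cache: {cache_storage / GB:.1f} GB")
```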

API design

store_text(text, api_dev_key, privacy=private)

  • Stores the text in the Pastebin; privacy is set to private by default. The API returns a URL that allows you to access your text

retrieve_text(api_dev_key, unique_key)

  • Checks the privacy settings; if the caller is authorised, retrieves the text from the DB

view_texts(api_dev_key)

  • Retrieves the list of texts stored by the user, along with the API URLs that allow access to them
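A minimal sketch of these three endpoints as plain Python functions (the in-memory dicts, the example URL, and the exact signatures are assumptions for illustration, not the real storage layer):

```python
import secrets

# Hypothetical in-memory stand-ins for the real DB.
texts_db = {}    # unique_key -> {"text": ..., "user": ..., "privacy": ...}
user_texts = {}  # api_dev_key -> [unique_key, ...]

def store_text(text, api_dev_key, privacy="private"):
    """Store a text and return the URL used to retrieve it."""
    assert len(text.encode()) <= 5 * 1024, "texts are capped at 5KB"
    unique_key = secrets.token_urlsafe(6)  # stand-in for the key-generation service
    texts_db[unique_key] = {"text": text, "user": api_dev_key, "privacy": privacy}
    user_texts.setdefault(api_dev_key, []).append(unique_key)
    return f"https://pastebin.example.com/{unique_key}"

def retrieve_text(api_dev_key, unique_key):
    """Return the text if it is public or owned by the caller."""
    record = texts_db.get(unique_key)
    if record is None:
        return None
    if record["privacy"] == "private" and record["user"] != api_dev_key:
        raise PermissionError("private text: owner only")
    return record["text"]

def view_texts(api_dev_key):
    """List the URLs of all texts stored by this user."""
    return [f"https://pastebin.example.com/{k}" for k in user_texts.get(api_dev_key, [])]
```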

Database design

Let's define the data that is stored:

Text

  • text
  • user
  • key
  • expiry_date
  • privacy

User

  • dev_key

There are relations between the user and text, but we also need to ensure that our service is scalable (in case it becomes more popular and attracts more users), so we should use a DB that scales easily. In this case we could use a NoSQL document DB to store the data.
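For illustration, a single text record in a document store might look like the following (the field names follow the schema above; the exact shape and values are assumptions):

```python
# Hypothetical document for one stored text, following the fields above.
text_document = {
    "key": "aK3x9Q",                       # 6-character unique key
    "user": "dev_key_123",                 # owner's api_dev_key
    "text": "hello world",                 # the stored text, up to 5KB
    "privacy": "private",                  # "public" or "private"
    "expiry_date": "2035-01-01T00:00:00Z", # 10 years from creation
}
```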

In terms of caching, we could use Memcached or Redis to store the top 20% most popular texts, placed between the backend servers and the DB. We will use a Least Recently Used (LRU) policy when evicting data from the cache. The cache key is the unique_key generated for each text, and the value is the text itself.
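A minimal in-process sketch of the LRU behaviour described above, using an OrderedDict (a real deployment would use Redis or Memcached configured with an LRU eviction policy instead):

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used entry once capacity is reached."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # unique_key -> text

    def get(self, unique_key):
        if unique_key not in self.entries:
            return None
        self.entries.move_to_end(unique_key)  # mark as recently used
        return self.entries[unique_key]

    def put(self, unique_key, text):
        if unique_key in self.entries:
            self.entries.move_to_end(unique_key)
        self.entries[unique_key] = text
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop the least recently used
```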

High-level design

The client sends requests to the servers, which retrieve data from either the cache or the DB and return it to the user.

```mermaid
flowchart TD
    B[Client] --> C[Server]
    C --> D[(Cache)]
    C --> E[(Database)]
```

Request flows

We will have a client that sends read requests to a backend server. The server first checks whether the requested text is in the cache; if not, it retrieves the text from the DB, updates the cache, and returns it to the user.

For write requests, we generate a unique key for the text we want to store, then store the key and the text in the DB. We also store them in the cache, on the assumption that recently stored texts are likely to be read soon. A sketch of both flows follows.
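Putting the two flows together (reusing texts_db and LRUCache from the earlier sketches; generate_unique_key is the key service sketched in the next section):

```python
cache = LRUCache(capacity=100_000)

def handle_read(unique_key):
    # 1. Try the cache first.
    text = cache.get(unique_key)
    if text is not None:
        return text
    # 2. On a miss, fall back to the DB and warm the cache.
    record = texts_db.get(unique_key)  # in-memory stand-in for the DB
    if record is None:
        return None
    cache.put(unique_key, record["text"])
    return record["text"]

def handle_write(text, api_dev_key, privacy="private"):
    # Pluck a pre-generated key, persist, and pre-warm the cache,
    # assuming fresh texts are likely to be read soon.
    unique_key = generate_unique_key()  # sketched in the next section
    texts_db[unique_key] = {"text": text, "user": api_dev_key, "privacy": privacy}
    cache.put(unique_key, text)
    return unique_key
```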

Detailed component design

You might first wonder: how can we generate a unique key for each text in our DB? Assuming 500,000 new texts per day, that is almost 2 billion unique keys over 10 years!

A potential method is to associate each text with a sequential number: 1, 2, 3, and so on. However, I don't think this is a good method, because the keys become easily guessable, meaning someone might be able to access a text that was never intended to be accessed.

We can have a service that continually generates unique 6-character strings to use as keys; each string must first be checked against the DB to ensure it is not already in use. Using a-z, A-Z, and 0-9 as the alphabet, a 6-character string gives 62^6 ≈ 5.7x10^10 (about 57 billion) possible keys, far more than the ~2 billion we need. Generating keys this way can take quite a while, since every candidate has to be verified as unique against the existing keys. Hence, we could have a separate server whose sole purpose is to generate unique keys ahead of time and store them in a unique-key DB. When writing a text into our main DB, the main service can simply pluck a pre-generated key from the unique-key DB.
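A sketch of the key-generation service's core loop (the used_keys set stands in for the unique-key DB; its interface is an assumption):

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # a-z, A-Z, 0-9: 62 characters
KEY_LENGTH = 6                                   # 62**6 ≈ 5.7e10 possible keys

used_keys = set()  # stand-in for the unique-key DB

def generate_unique_key():
    """Draw random 6-character keys until an unused one is found."""
    while True:
        key = "".join(secrets.choice(ALPHABET) for _ in range(KEY_LENGTH))
        if key not in used_keys:
            used_keys.add(key)
            return key
```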

We will most likely have multiple DBs to store our data. This ensures we don't have a single point of failure, and lets us distribute the load across multiple DBs. However, how can we distribute the data so that the load is spread evenly?

One method is to shard the texts by user id, so all the texts a user creates are stored in one DB. The downside is that texts are accessed by key, meaning the server would potentially have to search every DB to find the correct text. A better suggestion is to distribute the texts by their keys, for example by the first letter; more generally, we can have a hash function that takes the key and outputs the DB id to search, as sketched below.
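A sketch of that key-based sharding (NUM_DBS and the choice of md5 are assumptions; md5 is used because Python's built-in hash is not stable across processes):

```python
import hashlib

NUM_DBS = 16  # assumed shard count

def shard_for_key(unique_key):
    """Map a text's unique key to the DB shard that stores it."""
    digest = hashlib.md5(unique_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_DBS
```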

We can also have a server whose purpose is to clear expired data from the DB.
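A sketch of that cleanup job (db_delete_expired_before is a hypothetical batch-delete call, and the hourly interval is an assumption):

```python
import time
from datetime import datetime, timezone

def cleanup_loop():
    """Periodically delete texts whose expiry_date has passed."""
    while True:
        now = datetime.now(timezone.utc)
        db_delete_expired_before(now)  # hypothetical: DELETE WHERE expiry_date < now
        time.sleep(3600)               # run once an hour
```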

Trade offs/Tech choices

Since we store data in our DBs by unique_key, when a user calls our view_texts(api_dev_key) API, our service has to comb through all the DBs to find every text associated with that user.

However, I believe this is a necessary trade-off, because I am assuming the ratio of retrieve_text to view_texts calls could be 1000 : 1, so it is better for retrieve_text to be the more efficient operation.

Failure scenarios/bottlenecks

Since we want our system to be robust, available, and reliable, we will employ multiple servers and backup DBs to ensure that our data does not get lost.

Something we don't want is a single user creating too many texts; a user with malicious intent could deliberately try to overload our service. Hence we might want a limiter that caps both the number of texts per api_dev_key and the rate at which they are created.
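A minimal token-bucket sketch of such a rate limiter (the per-key capacity and refill rate are assumptions):

```python
import time

class TokenBucket:
    """Allows bursts up to `capacity`, refilling `rate` tokens per second."""

    def __init__(self, capacity=10, rate=1.0):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}  # api_dev_key -> TokenBucket

def allow_store(api_dev_key):
    """Gate store_text calls per api_dev_key."""
    bucket = buckets.setdefault(api_dev_key, TokenBucket())
    return bucket.allow()
```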

Future improvements

In the future we could allow users to set the expiry date for their data themselves. Most of the time, users might not want to store their texts for more than a year; offering this option would mean more data gets cleared from our DB, potentially reducing our storage requirements.


Score: 9