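Design a Tagging Service

Difficulty: easy

Develop a scalable, flexible, and user-friendly tagging service that makes it efficient to categorize and retrieve digital items by tag. The system should support tagging many types of content, keep tags easy to manage and search, and provide mechanisms for tag suggestions and normalization.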

Solution

System requirements

Functional:

List functional requirements for the system (Ask interviewer if stuck)...

Users can add, delete, and update tags
Tags can be attached to different types of content
Support searching and filtering items by tag
Support tag recommendations (optional)
Support tag normalization

Non-Functional:

Highly available
Easy to scale
High performance: fast tag retrieval and search

Capacity estimation

Estimate the scale of the system you are going to design...

Assume there are 10,000 DAU and each user has 500 tags on average, so there are 500 * 10,000 = 5,000,000 tags in total.

Assume each user adds 5 tags per day; then 5 * 10,000 = 50,000 tags are added per day.

If 20% of the tags added per day are later updated, then 50,000 * 20% = 10,000 updates per day.

Assume each tag is 80 bytes; then we need at least 5,000,000 * 80 bytes = 400,000,000 bytes, or about 0.4 GB.
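
The arithmetic above can be checked with a short script. This is only a sketch that reuses the figures assumed in this section; the daily-growth and writes-per-second lines are extra derived numbers, not part of the original estimate, and the constant names are illustrative.

```python
# Back-of-the-envelope capacity estimate using the figures assumed above.
DAU = 10_000                      # daily active users
TAGS_PER_USER = 500               # average tags per user
NEW_TAGS_PER_USER_PER_DAY = 5
UPDATE_PERCENT = 20               # share of a day's new tags that get updated
BYTES_PER_TAG = 80
SECONDS_PER_DAY = 86_400

total_tags = DAU * TAGS_PER_USER
new_tags_per_day = DAU * NEW_TAGS_PER_USER_PER_DAY
updates_per_day = new_tags_per_day * UPDATE_PERCENT // 100
storage_bytes = total_tags * BYTES_PER_TAG

# Derived extras (not in the original estimate).
daily_growth_bytes = new_tags_per_day * BYTES_PER_TAG
writes_per_second = (new_tags_per_day + updates_per_day) / SECONDS_PER_DAY

print(f"total tags:       {total_tags:,}")
print(f"new tags per day: {new_tags_per_day:,}")
print(f"updates per day:  {updates_per_day:,}")
print(f"storage:          {storage_bytes / 1e9:.1f} GB")
print(f"daily growth:     {daily_growth_bytes / 1e6:.1f} MB")
print(f"average writes/s: {writes_per_second:.2f}")
```

At this scale the write rate averages well under one write per second, so a single relational instance with replicas is a plausible starting point.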

API design

Define what APIs are expected from the system...

RESTful APIs:

@Create

void create(String tagName)

@Batch_Create

void batchCreate(List<String> tagNames)

@Update

void update(Tag tag)

@Delete

void delete(Tag tag)

@Get

List<Tag> getTags()

List<Tag> recommendTags(String input)

List<Tag> normalizeTags(List<Tag> tags)

List<Entity> searchByTags(List<Tag> tags)
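
To make the intended behavior of these operations concrete, here is a minimal in-memory sketch. It is not the service implementation: the class name `InMemoryTagService`, the `tag_item` helper, the normalization rule (NFKC, trim, lowercase, whitespace collapsed to hyphens), and the AND semantics of `search_by_tags` are all assumptions made for illustration.

```python
import unicodedata
from dataclasses import dataclass
from itertools import count


def normalize(name: str) -> str:
    """Assumed rule: NFKC, trim, lowercase, whitespace runs become '-'."""
    cleaned = unicodedata.normalize("NFKC", name).strip().lower()
    return "-".join(cleaned.split())


@dataclass
class Tag:
    id: int
    name: str


class InMemoryTagService:
    def __init__(self) -> None:
        self._ids = count(1)
        self._by_name = {}        # normalized name -> Tag
        self._items_by_tag = {}   # tag id -> set of item ids

    def create(self, tag_name: str) -> Tag:
        name = normalize(tag_name)
        if name not in self._by_name:          # idempotent: reuse an existing tag
            self._by_name[name] = Tag(next(self._ids), name)
        return self._by_name[name]

    def batch_create(self, tag_names: list[str]) -> list[Tag]:
        return [self.create(n) for n in tag_names]

    def update(self, tag: Tag, new_name: str) -> Tag:
        self._by_name.pop(tag.name, None)
        tag.name = normalize(new_name)
        self._by_name[tag.name] = tag
        return tag

    def delete(self, tag: Tag) -> None:
        self._by_name.pop(tag.name, None)
        self._items_by_tag.pop(tag.id, None)

    def get_tags(self) -> list[Tag]:
        return sorted(self._by_name.values(), key=lambda t: t.id)

    def tag_item(self, item_id: str, tag: Tag) -> None:
        self._items_by_tag.setdefault(tag.id, set()).add(item_id)

    def search_by_tags(self, tags: list[Tag]) -> set[str]:
        """Items carrying ALL given tags (AND semantics is an assumption)."""
        sets = [self._items_by_tag.get(t.id, set()) for t in tags]
        return set.intersection(*sets) if sets else set()


svc = InMemoryTagService()
ml, py = svc.batch_create(["Machine  Learning", " Python "])
svc.tag_item("doc-1", ml)
svc.tag_item("doc-1", py)
svc.tag_item("doc-2", py)
print(ml.name)                       # machine-learning
print(svc.search_by_tags([ml, py]))  # {'doc-1'}
```

Normalizing inside `create` means "Python", " python ", and "PYTHON" all resolve to the same tag, which is the main point of the normalization requirement.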

Database design

Defining the system data model early on will clarify how data will flow among different components of the system. Also you could draw an ER diagram using the diagramming tool to enhance your design...

Data model:

Tag {

id: int (autoincrement),

name: varchar,

metadata: varchar

}

TagCategory {

id: int,

name: varchar

}

TagToCategoryMapping {

tagId: int,

categoryId: int

}

Choice of Database:

Relational database (e.g., PostgreSQL)

High-level design

You should identify enough components that are needed to solve the actual problem from end to end. Also remember to draw a block diagram using the diagramming tool to augment your design...

graph TD;
    A[User Interface] -- HTTP Requests -->  B[API Gateway];
    B -- Queries -->  C[Tagging Service];
    C -- Manages Data -->  D[Database];
    C -- Utilizes -->  E[NLP Module];
    E -- Provides Suggestions -->  C;
    C -- Sends Notifications -->  F[Message Broker];
    F -- Handles Messaging -->  C;
    B -- Forwards Requests -->  G[Microservice 1];
    G -- Interacts with -->  D;
    B -- Forwards Requests -->  H[Microservice 2];
    H -- Interacts with -->  D;

Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...

Detailed component design

Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...

- How do we efficiently store and retrieve tags in the database? (database schema design)

We store the tags in the relational database.

Indexing:

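Index the tag name so a tag can be looked up by name; the tag id is the primary key and is already indexed (PostgreSQL does not maintain clustered indexes automatically, so a regular B-tree index is used here)
Index the itemId in the item-to-tag mapping table for fast item retrieval (this assumes a mapping table between items and tags, which the data model above does not list)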

Data Storage mechanism:

Store metadata as JSON (for example, in a PostgreSQL jsonb column rather than the varchar shown in the data model)
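
The indexing ideas above can be sketched as follows. To keep the example runnable anywhere it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, so the DDL is SQLite syntax. The `item_tag` table, the index name, and the helper functions are hypothetical; the data model above does not define an item mapping.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tag (
    id       INTEGER PRIMARY KEY AUTOINCREMENT,
    name     TEXT NOT NULL UNIQUE,         -- UNIQUE also creates an index on name
    metadata TEXT NOT NULL DEFAULT '{}'    -- JSON stored as text in this sketch
);
CREATE TABLE item_tag (                    -- hypothetical item-to-tag mapping
    item_id TEXT    NOT NULL,
    tag_id  INTEGER NOT NULL REFERENCES tag(id),
    PRIMARY KEY (item_id, tag_id)          -- serves "which tags does this item have"
);
-- Serves "which items carry this tag".
CREATE INDEX idx_item_tag_tag ON item_tag (tag_id, item_id);
""")


def add_tag(name, metadata=None):
    cur = conn.execute(
        "INSERT INTO tag (name, metadata) VALUES (?, ?)",
        (name, json.dumps(metadata or {})),
    )
    return cur.lastrowid


def items_with_all_tags(names):
    """Return item ids that carry every tag in `names` (distinct names expected)."""
    marks = ",".join("?" * len(names))
    rows = conn.execute(
        f"""SELECT it.item_id
            FROM item_tag it JOIN tag t ON t.id = it.tag_id
            WHERE t.name IN ({marks})
            GROUP BY it.item_id
            HAVING COUNT(DISTINCT t.id) = ?
            ORDER BY it.item_id""",
        (*names, len(names)),
    ).fetchall()
    return [row[0] for row in rows]


py_id = add_tag("python", {"color": "blue"})
db_id = add_tag("database")
conn.executemany(
    "INSERT INTO item_tag (item_id, tag_id) VALUES (?, ?)",
    [("doc-1", py_id), ("doc-1", db_id), ("doc-2", py_id)],
)
print(items_with_all_tags(["python", "database"]))  # ['doc-1']
```

The composite primary key and the reversed secondary index cover both lookup directions, so neither "tags of an item" nor "items with a tag" needs a full table scan.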

- How to implement tag suggestions?

Use typeahead (prefix) search.

Create a table to store popular or trending tags for generating suggestions:

tagFrequency {

id: int -> primary key,

tagName: varchar,

frequency: int

}

We need a tagSuggestionService to do the recommendation.
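
One way such a suggestion service could answer typeahead queries is sketched below: keep tag names sorted, binary-search to the first name matching the prefix, then rank the matches by frequency. The class name `TagSuggester` and the sample frequencies are made up for illustration; a trie or a search engine would be reasonable alternatives.

```python
import bisect
import heapq


class TagSuggester:
    def __init__(self, frequencies: dict[str, int]) -> None:
        self._freq = dict(frequencies)      # tagName -> frequency
        self._names = sorted(self._freq)    # sorted so prefix matches are contiguous

    def suggest(self, prefix: str, limit: int = 5) -> list[str]:
        prefix = prefix.strip().lower()
        if not prefix:
            return []
        # All names starting with `prefix` form one run beginning at this index.
        i = bisect.bisect_left(self._names, prefix)
        matches = []
        while i < len(self._names) and self._names[i].startswith(prefix):
            matches.append(self._names[i])
            i += 1
        # Most frequent first; ties broken alphabetically for stable output.
        return heapq.nsmallest(limit, matches, key=lambda n: (-self._freq[n], n))


suggester = TagSuggester(
    {"python": 120, "pytorch": 80, "pandas": 95, "postgres": 60, "java": 200}
)
print(suggester.suggest("py"))           # ['python', 'pytorch']
print(suggester.suggest("p", limit=3))   # ['python', 'pandas', 'pytorch']
```

The sorted list is rebuilt only when frequencies are reloaded, which suits a table of popular tags that changes far less often than it is read.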

Trade offs/Tech choices

Explain any trade offs you have made and why you made certain tech choices...

Trade-offs:

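Consistency vs. performance: normalizing every tag consistently costs extra computation on each write, which can reduce throughput
Scalability vs. flexibility: a fixed tag schema is easier to index and shard, while free-form tag metadata is more flexible but harder to optimize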

Tech Choices:

Database:
Relational DB vs. NoSQL: NoSQL handles unstructured data well, which is useful for storing diverse tag structures
Use a microservices architecture: could improve scalability and maintainability
Utilize NLP for tag suggestions based on content analysis

A message broker such as Kafka could deliver real-time updates for tag suggestions
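
The event flow behind such real-time updates might look like the sketch below. It deliberately avoids any Kafka client API: a `deque` stands in for the broker topic, and the event shape (`type`, `tag`) and the event names `tag_added` and `tag_removed` are assumptions for illustration.

```python
from collections import Counter, deque

events = deque()            # stand-in for a broker topic
tag_frequency = Counter()   # in-memory mirror of the tagFrequency table


def publish(event_type: str, tag_name: str) -> None:
    """Producer side: the tagging service emits one event per tag change."""
    events.append({"type": event_type, "tag": tag_name})


def consume_all() -> int:
    """Consumer side: drain pending events and adjust frequencies."""
    processed = 0
    while events:
        event = events.popleft()
        delta = 1 if event["type"] == "tag_added" else -1
        tag_frequency[event["tag"]] += delta
        if tag_frequency[event["tag"]] <= 0:
            del tag_frequency[event["tag"]]   # drop tags that are no longer used
        processed += 1
    return processed


publish("tag_added", "python")
publish("tag_added", "python")
publish("tag_added", "java")
publish("tag_removed", "java")
print(consume_all())                 # 4
print(tag_frequency.most_common())   # [('python', 2)]
```

Decoupling the counter updates from the write path keeps tag creation fast, at the cost of suggestions lagging slightly behind the latest writes.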

Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.

SPOF (single point of failure): replicate data and use primary-replica (master-slave) servers

Utilize in-memory data stores like Redis for caching frequently accessed tags and metadata. Implement indexing on tag fields to speed up search operations. Utilize search technologies like Elasticsearch for efficient full-text search capabilities. Employ sharding techniques to distribute data across multiple nodes and balance the load effectively.

Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?

Score: 8