Design a Key-Value Store

Difficulty: medium

Design a high-performance distributed key-value store that provides fast data access, scalability, and high availability. The system should support set, get, delete, and conditional-update operations with minimal latency. Features should include data replication across multiple nodes for fault tolerance, partitioning techniques to distribute data evenly across the cluster, and persistence options to protect data. The system must handle a high volume of read/write operations efficiently and provide consistency and durability guarantees. Examples include Redis and DynamoDB.

Solution

System requirements

Functional:

List functional requirements for the system (Ask the chat bot for hints if stuck.)...

keys can be overwritten

values can be in any format - just a byte array

"key-level transactionality"

  • avoid conflicts on the same key

Non-Functional:

List non-functional requirements for the system...

100K reads/sec

10K writes/sec

read-heavy system

avg value size: 1 KB

highly available, low latency

tunable consistency

  • w + r > n for "strong consistency"
  • configurations can be relaxed for "weak consistency" (stale reads possible) or "eventual consistency"

value size limit of 1 MB

Capacity estimation

Estimate the scale of the system you are going to design...

1e3 bytes/write * 1e4 writes/sec * 3e7 sec/yr => 3e14 bytes/yr => ~300 TB/yr

1e3 bytes/read * 1e5 reads/sec => 1e8 bytes/sec => 100 MB/sec
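
A quick sanity check of the arithmetic in Go, using the numbers above:

```go
package main

import "fmt"

// Back-of-envelope check of the estimates above.
func main() {
	const (
		valueBytes   = 1_000      // avg value ≈ 1 KB
		writesPerSec = 10_000     // 10K writes/sec
		readsPerSec  = 100_000    // 100K reads/sec
		secPerYear   = 30_000_000 // ≈ 3e7 seconds per year
	)
	fmt.Printf("write volume: %.0f TB/yr\n", float64(valueBytes*writesPerSec*secPerYear)/1e12) // ~300 TB/yr
	fmt.Printf("read bandwidth: %.0f MB/sec\n", float64(valueBytes*readsPerSec)/1e6)           // ~100 MB/sec
}
```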

API design

Define what APIs are expected from the system...

write(key: []byte, value: []byte)

read(key: []byte) -> []byte
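
A minimal sketch of this API as a Go interface. The Store name and error-return conventions are assumptions; Delete and CompareAndSet are included because the problem statement asks for delete and conditional updates.

```go
package kvstore

// Store is a hypothetical Go shape for the API above; naming and
// error conventions are assumptions for illustration.
type Store interface {
	// Write sets or overwrites the value for key (values are raw bytes).
	Write(key, value []byte) error
	// Read returns the stored value, or an error if the key is absent.
	Read(key []byte) ([]byte, error)
	// Delete removes the key if present (from the problem statement).
	Delete(key []byte) error
	// CompareAndSet writes newVal only if the current value equals
	// oldVal, supporting the key-level transactionality requirement.
	CompareAndSet(key, oldVal, newVal []byte) (bool, error)
}
```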

Database design

Defining the system data model early on will clarify how data will flow among different components of the system. Also you could draw an ER diagram using the diagramming tool to enhance your design...

distribute load evenly across servers via consistent hashing (see the ring sketch after this list)

each shard has 2 replicas, across multiple regions

use a gossip protocol to detect failures, and quorum consensus to elect a new leader during failover

w + r > n
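
A minimal consistent-hash ring sketch. SHA-1 positions, the vnode count, and server names are illustrative assumptions; the replica set would be the next distinct servers clockwise from the owner.

```go
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
	"sort"
)

// Ring maps keys to servers via points on a hash ring.
type Ring struct {
	points []uint32          // sorted positions on the ring
	owner  map[uint32]string // position -> server
}

func hash32(s string) uint32 {
	sum := sha1.Sum([]byte(s))
	return binary.BigEndian.Uint32(sum[:4])
}

func NewRing(servers []string, vnodes int) *Ring {
	r := &Ring{owner: map[uint32]string{}}
	for _, s := range servers {
		for v := 0; v < vnodes; v++ {
			p := hash32(fmt.Sprintf("%s#%d", s, v)) // virtual node position
			r.points = append(r.points, p)
			r.owner[p] = s
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Lookup walks clockwise from the key's hash to the first server point.
func (r *Ring) Lookup(key []byte) string {
	h := hash32(string(key))
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	ring := NewRing([]string{"node-a", "node-b", "node-c"}, 100)
	fmt.Println(ring.Lookup([]byte("user:42")))
}
```

Adding or removing one server only remaps the keys adjacent to its vnode positions, which is why consistent hashing minimizes data movement.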

High-level design

You should identify enough components that are needed to solve the actual problem from end to end. Also remember to draw a block diagram using the diagramming tool to augment your design. If you are unfamiliar with the tool, you can simply describe your design to the chat bot and ask it to generate a starter diagram for you to modify...

client: http client

coordinator: a "partition-aware proxy" that routes each read/write to the appropriate partition

distributed databases: each running a hash table, flushing writes to disk

```mermaid
flowchart TD
    B[client] --> C{coordinator}
    C --> D[(distributed databases)]
```

Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...

when a write is made, we first append it to the commit log (WAL), which provides REDO information, then apply it to an in-memory buffer (memtable) that is later flushed to disk as an SSTable (or append-only file).
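
A sketch of that write path, assuming a length-prefixed record format and a plain map as the memtable; the file name is made up, and a real store would batch fsyncs, rotate logs, and replay the WAL on restart.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

// Node holds one partition's commit log and in-memory buffer.
type Node struct {
	wal      *os.File
	memtable map[string][]byte // flushed to an SSTable when full
}

func (n *Node) Write(key, value []byte) error {
	// 1. Durability: append a length-prefixed record to the WAL.
	var hdr [8]byte
	binary.BigEndian.PutUint32(hdr[:4], uint32(len(key)))
	binary.BigEndian.PutUint32(hdr[4:], uint32(len(value)))
	rec := make([]byte, 0, 8+len(key)+len(value))
	rec = append(rec, hdr[:]...)
	rec = append(rec, key...)
	rec = append(rec, value...)
	if _, err := n.wal.Write(rec); err != nil {
		return err
	}
	if err := n.wal.Sync(); err != nil { // durable before we ack
		return err
	}
	// 2. Visibility: update the memtable.
	n.memtable[string(key)] = value
	return nil
}

func main() {
	f, err := os.OpenFile("commit.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	n := &Node{wal: f, memtable: map[string][]byte{}}
	if err := n.Write([]byte("user:42"), []byte("v1")); err != nil {
		panic(err)
	}
	fmt.Println(string(n.memtable["user:42"]))
}
```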

when a read is made, we check the cache first and return from it on a hit. Otherwise, check the memtable, then use a Bloom filter to find which SSTable holds the key and return it from there.

Detailed component design

Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...

each partition has a cache for fast reads, plus a write-ahead log and SSTables for durability

Use a write-through cache to keep the cache and the persistent storage consistent.

For reads, if the requested key is not in memory, use per-SSTable Bloom filters to narrow down which SSTable contains the key, then look the key up in that SSTable and return the data.
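
A minimal Bloom filter sketch for that lookup, using FNV-based double hashing as an illustrative choice. The filter can return false positives but never false negatives, so a negative test lets us skip an SSTable entirely:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Bloom is a simple bit array probed by k derived hash functions.
type Bloom struct {
	bits []bool
	k    int // number of hash functions
}

func NewBloom(m, k int) *Bloom { return &Bloom{bits: make([]bool, m), k: k} }

// positions derives two base hashes, combined below as h1 + i*h2.
func (b *Bloom) positions(key []byte) (uint64, uint64) {
	h := fnv.New64a()
	h.Write(key)
	h1 := h.Sum64()
	h.Write([]byte{0xff})
	h2 := h.Sum64() | 1 // ensure a nonzero, odd step
	return h1, h2
}

func (b *Bloom) Add(key []byte) {
	h1, h2 := b.positions(key)
	for i := 0; i < b.k; i++ {
		b.bits[(h1+uint64(i)*h2)%uint64(len(b.bits))] = true
	}
}

func (b *Bloom) MayContain(key []byte) bool {
	h1, h2 := b.positions(key)
	for i := 0; i < b.k; i++ {
		if !b.bits[(h1+uint64(i)*h2)%uint64(len(b.bits))] {
			return false // definitely not in this SSTable
		}
	}
	return true // possibly present: go read the SSTable
}

func main() {
	bf := NewBloom(1<<16, 4)
	bf.Add([]byte("user:42"))
	fmt.Println(bf.MayContain([]byte("user:42")), bf.MayContain([]byte("user:43")))
}
```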

Trade offs/Tech choices

Explain any trade offs you have made and why you made certain tech choices...

Tunable consistency with r + w > n

strong consistency: avoids stale reads

weak consistency: stale reads are acceptable

eventual consistency: a weaker mode in which replicas may disagree temporarily but converge over time
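
A small sketch of how these knobs relate; the struct and field names are illustrative. With n replicas, a write waits for w acks and a read queries r replicas, and overlapping quorums (w + r > n) guarantee every read touches at least one replica holding the latest write:

```go
package main

import "fmt"

// Quorum holds the tunable replication settings.
type Quorum struct{ N, W, R int }

// Strong reports whether read and write quorums must overlap.
func (q Quorum) Strong() bool { return q.W+q.R > q.N }

func main() {
	strong := Quorum{N: 3, W: 2, R: 2} // 2+2 > 3: no stale reads
	fast := Quorum{N: 3, W: 1, R: 1}   // 1+1 <= 3: eventual consistency, lower latency
	fmt.Println(strong.Strong(), fast.Strong()) // true false
}
```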

Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.

With replication and failover, we can promote a new leader when the current leader goes down. A majority quorum is typically used, via a consensus algorithm like Paxos or Raft, to elect the new leader.

consistent hashing ensures minimal data movement when a server is added to or removed from the ring

If using weaker consistency, leverage anti-entropy processes or read repair to keep replicas from drifting out of sync and to minimize stale reads.
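
A read-repair sketch, assuming a single version counter per value for simplicity (real systems often use vector clocks); replicas found stale during a read are overwritten with the newest copy:

```go
package main

import "fmt"

// Versioned pairs a value with a monotonically increasing version.
type Versioned struct {
	Version uint64
	Value   []byte
}

// readRepair returns the newest copy among the replicas read, and
// writes it back to any replica that was behind (async in practice).
func readRepair(replicas []map[string]Versioned, key string) Versioned {
	var newest Versioned
	for _, r := range replicas {
		if v, ok := r[key]; ok && v.Version > newest.Version {
			newest = v
		}
	}
	for _, r := range replicas {
		if v := r[key]; v.Version < newest.Version {
			r[key] = newest // repair the stale replica
		}
	}
	return newest
}

func main() {
	a := map[string]Versioned{"k": {2, []byte("new")}}
	b := map[string]Versioned{"k": {1, []byte("old")}}
	fmt.Println(string(readRepair([]map[string]Versioned{a, b}, "k").Value))
	fmt.Println(string(b["k"].Value)) // repaired to "new"
}
```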

Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?

Handling large values: compression, streaming, chunking, indexing, etc.
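
As one possible approach, a chunking sketch that splits a large value into 1 MB pieces under derived keys, with a small manifest under the original key; the key scheme and manifest format are assumptions:

```go
package main

import "fmt"

const chunkSize = 1 << 20 // 1 MB, matching the value size limit above

// chunk splits value into fixed-size pieces keyed by "<key>/chunk/<i>",
// plus a manifest under the original key recording the chunk count.
func chunk(key, value []byte) map[string][]byte {
	parts := map[string][]byte{}
	n := 0
	for off := 0; off < len(value); off += chunkSize {
		end := off + chunkSize
		if end > len(value) {
			end = len(value)
		}
		parts[fmt.Sprintf("%s/chunk/%d", key, n)] = value[off:end]
		n++
	}
	parts[string(key)] = []byte(fmt.Sprintf("manifest:%d", n))
	return parts
}

func main() {
	parts := chunk([]byte("big"), make([]byte, 2_500_000)) // 2.5 MB value
	fmt.Println(len(parts))                                // 4: 3 chunks + manifest
}
```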


Score: 9