Design a Yelp-like Local Services Platform

Difficulty: hard

Design a platform similar to Yelp that lets users explore local establishments such as restaurants, theaters, and shopping centers. In addition, users can write reviews for these places.

Solution

System requirements

Functional:

  • Search nearby POI locations.
  • Users update POI (restaurants, etc.) details such as open hours and menus. Users submit ratings and reviews.
  • Users browse POI details.

Non-Functional:

  • Scalability.
  • Support high-throughput search-POI API requests.
  • Support high throughput for users browsing POI details.
  • Traffic from users updating POIs and submitting ratings is comparatively small.
  • Search by text name? (Open question; not addressed further here.)
  • Availability.
  • No single point of failure. Reliability.
  • Minimal downtime.
  • Low search latency.
  • Consistency.
  • Eventual consistency is sufficient; strong consistency is not required. Newly added restaurants should appear, possibly with some delay.

Capacity estimation

QPS

  • 500M locations in total. DAU 100M.
  • Read and search:
  • 5 searches per user per day: 5 * 100M / ~100K s = 5,000 QPS. Peak 50K QPS.
  • 20 browses per user per day: 20 * 100M / ~100K s = 20K QPS. Peak 200K QPS.
  • Update / write:
  • 1 update per user per day: 100M / ~100K s = 1K RPS. Peak 10K RPS.
  • Growth rate 15%.

Capacity

  • Per location: UUID 16 B, lon + lat 16 B, rating 8 B, description 200 B, open/close hours and other details ~1 KB.
  • Roughly 1.5 KB per location. 1.5 KB * 500M = 750 GB.
  • Growth rate 10%. (A back-of-envelope sketch follows below.)
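
The numbers above can be checked with a few lines; all inputs are the assumptions stated in this section, not measured values.

```python
# Back-of-envelope check of the QPS and storage estimates above.
DAU = 100_000_000
LOCATIONS = 500_000_000
SECONDS_PER_DAY = 100_000          # ~86,400, rounded for easy math

search_qps = 5 * DAU / SECONDS_PER_DAY    # 5 searches/user/day  -> 5,000 QPS
browse_qps = 20 * DAU / SECONDS_PER_DAY   # 20 browses/user/day  -> 20,000 QPS
write_rps = 1 * DAU / SECONDS_PER_DAY     # 1 update/user/day    -> 1,000 RPS

storage_gb = LOCATIONS * 1.5e3 / 1e9      # ~1.5 KB per location -> 750 GB

print(search_qps, browse_qps, write_rps, storage_gb)
# 5000.0 20000.0 1000.0 750.0   (peaks assumed at 10x the average)
```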

API design

  • GET /search?user_location=[lon,lat]&radius=X&optional_filter={type, rating, status/open/close}&max=Y&order=asc/desc
  • POST/GET/DELETE /poi?id=Z&field=X&value=Y
  • POST/GET/DELETE /review?poi=23idkfi&field=X&value=Y (rating, reviews)
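
As an illustration only (the host, parameter encoding, and response shape are assumptions, not part of the API spec above), a search call might look like:

```python
# Hypothetical example call to GET /search; values are illustrative.
import requests

resp = requests.get(
    "https://api.example.com/search",
    params={
        "user_location": "-122.4194,37.7749",  # lon,lat of the search center
        "radius": 5000,                        # meters
        "type": "restaurant",                  # optional filter
        "rating": 4,                           # minimum rating
        "status": "open",
        "max": 20,
        "order": "desc",
    },
)
print(resp.json())
# e.g. [{"location_uuid": "...", "name": "...", "rating": 4.5}, ...]
```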

Database design

Places Details DB [location_UUID, Geohash, lon, lat, name, type, rating, descriptions].

Index should be built on Geohash

  • SQL vs. NoSQL:
  • No complex queries are needed.
  • We need high read and write performance, and given the growth rate, good horizontal scalability.
  • So NoSQL is a better choice, e.g. MongoDB.
  • The NoSQL store distributes the data across shards, so each shard serves only a portion of the 200K QPS read and 10K RPS write load. MongoDB also provides built-in replication for high availability. (A schema sketch follows below.)
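
A minimal sketch of the Places Details collection in MongoDB, assuming pymongo and a local mongod; field names follow the schema above but the values are illustrative.

```python
# Places Details collection with an index on geohash for prefix/range scans.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
places = client["yelp_like"]["places"]

# The geohash index supports nearby lookups via prefix scans.
places.create_index([("geohash", ASCENDING)])

places.insert_one({
    "location_uuid": "3f2b...",     # 16-byte UUID
    "geohash": "9fy934kd12ab",      # 12-character geohash
    "lon": -122.4194,
    "lat": 37.7749,
    "name": "Example Cafe",
    "type": "restaurant",
    "rating": 4.5,
    "description": "Neighborhood coffee shop",
})
```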

High-level design

API Gateway / Reverse Proxy

  • TLS termination: decrypt incoming HTTPS requests and encrypt responses.
  • Return a 503 directly if the front-end service is slow, down, or completely unavailable.
  • Load balancing. 

Web servers 

  • Authentication & authorization
  • Rate limiting. 
  • Request validation and early return. 
  • Route requests to the proper services. 
  • Collect metrics for monitoring & alerts. 
  • Cache (TODO). 

Location services

  • Provide CRUD operations on place details: POST/GET/DELETE /poi?id=sdfildf&field=X&value=Y

Search services

  • Handle requests that let users find nearby local establishments such as restaurants, theaters, and shopping centers.

Places Details DB

  • Store the details of the places in this DB.

Index Service

  • Provide an in-memory index for fast lookup and search of nearby locations.

Index Building Service

  • Refresh the in-memory index periodically.

flowchart TD
    B[API Gateway / Reverse Proxy] --> C{Web server}
    C --> D[Location services]
    C --> E[Search services]
    D --> G[LRU Cache with TTL]
    G --> F[(Places Details DB)]
    E --> H[Index Service]
    H --> F
    II[Index Building Service] --> H
    D --> II
  

Request flows

A request first hits the API Gateway / reverse proxy (TLS termination, load balancing), then a web server (authentication, rate limiting, validation), which routes it to the Location service or the Search service. Reads of place details go through the LRU cache in front of the Places Details DB. Search requests go to the Index service, which resolves the user's location and radius to a set of geohash boxes and returns matching location UUIDs, optionally enriched from the Places Details DB. Writes go through the Location service to the DB, and the Index Building Service periodically rebuilds the in-memory index from it.

Detailed component design

Location Services (we could dive into both services, or focus on just one)

  • Request handling flow:
  • Services read from the DB. Allocate ~200 machines; at roughly 1K QPS per machine this covers the peak read load without overwhelming any single machine.
  • Services write to the DB.
  • The bottleneck is on the DB side:
  • 200K QPS read, 10K RPS write.
  • DB [UUID, lon, lat, name, type, rating, descriptions, etc.].
  • SQL vs. NoSQL:
  • No complex queries are needed.
  • We need high read and write performance, and given the growth rate, good horizontal scalability.
  • NoSQL is a better choice, e.g. MongoDB.
  • The NoSQL store distributes the data across shards, so each shard serves a portion of the 200K QPS read and 10K RPS write load. MongoDB also provides built-in replication for high availability.
  • Cache (see the cache-aside sketch after this list):
  • A good use case for caching: popular areas (NYC, SF) get lots of queries.
  • Location UUID => {details}.
  • Write-through vs. cache-aside trade-off:
  • Cache-aside: return the cached value if present; otherwise read from the DB and populate the cache with a TTL. Writes update the DB only.
  • Write operations touch only the DB, so they are fast.
  • Cached values can be stale until the TTL expires.
  • Only frequently read values end up cached.
  • Write-through: reads behave the same (return the cached value if present, otherwise read from the DB and populate the cache, with a TTL), but writes update both the DB and the cache.
  • Every write operation touches both memory and the DB.
  • Many written items are never queried, so a TTL keeps them from lingering.
  • No data-inconsistency issue between cache and DB.
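
A minimal cache-aside sketch for place details, assuming Redis as the cache layer; fetch_place_from_db and write_place_to_db are hypothetical stand-ins for the Places Details DB access path.

```python
# Cache-aside: read from cache first, fall back to the DB and populate the
# cache with a TTL; writes go to the DB only and rely on TTL expiry.
import json
import redis

cache = redis.Redis(host="localhost", port=6379)
TTL_SECONDS = 600   # illustrative TTL; tune via cache-hit-rate monitoring

def get_place(location_uuid: str) -> dict:
    key = f"place:{location_uuid}"
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)                     # may be stale until TTL expires
    place = fetch_place_from_db(location_uuid)     # hypothetical DB accessor
    cache.setex(key, TTL_SECONDS, json.dumps(place))
    return place

def update_place(location_uuid: str, fields: dict) -> None:
    write_place_to_db(location_uuid, fields)       # hypothetical DB writer
    # Cache is intentionally not updated (cache-aside); the stale entry
    # expires with its TTL or is replaced on the next read miss.

def fetch_place_from_db(location_uuid: str) -> dict:
    return {"location_uuid": location_uuid, "name": "Example Cafe", "rating": 4.5}

def write_place_to_db(location_uuid: str, fields: dict) -> None:
    pass  # placeholder for the real DB write
```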

Geohash

Project the earth onto a 2D grid and split it evenly, both vertically and horizontally, along the meridian and the equator. Each sub-grid is then split recursively and encoded as a geohash, for up to 12 levels of precision.

  • From Index services: (Lon, Lat, Radius) => List of Location_UUID. 

Steps:

  • user_location=[lon,lat] => geohash, e.g. 9fy939fjdckd. This is the center of the circle; the radius is specified by the request.
  • Based on radius=X, find the right size of geohash bounding box so that the boxes cover the circle completely. Proper sizing is the key: if the boxes are too large, we fetch and filter too many locations; if they are too small, there are too many boxes to search.
  • The usual practice is to pick the bounding box whose height and width are the smallest values still larger than the radius, then take all neighboring boxes at the same grid level. The resulting 9 boxes cover the circle. (See the precision-selection sketch after this list.)
  • Level 5 => 9fy93.
  • The surrounding boxes of that geohash give us 9 geohashes in total, e.g. 9fy94, 9fy95, ...
  • If we search the DB directly, we could run:
  • SELECT * FROM POI WHERE geohash LIKE '9fy93%', once for each of the 9 boxes, using a B+ tree index on geohash.
  • This is a range scan of the index built on top of geohash, with prefix 9fy93.
  • Then we filter the results by rating, type, and open/close status and return them.
  • Optimization #1:
  • Most likely 4-5 of the surrounding boxes have no overlap with the circle.

  • We can check this by finding the closest point of each box to the circle's center and testing whether that distance is less than the radius.
  • In most cases this filters out about half of the boxes.
  • Optimization #2:
  • The queries are independent, so we can issue all 9 (or the remaining ~5) in parallel.
  • Optimization #3: keep all indexes in memory, partitioned.
  • If we keep only [location UUID, lon, lat, geohash, type, rating] in memory, with no other details:
  • 16 B + 16 B + 8 B + 16 B + ... ≈ 60 B per entry. 60 B * 500M ≈ 30 GB; with overhead roughly 45 GB, and a few hundred GB once replicated and grown.
  • ~5,000 QPS of search traffic.
  • Both the index size and the QPS will grow beyond what one machine can handle, so we distribute the index across multiple nodes. The index also exists on disk, but hosting it in memory makes lookups very fast.
  • Index service: (lon, lat) => geohash => 9 geohashes => list of location UUIDs and other info.
  • With the list of location UUIDs, we can query the Places DB for full details if the in-memory data is not sufficient.
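
A sketch of the precision-selection step above, using approximate geohash cell dimensions (rough, near-equator figures); the table and the 9-box guarantee are the only assumptions.

```python
# Pick the deepest geohash precision whose cell is still at least `radius`
# meters in both dimensions, so the center cell plus its 8 neighbors is
# guaranteed to cover the search circle.

# Approximate cell (width, height) in meters per geohash precision level.
GEOHASH_CELL_SIZE_M = {
    4: (39_100, 19_500),
    5: (4_890, 4_890),
    6: (1_220, 610),
    7: (153, 153),
    8: (38, 19),
}

def pick_precision(radius_m: float) -> int:
    """Deepest precision whose cell covers radius_m in both width and height."""
    best = min(GEOHASH_CELL_SIZE_M)            # fall back to the coarsest level
    for level in sorted(GEOHASH_CELL_SIZE_M):
        w, h = GEOHASH_CELL_SIZE_M[level]
        if min(w, h) >= radius_m:
            best = level                       # a deeper level is still big enough
    return best

# Example: a 5 km radius maps to precision 4 (~39 km x 19 km cells),
# because level-5 cells (~4.9 km) are already smaller than the radius.
print(pick_precision(5_000))   # -> 4
```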

Partition

  • Each non-leaf node represents an area, and all of its children are close to each other. We can use the geohash as the partition key and assign one node together with its children to one machine. Eventually SF may live on one node and NYC on another.
  • Each node could look like:

        u33dbfcyeg
        ├── u33dbfcyegb (subtree omitted)
        └── u33dbfcyegc
            ├── u33dbfcyegc6 -> A
            ├── u33dbfcyegcb -> [B, E, G, H]
            ├── u33dbfcyegcc -> [C, I, L, M]
            └── u33dbfcyegcf -> D

  • This way we distribute the geospatial index among different servers.
  • How do we find the right servers? (A longest-prefix routing sketch follows this list.)
  • We could have separate routing servers. The routing structure can be a B+ tree.
  • We need to find nodes at the proper level to use as partition keys.
  • On the routing machine, each leaf node holds the IP address of the machine serving that subtree.
  • Why is grid level 4/5/6 the proper level to assign to a separate node?
  • It should be low enough that each subtree does not contain a very large number of nodes, and high enough that the data fits on a limited number of machines.
  • Assign each non-leaf node at grid level 4/5/6 to one machine.
  • [Most searches are at level 4/5/6, so a single partition can serve the whole search with no need to merge results from multiple machines. The level also needs fine-tuning so that no partition holds too many POIs and gets overwhelmed. Each partition represents a logical area, which makes searches easier.]
  • Issues:
  • Imbalance / uneven distribution: NYC has higher density.
  • When we build the B+ tree, each non-leaf node can track the number of POIs in the area its geohash maps to.
  • If the count exceeds a threshold, we use a finer level of granularity. For example, most areas stay at level 4, but NYC is divided into level-6 sub-areas spread across different machines.
  • Benefits:
  • Each search runs on a smaller scale.
  • Searches can run in parallel.
  • Each partition is small enough to keep in memory rather than on disk.
  • This also improves availability: even if one partition dies, the other partitions keep working.
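
A sketch of the routing idea, under assumed partition keys and server addresses (all values hypothetical), using longest-prefix match over the geohash partition keys.

```python
# Route a search to the partition that owns a geohash subtree.
PARTITIONS = {
    "9q8y": "10.0.1.11:9000",    # e.g. most of the Bay Area at level 4
    "dr5ru": "10.0.2.17:9000",   # dense Manhattan areas split finer, at level 5
    "dr5rv": "10.0.2.18:9000",
}

def route(geohash: str) -> str:
    """Return the index server that owns the longest matching geohash prefix."""
    best = ""
    for prefix in PARTITIONS:
        if geohash.startswith(prefix) and len(prefix) > len(best):
            best = prefix
    if not best:
        raise LookupError(f"no partition owns geohash {geohash!r}")
    return PARTITIONS[best]

print(route("9q8yyk8y"))   # -> 10.0.1.11:9000
print(route("dr5ruj4x"))   # -> 10.0.2.17:9000
```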

Replication

  • We can replicate each partition. This provides high availability (no single point of failure) and contributes to better throughput.
  • It reduces the traffic load on any one machine and the chance of hot shards.
  • It also helps balance traffic across machines. Example with RF=4 (shard Gi, replica Rj):
  • M1. G1-R1, G2-R3, G3-R2
  • M2. G1-R2, G2-R4, G3-R3
  • M3. G1-R3, G2-R1, G3-R4
  • M4. G1-R4, G2-R2, G3-R1

How writes propagate to the DB and cache

  • We use primary-secondary replication and accept eventual consistency.
  • Search services read from multiple read replicas (secondaries); the Locations DB primary takes the writes.
  • There are several options for refreshing the in-memory indexes (a hot-swap sketch follows this list):
  • Periodically (say, every half hour) fetch the data from the DB and perform a hot swap. We maintain multiple instances/replicas; the instance being updated is excluded from serving reads while it loads the latest data, so read traffic sees minimal impact.
  • Alternatively, update the in-memory index on every write. That requires a lock on the data structure being touched and interrupts read requests.
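
A sketch of the hot-swap option, assuming a simple lock-guarded index holder; the loader is a placeholder for a scan of the read replica, and all names are illustrative.

```python
# Build a fresh in-memory index, then atomically swap the reference so
# readers never observe a partially built index.
import threading
import time

class IndexHolder:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._index: dict = {}            # geohash prefix -> list of location UUIDs

    def swap(self, new_index: dict) -> None:
        with self._lock:
            self._index = new_index       # readers see either the old or the new snapshot

    def lookup(self, prefix: str) -> list:
        with self._lock:
            return list(self._index.get(prefix, []))

def load_index_from_replica() -> dict:
    # Placeholder: in reality this scans a read replica of the Places Details DB.
    return {"9fy93": ["uuid-1", "uuid-2"]}

holder = IndexHolder()
holder.swap(load_index_from_replica())    # initial load

def refresher(period_s: int = 1800) -> None:
    while True:                           # refresh every half hour
        time.sleep(period_s)
        holder.swap(load_index_from_replica())

threading.Thread(target=refresher, daemon=True).start()
print(holder.lookup("9fy93"))             # -> ['uuid-1', 'uuid-2']
```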

Cache.

  • There are different layers of cache (see the query-level cache sketch after this list):
  • Query-level: a popular-search cache in Redis, [geohash, rating, type => search results]. If we already have the geohash and the rating/type filters match, we can return the cached results directly or filter on top of them. This can serve stale data, so cached items should have a TTL.
  • The Places DB can have a cache in front of it: hot places are cached in Redis as [location UUID => place details], evicted via LRU plus a TTL.
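
A sketch of the query-level cache, assuming a local Redis and a hypothetical run_search standing in for the real search path (index service plus Places Details DB).

```python
# Cache search results keyed by the geohash and filter parameters, with a TTL.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def cached_search(geohash: str, min_rating: float, poi_type: str, ttl_s: int = 300):
    key = f"search:{geohash}:{poi_type}:{min_rating}"
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)            # possibly stale until the TTL expires
    results = run_search(geohash, min_rating, poi_type)
    r.setex(key, ttl_s, json.dumps(results))
    return results

def run_search(geohash, min_rating, poi_type):
    # Placeholder for the real search path.
    return [{"location_uuid": "uuid-1", "name": "Example Cafe", "rating": 4.5}]

print(cached_search("9fy93", 4.0, "restaurant"))
```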

Data Consistency

  • The DB uses primary-secondary replicas, so there can be replication lag and a secondary may serve stale data.
  • The query-level cache used by the search services can serve stale data.
  • The LRU cache in front of the Places DB can serve stale data.
  • The index service returns data directly from its in-memory index, which can be stale until the next refresh.

Trade offs/Tech choices

The main trade-offs made above: NoSQL (MongoDB) over SQL for horizontal scalability, since no complex queries are needed; cache-aside over write-through for place details, accepting some staleness in exchange for fast writes; eventual consistency over strong consistency, so newly added or updated POIs may appear with some delay; and a periodic hot-swap index refresh instead of per-write index updates, trading freshness for uninterrupted reads.

Failure scenarios/bottlenecks

  • Some DB nodes fail: primary-secondary replication helps with recovery.
  • The index service is also replicated and partitioned; if some machines go down, other replicas can keep serving.
  • The Index Building Service performs the periodic hot swap of the index. It can also maintain heartbeats with the primary replica; if the primary goes down, a secondary can be promoted to primary.

Future improvements

Monitoring system

  • We should monitor the system components. We can collect counter metrics across all components, such as cache hit rate, and track the diff at each hot swap. This helps fine-tune how often the in-memory index is refreshed.

Alerting

  • Set CPU/memory thresholds on critical components to detect hot partitions/machines.


Score: 9