Difficulty: hard
Solution
System requirements
Functional:
The web crawler should be able to download the HTML pages of crawled websites
The web crawler should be able to store the downloaded HTML pages in appropriate storage
The web crawler should not crawl malicious or blacklisted websites
The web crawler should be polite when crawling websites so as to avoid overwhelming them with traffic
The web crawler should be able to prioritize URLs to crawl effectively
The web crawler should keep a record of previously crawled URLs so as to avoid crawling the same URL twice
The web crawler should have a load balancer to distribute crawling jobs effectively among the application servers
The web crawler should have cache servers that cache robots.txt files so they do not have to be downloaded on every visit
Non-Functional:
The web crawler should be highly available in order to crawl enough web pages per unit time
The web crawler should be able to re-crawl web pages periodically in order to keep the stored content up to date
The web crawler should avoid saving duplicate information in order to save database space. This also saves the system resources that would otherwise be spent re-crawling the same pages.
Capacity estimation
The web crawler should crawl a total of 10 billion (10,000,000,000) URLs. Of these 10 billion URLs, 1 billion should be unique.
The size of each crawled HTML page is approximated at 2 MB. Hence the total size of all crawled web pages is 10,000,000,000 × 2 MB = 20,000,000,000 MB, or 20 PB.
The checksum value for each downloaded HTML page will be approximately 8 bytes, i.e. 80,000,000,000 bytes or 80 GB in total.
Assuming that all 10 billion URLs should be crawled within 6 months (~180 days), the required crawl rate is:
10,000,000,000 URLs / (180 days × 86,400 sec) ≈
643 URLs per second.
The total size of the HTML pages downloaded per second is
643 × 2 MB ≈ 1,286 MB, hence the bandwidth for downloading HTML pages should be at least ~1.3 GB/second.
For each crawled URL we will save the page title, the page description and the time the URL was last crawled. We assume an average title size of 100 bytes, an average description size of 500 bytes, and 10 bytes for the last-crawled timestamp. For all URLs the total amount of metadata saved is approximately:
10,000,000,000 × (500 + 100 + 10) bytes = 6.1 × 10^12 bytes ≈ 6.1 TB
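The arithmetic above can be sanity-checked with a short script. All figures are the assumed estimates from this section (10 billion URLs, 2 MB pages, a 6-month window taken as ~180 days):

```python
# Back-of-the-envelope capacity estimates (assumed figures, not measurements).
TOTAL_URLS = 10_000_000_000          # 10 billion URLs to crawl
PAGE_SIZE_MB = 2                     # average HTML page size
CRAWL_WINDOW_DAYS = 180              # ~6 months
SECONDS_PER_DAY = 86_400

urls_per_second = TOTAL_URLS / (CRAWL_WINDOW_DAYS * SECONDS_PER_DAY)
bandwidth_mb_per_s = urls_per_second * PAGE_SIZE_MB
total_storage_pb = TOTAL_URLS * PAGE_SIZE_MB / 1_000_000_000     # MB -> PB

metadata_bytes = TOTAL_URLS * (500 + 100 + 10)                   # description + title + timestamp
metadata_tb = metadata_bytes / 1_000_000_000_000                 # bytes -> TB

print(f"{urls_per_second:.0f} URLs/s, {bandwidth_mb_per_s:.0f} MB/s, "
      f"{total_storage_pb:.0f} PB of pages, {metadata_tb:.1f} TB of metadata")
```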
API design
download_webpage(
    method = GET,
    params = { url: url_to_be_downloaded }
)
parse_webcontent(
    method = POST,
    body = { content: downloaded_html_page }
)
url_filter(
    method = POST,
    body = { urls: found_urls }
)
add_to_frontier(
    method = POST,
    body = { urls: urls_to_add_to_frontier }
)
check_if_already_crawled(
    method = POST,
    body = { urls: filtered_urls }
)
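These endpoints can be wired together in a minimal in-memory sketch. The function names come from the API above; the frontier list, crawled-hash set, blacklist and example URLs are illustrative stand-ins for real queues and databases:

```python
import hashlib

# Illustrative pipeline: filter found URLs, drop already-crawled ones,
# and enqueue the rest. All state here is in-memory for demonstration only.
frontier: list[str] = []
crawled_hashes: set[str] = set()
blacklist = {"http://malicious.example"}

def url_filter(urls):
    """POST body {urls: found_urls}: drop blacklisted URLs."""
    return [u for u in urls if u not in blacklist]

def check_if_already_crawled(urls):
    """POST body {urls: filtered_urls}: drop URLs whose hash we have seen."""
    return [u for u in urls
            if hashlib.sha256(u.encode()).hexdigest() not in crawled_hashes]

def add_to_frontier(urls):
    """POST body {urls: urls_to_add_to_frontier}: enqueue for crawling."""
    frontier.extend(urls)

def mark_crawled(url):
    crawled_hashes.add(hashlib.sha256(url.encode()).hexdigest())

found = ["http://a.example", "http://malicious.example", "http://b.example"]
mark_crawled("http://b.example")
add_to_frontier(check_if_already_crawled(url_filter(found)))
print(frontier)  # only http://a.example survives both filters
```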
Database design
For storing crawled URL metadata, we will use PostgreSQL, a relational database. The metadata will be the URL's hash value, page title, page description and the time the page was last crawled.
The hash value will be used to map URLs to the various database servers. URLs are hashed with SHA-256, and a consistent hashing scheme (with conflict resolution) assigns hash values to servers so that the mapping stays stable as servers are added or removed. SHA-256 produces a 256-bit hash (64 hexadecimal characters), which makes collisions extremely unlikely. We will only store the first 10 hex characters, which still gives 16^10 ≈ 1.1 trillion combinations.
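A sketch of the hash-prefix mapping, with an assumed server count of 4. Note that the plain modulo used here for brevity is not true consistent hashing; a production system would use a hash ring so that adding a server remaps only a small fraction of keys:

```python
import hashlib

NUM_DB_SERVERS = 4  # assumed cluster size for illustration

def url_shard(url: str) -> int:
    """Map a URL to a database server index via its SHA-256 prefix."""
    digest = hashlib.sha256(url.encode()).hexdigest()  # 256 bits = 64 hex chars
    prefix = digest[:10]                               # 16**10 ~ 1.1e12 possible keys
    return int(prefix, 16) % NUM_DB_SERVERS            # simplification of consistent hashing
```

The same URL always hashes to the same server, so lookups and writes for a given URL are routed consistently.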
We will also replicate each database server once to ensure that there is a failover slave database for each server in case the master database server fails. The master database will also take periodic snapshots of the database state, which allows the server to recover easily from failures.
For storing the parsed HTML content, images and other MIME types we will use Amazon S3 buckets as our blob storage. Each URL row in the Postgres database will link to the files associated with it in S3.
High-level design
URL-Frontier
The URL frontier contains the list of all URLs to be crawled
URL Prioritizer
The URL prioritizer arranges the URLs in queues according to their priority. Queues with high priority are polled at a higher rate than queues with low priority. This ensures that top sites are crawled more often than less popular sites.
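One simple way to poll high-priority queues more often is weighted round-robin over the queues. The queue names, weights and URLs below are assumed example values:

```python
from collections import deque
from itertools import cycle

# Priority-weighted queue selection: a queue with weight 3 is polled
# three times as often as a weight-1 queue.
queues = {
    "high":   deque(["https://top-site.example/a", "https://top-site.example/b"]),
    "medium": deque(["https://mid-site.example/a"]),
    "low":    deque(["https://rare-site.example/a"]),
}
weights = {"high": 3, "medium": 2, "low": 1}

# Expand each queue name by its weight, then round-robin over the expansion.
schedule = cycle([name for name, w in weights.items() for _ in range(w)])

def next_url():
    for _ in range(sum(weights.values())):   # scan at most one full cycle
        q = queues[next(schedule)]
        if q:
            return q.popleft()
    return None                              # all queues are empty

order = [next_url() for _ in range(5)]
```

With these weights the two high-priority URLs come out first, then medium, then low.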
Load Balancer
The load balancer is responsible for distributing traffic among the worker servers; each application server instance handles its own queue according to URL priority.
Cache Servers
Cache servers store recently downloaded robots.txt files so that they are not re-downloaded every time a site is crawled. This increases the number of URLs crawled per second.
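A minimal TTL-cache sketch for robots.txt. The TTL value is an assumption, and the fetch function and clock are injected so the caching logic can be shown without any network calls:

```python
import time

class RobotsCache:
    """Cache robots.txt per host, refetching only after the TTL expires."""

    def __init__(self, fetch, ttl_seconds=3600, clock=time.time):
        self._fetch = fetch          # callable: host -> robots.txt text
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}           # host -> (fetched_at, text)

    def get(self, host):
        now = self._clock()
        entry = self._entries.get(host)
        if entry is None or now - entry[0] > self._ttl:
            self._entries[host] = (now, self._fetch(host))   # (re)fetch on miss/expiry
        return self._entries[host][1]

calls = []
cache = RobotsCache(fetch=lambda host: calls.append(host) or "User-agent: *")
cache.get("example.com")
cache.get("example.com")   # second lookup is served from the cache
```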
HTML Downloader
The HTML downloader is responsible for downloading the HTML pages of the URLs being crawled.
Content Parser
The content parser is responsible for converting the downloaded HTML pages into the appropriate format for further processing.
URL Filter
The URL filter extracts URLs from the parsed content and filters them based on criteria such as whether they are blacklisted or have been crawled before. URLs that are neither blacklisted nor previously crawled are then added to the frontier.
Database
The metadata for each crawled URL is stored in the Postgres database, while MIME-type assets such as images are uploaded to blob storage (S3 buckets). A reference to each uploaded asset is then stored in Postgres together with the other URL metadata and page information.
```mermaid
flowchart TD
    B[client] --> C{server}
    C --> D[Database]
```
Request flows
The URL frontier issues a DNS query for the URL's host, and the DNS server returns the IP address of the URL's web server.
The crawler then sends a GET request to the web server at the obtained IP address over HTTP. If the web server is online, it returns the web page as an HTML response. The crawler downloads the HTML content, which is passed to the content parser and converted into the appropriate format, after which URLs are extracted from the parsed content.
The extracted metadata together with the crawled URL is then sent to the database server via a POST request for storage.
MIME-type assets found in the parsed HTML page are uploaded to Amazon S3 buckets via POST requests.
Detailed component design
URL Frontier:
The URL frontier contains millions of URLs to be crawled. These URLs have different priorities and are passed to a URL prioritizer, which places them in queues according to their priority. The frontier stores the majority of these URLs on disk, keeping a smaller but significant portion in memory; URLs are loaded from disk periodically as crawling progresses. The frontier then issues a DNS query, and the DNS server responds with the IP address of the web server for the URL to be crawled. This IP address is passed to the crawler.
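The memory/disk split can be sketched as a small in-memory buffer refilled in batches from a much larger on-disk store. The buffer size and URLs are assumed values, and a plain list stands in for the disk-backed queue:

```python
from collections import deque

class URLFrontier:
    """In-memory buffer of URLs, refilled in batches from a disk-backed store."""

    def __init__(self, disk_store, buffer_size=1000):
        self._disk = deque(disk_store)       # stand-in for an on-disk queue
        self._buffer = deque()
        self._buffer_size = buffer_size

    def _refill(self):
        while self._disk and len(self._buffer) < self._buffer_size:
            self._buffer.append(self._disk.popleft())

    def next_url(self):
        if not self._buffer:
            self._refill()                   # batch-load from "disk" when empty
        return self._buffer.popleft() if self._buffer else None

frontier = URLFrontier([f"https://site{i}.example/" for i in range(5)], buffer_size=2)
first = frontier.next_url()   # triggers a refill of 2 URLs from the disk store
```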
Load Balancers:
Since thousands of URLs need to be crawled each second, crawl jobs must be distributed among several application servers, and the load balancers are responsible for this distribution. An algorithm such as weighted round-robin distributes traffic among the workers depending on their workload and server status. Load balancers ensure that if a server fails, its workload is transferred to another server, keeping the system in operation with no lost jobs.
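A sketch of weighted round-robin with failover. The worker names, weights and the health map are assumed example values:

```python
from itertools import cycle

# Weighted round-robin: worker-1 (weight 3) receives three jobs for every
# one that worker-2 (weight 1) receives; unhealthy workers are skipped.
servers = {"worker-1": 3, "worker-2": 1}
healthy = {"worker-1": True, "worker-2": True}
rotation = cycle([name for name, w in servers.items() for _ in range(w)])

def assign(job):
    for _ in range(sum(servers.values())):   # scan at most one full rotation
        server = next(rotation)
        if healthy[server]:
            return server, job
    raise RuntimeError("no healthy workers available")

assignments = [assign(f"crawl-{i}")[0] for i in range(4)]
healthy["worker-2"] = False            # simulate a worker failure
fallback = assign("crawl-4")[0]        # the job is rerouted to a healthy worker
```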
Trade offs/Tech choices
Some pages with similar content will be skipped to avoid duplication, which would otherwise consume database storage for identical or near-identical content.
Some URLs that appear malicious or are blacklisted will not be crawled, even though they may contain valuable content.
URLs with higher priority will be crawled more often than URLs with lower priority.
Sites with no robots.txt file will have all their pages crawled.
Failure scenarios/bottlenecks
Multiple database servers are used to make reads and writes of data faster.
Backup slave databases act as failover mechanisms in case the master databases fail.
Load balancers distribute traffic and workload among several application servers. If an application server goes down, its workload is redistributed among the remaining servers.
Cache servers cache robots.txt files, allowing sites to be crawled without waiting for robots.txt to be re-downloaded each time.
Future improvements
I would focus on politeness when crawling web pages in order to ensure that web servers are not overwhelmed by requests. I would delay successive requests to the same host by at least 20 seconds to keep response times reasonable.
I would also increase the number of servers so that URLs can be crawled more frequently.
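The per-host politeness delay could be sketched as follows. The 20-second figure comes from the text above; the clock is injectable so the logic can be demonstrated without actually sleeping:

```python
import time

class PolitenessLimiter:
    """Track the last request time per host and report how long to wait."""

    def __init__(self, min_delay_seconds=20.0, clock=time.time):
        self._min_delay = min_delay_seconds
        self._clock = clock
        self._last_hit = {}            # host -> timestamp of last allowed request

    def wait_time(self, host):
        """Seconds the crawler must still wait before hitting this host again."""
        now = self._clock()
        last = self._last_hit.get(host)
        remaining = 0.0 if last is None else max(0.0, self._min_delay - (now - last))
        if remaining == 0.0:
            self._last_hit[host] = now   # request allowed: record the hit
        return remaining

fake_now = [0.0]
limiter = PolitenessLimiter(min_delay_seconds=20.0, clock=lambda: fake_now[0])
limiter.wait_time("example.com")           # first hit: no wait required
fake_now[0] = 5.0
delay = limiter.wait_time("example.com")   # only 5 s elapsed, so more waiting is needed
```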
Score: 9