Design a Web Crawler

Difficulty: hard

Design a web crawler that systematically browses and downloads content from across the World Wide Web.

Solution

System requirements

Functional:

The web crawler should be able to download the HTML pages of crawled websites.

The web crawler should store the downloaded HTML pages in appropriate storage.

The web crawler should not crawl malicious or blacklisted websites.

The web crawler should be polite when crawling websites so as to avoid overwhelming them with traffic.

The web crawler should be able to prioritize URLs to crawl effectively.

The web crawler should keep a record of previously crawled URLs so as to avoid crawling the same URL twice.

The web crawler should have a load balancer to distribute crawling jobs among the application servers effectively.

The web crawler should have cache servers that cache robots.txt files so they do not have to be downloaded on every request.

Non-Functional:

The web crawler should be highly available in order to crawl enough web pages per unit time.

The web crawler should be able to re-crawl web pages periodically in order to keep the stored content up to date.

The web crawler should avoid saving duplicate information in order to save database space. This also saves the system resources that would otherwise be spent re-crawling the same content.

Capacity estimation

The web crawler should crawl a total of 10 billion (10,000,000,000) URLs. Of these, 1 billion URLs are unique.

The size of each crawled HTML page is approximated to be 2 MB. Hence the total size of all crawled web pages is 20,000,000,000 MB, or 20 PB.

The checksum value for each downloaded HTML page will be approximately 8 bytes, giving 80,000,000,000 bytes, or 80 GB, of checksums in total.

Assuming that all 10 billion URLs should be crawled within roughly 3 months (90 days), the required crawl rate is:

10,000,000,000 URLs / (90 days * 86,400 sec) ≈ 1,286 URLs per second.

The total size of the HTML pages downloaded per second is therefore:

1,286 * 2 MB ≈ 2,572 MB, so the bandwidth for downloading HTML pages should be at least about 2.6 GB per second.

For each crawled URL we will save the page title, the page description and the time the URL was last crawled. Assuming an average title size of 100 bytes, an average description size of 500 bytes and 10 bytes for the last-crawled timestamp, the total amount of metadata saved for all URLs is approximately:

10,000,000,000 * (500 + 100 + 10) bytes ≈ 6.1 TB (about 610 GB if metadata is kept only for the 1 billion unique URLs).
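
The estimates above can be reproduced with a short script (a sketch; the 90-day window, 2 MB page size and metadata field sizes are the assumptions stated above):

```python
# Back-of-envelope capacity estimation for the crawler (sketch).
TOTAL_URLS = 10_000_000_000        # 10 billion URLs crawled in total
PAGE_SIZE_MB = 2                   # assumed average HTML page size
CHECKSUM_BYTES = 8                 # checksum stored per downloaded page
CRAWL_WINDOW_DAYS = 90             # assumed 3-month crawl window

seconds = CRAWL_WINDOW_DAYS * 86_400
urls_per_second = TOTAL_URLS / seconds               # ~1,286 URLs/s
bandwidth_mb_per_s = urls_per_second * PAGE_SIZE_MB  # ~2,572 MB/s

page_storage_mb = TOTAL_URLS * PAGE_SIZE_MB          # 20,000,000,000 MB = 20 PB
checksum_gb = TOTAL_URLS * CHECKSUM_BYTES / 1e9      # 80 GB

# Metadata per URL: title (100 B) + description (500 B) + last-crawl time (10 B).
metadata_tb = TOTAL_URLS * (100 + 500 + 10) / 1e12   # ~6.1 TB

print(f"{urls_per_second:,.0f} URLs/s, {bandwidth_mb_per_s:,.0f} MB/s download bandwidth")
print(f"{page_storage_mb:,.0f} MB of pages, {checksum_gb:.0f} GB of checksums, {metadata_tb:.1f} TB of metadata")
```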

API design

download_webpage(
    method = GET,
    body = { url: url_to_be_downloaded }
)

parse_webcontent(
    method = POST,
    body = { content: downloaded_html_page }
)

url_filter(
    method = POST,
    body = { urls: found_urls }
)

add_to_frontier(
    method = POST,
    body = { urls: urls_to_add_to_frontier }
)

check_if_already_crawled(
    method = POST,
    body = { urls: filtered_urls }
)
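
As a rough illustration of how these internal endpoints might be exposed, here is a minimal sketch using Flask (the framework choice, route paths and placeholder responses are assumptions; the GET endpoint takes the URL as a query parameter rather than a body):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.get("/download_webpage")
def download_webpage():
    # GET with the target URL as a query parameter; the downloader fetches the raw HTML.
    url = request.args.get("url")
    return jsonify({"url": url, "html": "<html>...</html>"})  # placeholder response

@app.post("/parse_webcontent")
def parse_webcontent():
    # POST the downloaded HTML; the parser returns normalized content and extracted links.
    content = request.get_json()["content"]
    return jsonify({"text": "", "found_urls": []})            # placeholder response

@app.post("/url_filter")
def url_filter():
    # POST the extracted URLs; blacklisted or malformed URLs are dropped.
    urls = request.get_json()["urls"]
    return jsonify({"filtered_urls": urls})

@app.post("/check_if_already_crawled")
def check_if_already_crawled():
    # POST the filtered URLs; only URLs not seen before are returned.
    urls = request.get_json()["urls"]
    return jsonify({"new_urls": urls})

@app.post("/add_to_frontier")
def add_to_frontier():
    # POST new URLs to be queued in the URL frontier.
    urls = request.get_json()["urls"]
    return jsonify({"queued": len(urls)})
```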

Database design

For storing crawled URL metadata we will use PostgreSQL, a relational database. The metadata consists of the URL's hash value, the page title, the page description and the time the page was last crawled.

The hash value will be used to map each URL to a database server. URLs are hashed with SHA-256, and consistent hashing (with conflict resolution) is used so that the URL-to-server mapping remains stable as servers are added or removed. SHA-256 produces a 256-bit digest (64 hexadecimal characters), which makes collisions extremely unlikely; we will store only the first 10 hexadecimal characters, which still gives roughly 16^10 ≈ 1.1 trillion combinations.
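
A minimal sketch of mapping a URL to a database server from the first 10 hexadecimal characters of its SHA-256 digest (the shard count is an assumption, and a simple modulo mapping is shown for brevity; a production system would use a consistent-hashing ring so that adding a server does not remap most keys):

```python
import hashlib

NUM_DB_SERVERS = 16  # assumed number of PostgreSQL servers

def url_hash_key(url: str) -> str:
    # SHA-256 yields a 64-hex-character digest; the first 10 characters
    # (~1.1 trillion combinations) are stored as the URL's hash key.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:10]

def db_server_for(url: str) -> int:
    # Map the hex prefix onto one of the database servers.
    return int(url_hash_key(url), 16) % NUM_DB_SERVERS

print(url_hash_key("https://example.com"), db_server_for("https://example.com"))
```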

We will also replicate each database server once so that there is a failover slave database for each master in case the master database server fails. The master database will also take periodic snapshots of its state, which allows the database to recover easily from failures.

For storing the parsed HTML content, images and other MIME types we will use Amazon S3 buckets as our blob storage. Each URL record in the Postgres database will hold links to the files associated with it in S3.

High-level design

URL Frontier

The URL frontier contains the list of all URLs to be crawled.

URL Prioritizer

The URL prioritizer arranges the URLs in queues according to their priority. Queues with high priority are selected at a higher rate than queues with low priority, which makes sure that top sites are crawled more often than less popular sites (see the sketch below).
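
A possible sketch of the weighted queue selection described above (the priority levels, weights and example URLs are assumptions):

```python
import random
from collections import deque

# Priority queues; a higher weight means the queue is selected more often.
queues = {
    "high":   {"weight": 5, "urls": deque()},
    "medium": {"weight": 3, "urls": deque()},
    "low":    {"weight": 1, "urls": deque()},
}

def enqueue(url: str, priority: str) -> None:
    queues[priority]["urls"].append(url)

def next_url():
    # Pick a non-empty queue with probability proportional to its weight,
    # so top sites are crawled more often than less popular ones.
    candidates = [q for q in queues.values() if q["urls"]]
    if not candidates:
        return None
    chosen = random.choices(candidates, weights=[q["weight"] for q in candidates], k=1)[0]
    return chosen["urls"].popleft()

enqueue("https://example.com", "high")
enqueue("https://example.org/blog", "low")
print(next_url())
```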

Load Balancer

The load balancer is responsible for distributing traffic among the worker servers; each application server instance handles its own queue depending on the URL priority.

Cache Servers

Cache servers store recently downloaded robots.txt files so that they do not have to be downloaded again each time a page on the same site is crawled. This increases the number of URLs that can be crawled per second.
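
A minimal sketch of a per-host robots.txt cache, assuming a 24-hour TTL, an in-process dictionary standing in for the dedicated cache servers, and a hypothetical "MyCrawler" user agent:

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

ROBOTS_TTL_SECONDS = 24 * 3600   # assumed cache lifetime for a robots.txt entry
_robots_cache = {}               # host -> (fetch_time, parsed robots.txt)

def can_fetch(url: str, user_agent: str = "MyCrawler") -> bool:
    # Keep one parsed robots.txt per host so it is not re-downloaded
    # for every URL crawled on that site.
    host = urlparse(url).netloc
    cached = _robots_cache.get(host)
    if cached is None or time.time() - cached[0] > ROBOTS_TTL_SECONDS:
        parser = urllib.robotparser.RobotFileParser()
        parser.set_url(f"https://{host}/robots.txt")
        parser.read()                               # network fetch only on a cache miss
        cached = (time.time(), parser)
        _robots_cache[host] = cached
    return cached[1].can_fetch(user_agent, url)
```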

HTML Downloader

The HTML downloader is responsible for downloading the HTML pages of the URLs being crawled.

Content Parser

The content parser is responsible for converting the downloaded HTML pages into the appropriate format.

URL Filter

The URL filter extracts URLs from the parsed content and filters them based on criteria such as whether they have been blacklisted or crawled before. URLs that have not been crawled before and are not blacklisted are added to the frontier.
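
A simple sketch of the URL filter, assuming a hard-coded example blacklist and an in-memory set standing in for the shared record of crawled URLs (a Bloom filter or shared store would be used at scale):

```python
from urllib.parse import urlparse

BLACKLISTED_HOSTS = {"malware.example", "spam.example"}  # assumed example blacklist
crawled_urls = set()   # stand-in for the shared record of crawled URLs

def filter_urls(found_urls):
    # Keep only http(s) URLs that are neither blacklisted nor already crawled.
    accepted = []
    for url in found_urls:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https"):
            continue
        if parts.netloc in BLACKLISTED_HOSTS:
            continue
        if url in crawled_urls:
            continue
        accepted.append(url)
    return accepted

print(filter_urls(["https://example.com/a", "ftp://example.com/b", "https://spam.example/c"]))
```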

Database

The metadata for each crawled URL is stored in the Postgres database, while assets of other MIME types (e.g. images) are uploaded to blob storage (S3 buckets). A reference to each uploaded asset is stored in Postgres together with the rest of the URL metadata and page information.

```mermaid
flowchart TD
    B[client] --> C{server}
    C --> D[Database]
```

Request flows

The URL frontier sends a lookup request to the DNS server, and the DNS server returns the IP address of the URL's web server.

The crawler then sends a GET request over HTTP to the web server at the obtained IP address. If the web server is online, it returns the web page as HTML content. The crawler downloads the HTML content, which is passed to the content parser and converted into the appropriate format, after which URLs are extracted from the parsed content.

The extracted metadata, together with the crawled URL, is then sent to the database server (via a POST request) for storage.

Assets of other MIME types found in the parsed HTML page are then uploaded to Amazon S3 buckets via a POST request.
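
The request flow above could be sketched roughly as follows (a simplified, single-threaded illustration with a hypothetical crawl_once function; robots.txt checks, politeness delays and error handling are omitted):

```python
import socket
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attributes of anchor tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl_once(url):
    # 1. Resolve the host name to an IP address (DNS lookup). The result is
    #    shown for illustration; urlopen performs its own resolution below.
    ip_address = socket.gethostbyname(urlparse(url).hostname)
    # 2. Send an HTTP GET request and download the HTML body.
    with urlopen(url, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    # 3. Parse the HTML and extract outgoing links for the URL filter.
    extractor = LinkExtractor(url)
    extractor.feed(html)
    return ip_address, html, extractor.links
```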

Detailed component design

URL Frontier:

The URL frontier contains millions of URLs to be crawled. These URLs have different priorities and are passed to the URL prioritizer, where they are placed in queues according to their priority. The URL frontier stores the majority of these URLs on disk, with a smaller but significant portion kept in memory; URLs are loaded from disk periodically as crawling progresses (see the sketch below). The URL frontier then sends a lookup request to the DNS server, which responds with the IP address of the web server for the URL to be crawled. This IP address is then passed to the crawler for crawling.
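
A simplified sketch of a frontier that keeps a bounded number of URLs in memory and spills the rest to disk (the file-based backlog, file name and memory limit are assumptions; a real frontier shards this across many queues and hosts):

```python
import json
import os
from collections import deque

class URLFrontier:
    """Keeps a bounded number of URLs in memory and spills the rest to a disk file."""

    def __init__(self, spill_path="frontier_spill.jsonl", memory_limit=10_000):
        self.memory_limit = memory_limit
        self.spill_path = spill_path
        self.in_memory = deque()

    def add(self, url):
        if len(self.in_memory) < self.memory_limit:
            self.in_memory.append(url)
        else:
            # Memory is full: append the URL to the on-disk backlog.
            with open(self.spill_path, "a", encoding="utf-8") as f:
                f.write(json.dumps(url) + "\n")

    def next_url(self):
        if not self.in_memory:
            self._refill_from_disk()
        return self.in_memory.popleft() if self.in_memory else None

    def _refill_from_disk(self):
        # Load a batch of spilled URLs back into memory and keep the rest on disk.
        if not os.path.exists(self.spill_path):
            return
        with open(self.spill_path, encoding="utf-8") as f:
            lines = f.readlines()
        batch, rest = lines[:self.memory_limit], lines[self.memory_limit:]
        self.in_memory.extend(json.loads(line) for line in batch)
        with open(self.spill_path, "w", encoding="utf-8") as f:
            f.writelines(rest)
```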

Load Balancers:

Since thousands of URLs need to be crawled each second, the crawling jobs have to be distributed among several application servers, and the load balancers are responsible for this distribution. A scheduling algorithm such as weighted round robin assigns jobs to workers depending on their workload and server status. Load balancers also ensure that, in case a server fails, its workload is transferred to another server, so the system stays in operation and no jobs are lost.
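
A minimal sketch of weighted round robin job assignment (the worker names and weights are assumptions):

```python
from itertools import cycle

# Assumed worker pool: (server name, weight); a higher weight receives more jobs.
WORKERS = [("worker-1", 3), ("worker-2", 2), ("worker-3", 1)]

def weighted_round_robin(workers):
    # Expand each worker according to its weight and cycle through the list,
    # so worker-1 gets 3 of every 6 jobs, worker-2 gets 2 and worker-3 gets 1.
    expanded = [name for name, weight in workers for _ in range(weight)]
    return cycle(expanded)

assignments = weighted_round_robin(WORKERS)
for job_id in range(6):
    print(job_id, next(assignments))
```

If a worker is removed from the pool after a failure, rebuilding the cycle redistributes its share of jobs among the remaining servers.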

Trade offs/Tech choices

Some pages with similar content will be skipped so as to avoid heavy duplication, which would otherwise fill database storage with identical or near-identical content.

Some URLs that appear malicious or are blacklisted will not be crawled, even though they may contain valuable content.

URLs with higher priority will be crawled more often than URLs with lower priority.

For sites without a robots.txt file, all pages will be crawled.

Failure scenarios/bottlenecks

Multiple database servers are used to facilitate faster reads and writes.

Backup slave databases are used as a failover mechanism in case a master database fails.

Load balancers distribute traffic and workload among several application servers. In case an application server goes down, its workload is redistributed among the remaining servers or transferred to another server.

Cache servers are used to cache robots.txt files, allowing faster crawling of a site without waiting for its robots.txt file to be downloaded again for instructions.

Future improvements

I would focus on politeness when crawling web pages in order to ensure that web servers are not overwhelmed by the number of requests. I would delay consecutive requests to the same host by at least 20 seconds to keep response times appropriate (see the sketch below).
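
A sketch of such a per-host politeness delay (the 20-second value mirrors the figure above; the in-process dictionary is a stand-in for shared state across crawler workers):

```python
import time
from urllib.parse import urlparse

CRAWL_DELAY_SECONDS = 20.0   # assumed minimum gap between requests to the same host
_last_request_time = {}      # host -> timestamp of the most recent request

def wait_for_politeness(url):
    # Sleep until at least CRAWL_DELAY_SECONDS have passed since the last
    # request to this host, so a single site is never flooded with traffic.
    host = urlparse(url).netloc
    last = _last_request_time.get(host)
    if last is not None:
        elapsed = time.time() - last
        if elapsed < CRAWL_DELAY_SECONDS:
            time.sleep(CRAWL_DELAY_SECONDS - elapsed)
    _last_request_time[host] = time.time()
```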

I would also increase the number of crawler servers so that URLs can be crawled more frequently.


Score: 9