多线程网页爬虫

难度: Medium

Submission

运行时间: 206 ms

内存: 0.0 MB

# """
# This is HtmlParser's API interface.
# You should not implement it, or speculate about its implementation
# """
#class HtmlParser(object):
#    def getUrls(self, url):
#        """
#        :type url: str
#        :rtype List[str]
#        """
from queue import Queue
from concurrent.futures import ThreadPoolExecutor
import threading
import concurrent

def get_hostname(url):
    # url = url[len('http://'):]
    # hostname = f'http://{url.split("/")[0]}'
    # return hostname
    return url.split('/')[2]


class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        hostname = get_hostname(startUrl)

        seen = set()
        seen.add(startUrl)

        to_visit = [startUrl]
        res = [startUrl]

        while(to_visit):
            futures = []
            # Explore candidates of ALL urls in the queue, with parallelism.
            with ThreadPoolExecutor(max_workers=16) as executor:
                for url in to_visit:
                    futures.append(executor.submit(htmlParser.getUrls, url))
            
            to_visit = []
            for future in concurrent.futures.as_completed(futures):
                for c in future.result():
                    if c not in seen and get_hostname(c) == hostname:
                        seen.add(c)
                        res.append(c)
                        to_visit.append(c)
        return res

Explain

该题解采用了多线程来加速网页爬虫的过程。首先，定义了一个帮助函数 `get_hostname` 用来从 URL 中提取主机名。在主函数 `crawl` 中，首先使用 `get_hostname` 得到初始 URL 的主机名。然后，使用一个集合 `seen` 来跟踪已经访问过的 URL，避免重复访问。接下来，使用列表 `to_visit` 来存储当前层待访问的 URL，并使用列表 `res` 来存储所有被访问过的 URL。对于 `to_visit` 中的每个 URL，使用线程池 `ThreadPoolExecutor` 来并行获取每个 URL 链接到的新 URL。对于每个新 URL，如果它未被访问过且与起始 URL 属于同一主机，就将其添加到 `seen`，`res`，以及下一轮的 `to_visit` 中。这个过程重复进行，直到没有新的 URL 可以访问。

时间复杂度: O(n^2)

空间复杂度: O(n)

# 定义获取主机名的函数
from queue import Queue
from concurrent.futures import ThreadPoolExecutor
import threading
import concurrent

def get_hostname(url):
    # 从 URL 中提取主机名部分
    return url.split('/')[2]


class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        # 获取起始 URL 的主机名
        hostname = get_hostname(startUrl)

        # 初始化已访问集合和结果列表
        seen = set()
        seen.add(startUrl)

        # 初始化待访问列表
        to_visit = [startUrl]
        res = [startUrl]

        # 当还有未访问的 URL 时继续
        while(to_visit):
            futures = []
            # 使用线程池并行处理每个 URL
            with ThreadPoolExecutor(max_workers=16) as executor:
                for url in to_visit:
                    futures.append(executor.submit(htmlParser.getUrls, url))
            
            # 清空待访问列表以准备下一轮
            to_visit = []
            # 等待所有线程处理完成并处理结果
            for future in concurrent.futures.as_completed(futures):
                for c in future.result():
                    # 检查新 URL 是否未被访问且属于同一主机
                    if c not in seen and get_hostname(c) == hostname:
                        seen.add(c)
                        res.append(c)
                        to_visit.append(c)
        return res

Explore

在多线程环境中，确保对`seen`集合的访问是线程安全的，可以采用几种策略。一种是使用锁（如`threading.Lock()`）来控制对`seen`的访问，确保在同一时间只有一个线程可以修改`seen`集合。另一种方法是使用线程安全的数据结构，如`queue.Queue`或`concurrent.futures`中的`ConcurrentHashMap`等。在提供的代码中，可以通过在修改`seen`集合前获取一个锁，并在修改完成后释放该锁来实现线程安全。

选择`ThreadPoolExecutor`的`max_workers`参数设为16通常基于系统的CPU核心数和预期的I/O等待时间。选择适当的线程数可以使得CPU和I/O资源得到有效利用，而不至于造成过度的线程切换和资源争抢。过多线程确实可能导致资源竞争增加，特别是当线程数远大于处理器核心数时。在实际应用中，应该根据目标系统的具体配置和任务特性来调整`max_workers`的值，可能需要通过实验来找到最优的线程数。

在多线程环境下，确保`to_visit`列表不重复添加相同的URL需要依赖于`seen`集合的线程安全操作。在每个线程中解析URL前，应先检查该URL是否已存在于`seen`中，这一检查和添加操作应当是原子的，即在获取到锁之后进行检查并添加操作，然后再释放锁。这样可以确保不会有多个线程同时将同一个URL添加到`to_visit`列表中。