Multithreaded web scraping



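I wrote some code that collects links from the web. It takes about 2 minutes 20 seconds to run, and it is only one function in my program. I'd like to make it more efficient. I've considered multithreading, but I'm having trouble understanding it well enough to apply it to this code.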

import re
import requests

def get_manufacturer():
    manufacturers = requests.get("https://www.gsmarena.com/")
    # single-quoted raw strings so the double quotes in the HTML need no escaping
    res = re.findall(r'<li><a href="samsung-phones-9.php">.+\n', manufacturers.text)
    manufacturer_links = re.findall(r'<li><a href="(.+?)">', res[0])
    final_list = []
    for i in range(len(manufacturer_links)):
        final_list.append("https://www.gsmarena.com/" + manufacturer_links[i])
    # find pages
    for i in final_list:
        req = requests.get(i)
        res2 = re.findall(r"<strong>1</strong>(.+)</div>", req.text)
        for k in res2:
            if k is not None:
                pages = re.findall(r'<a href="(.+?)">.</a>', res2[0])
                for j in range(len(pages)):
                    final_list.append("https://www.gsmarena.com/" + pages[j])
    return final_list

You can run a loop in parallel as shown below:

import multiprocessing as mul

def calcIntOfnth(i, ppStr, c, znot):
    ...  # the body of one loop iteration goes here

# ppStr, c, znot and k come from the surrounding code
pool = mul.Pool(mul.cpu_count())
results = pool.starmap(calcIntOfnth, [(i, ppStr, c, znot) for i in range(k)])
pool.close()

You need to rewrite one of your for loops as a function and run it in parallel, using a Pool object or something similar.
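Applied to the scraper in the question, that advice might look like the sketch below. It swaps `multiprocessing.Pool` for `concurrent.futures.ThreadPoolExecutor`, since the work is mostly waiting on HTTP responses rather than using the CPU. The names `extract_page_links`, `collect_links`, `fetch` and `max_workers` are made up for illustration, and the regular expressions are copied from the question, so treat this as a starting point rather than a tested replacement.

```python
import re
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "https://www.gsmarena.com/"
# Patterns taken from the question's code
PAGE_BLOCK = re.compile(r"<strong>1</strong>(.+)</div>")
PAGE_LINK = re.compile(r'<a href="(.+?)">.</a>')


def extract_page_links(html):
    """Return absolute URLs of the extra result pages linked from one page."""
    blocks = PAGE_BLOCK.findall(html)
    if not blocks:
        return []
    return [BASE_URL + href for href in PAGE_LINK.findall(blocks[0])]


def collect_links(manufacturer_urls, fetch, max_workers=16):
    """Download every manufacturer page concurrently and gather its page links.

    `fetch` is a hypothetical callable that takes a URL and returns the HTML text.
    """
    urls = list(manufacturer_urls)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as `urls`
        per_page = executor.map(lambda url: extract_page_links(fetch(url)), urls)
        result = list(urls)
        for links in per_page:
            result.extend(links)
    return result
```

With `requests` installed, a call could look like `collect_links(urls, lambda url: requests.get(url, timeout=10).text)`, where `urls` is the list of manufacturer URLs built in the first loop of `get_manufacturer`. Passing `fetch` in as a parameter also makes it possible to try the function on saved HTML without touching the network.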
