Python multiprocessing scraping returns duplicate results



I am building a scraper that needs to run quickly over a large number of web pages. The output of the code below is a CSV file containing a list of links (among other things). Basically, I build a list of web pages that each contain several links, and I collect those links for every page.

Adding multiprocessing produces some strange results I cannot explain. If I run this code with the pool size set to 1 (i.e. without multithreading), the final result contains about 0.5% duplicate links (which is fair enough). As soon as I speed it up by setting the pool size to 8, 12 or 24, I get roughly 25% duplicate links in the final result.

I suspect my mistake is either in the way I write the results to the CSV file, or in the way I use the imap() function (the same happens with imap_unordered, map, etc.), which somehow lets the threads access the same elements of the iterable that is passed in. Any suggestions?

#!/usr/bin/env python
#  coding: utf8
import sys
import requests, re, time
from bs4 import BeautifulSoup
from lxml import etree
from lxml import html
import random
import unicodecsv as csv
import progressbar
import multiprocessing
from multiprocessing.pool import ThreadPool

keyword = "keyword"

def openup():
    global crawl_list
    try:
        ### Generate list URLS based on the number of results for the keyword, each of these contains other links. The list is subsequently randomized
        startpage = 1
        ## Get endpage
        url0 = myurl0
        r0 = requests.get(url0)
        print "First request: "+str(r0.status_code)
        tree = html.fromstring(r0.content)
        endpage = tree.xpath("//*[@id='habillagepub']/div[5]/div/div[1]/section/div/ul/li[@class='adroite']/a/text()")
        print str(endpage[0]) + " pages found"
        ### Generate random sequence for crawling
        crawl_list = random.sample(range(1,int(endpage[0])+1), int(endpage[0]))
        return crawl_list
    except Exception as e:
        ### Catches openup error and return an empty crawl list, then breaks
        print e
        crawl_list = []
        return crawl_list

def worker_crawl(x):
    ### Open page
    url_base = myurlbase
    r = requests.get(url_base)
    print "Connecting to page " + str(x) +" ..."+ str(r.status_code)
    while True:
        if r.status_code == 200:
            tree = html.fromstring(r.content)
            ### Get data
            titles = tree.xpath('//*[@id="habillagepub"]/div[5]/div/div[1]/section/article/div/div/h3/a/text()')
            links = tree.xpath('//*[@id="habillagepub"]/div[5]/div/div[1]/section/article/div/div/h3/a/@href')
            abstracts = tree.xpath('//*[@id="habillagepub"]/div[5]/div/div[1]/section/article/div/div/p/text()')
            footers = tree.xpath('//*[@id="habillagepub"]/div[5]/div/div[1]/section/article/div/div/span/text()')
            dates = []
            pagenums = []
            for f in footers:
                pagenums.append(x)
                match = re.search(r'| .+$', f)
                if match:
                    date = match.group()
                    dates.append(date)
            pageindex = zip(titles,links,abstracts,footers,dates,pagenums) #what if there is a missing value?
            return pageindex
        else:
            pageindex = [[str(r.status_code),"","","","",str(x)]]
            return pageindex
        continue

def mp_handler():
    ### Write down:
    with open(keyword+'_results.csv', 'wb') as outcsv:
        wr = csv.DictWriter(outcsv, fieldnames=["title","link","abstract","footer","date","pagenum"])
        wr.writeheader()
        results = p.imap(worker_crawl, crawl_list)
        for result in results:
            for x in result:
                wr.writerow({
                    #"keyword": str(keyword),
                    "title": x[0],
                    "link": x[1],
                    "abstract": x[2],
                    "footer": x[3],
                    "date": x[4],
                    "pagenum": x[5],
                })

if __name__=='__main__':
    p = ThreadPool(4)
    openup()
    mp_handler()
    p.terminate()
    p.join()

Are you sure the pages respond correctly under a fast sequence of requests? I have run into cases where the site being scraped responded differently when requests arrived quickly than when they were spaced out. Meaning, everything looked fine while debugging, but as soon as the requests were fast and back-to-back, the site decided to give me a different response (a quick way to check this is sketched after the list below). Apart from that, I would ask whether the fact that you are writing from a non-thread-safe environment has an impact: to minimize interference and data problems in the final CSV output, you could:

  • use wr.writerows with a chunk of rows to be written (see the sketch after this list)
  • use a threading.Lock, as in Multiple threads writing to the same CSV in Python
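
For the first point, a quick sanity check, as a sketch only: it reuses the links XPath from worker_crawl in the question, and myurlbase is the same undefined placeholder the question uses, so substitute a concrete page URL before running. It requests the same page several times in a fast loop and compares the sets of links that come back:

import requests
from lxml import html

def fetch_links(url):
    ### Same XPath as in worker_crawl: the article links on one result page
    r = requests.get(url)
    tree = html.fromstring(r.content)
    return tree.xpath('//*[@id="habillagepub"]/div[5]/div/div[1]/section/article/div/div/h3/a/@href')

url = myurlbase  # placeholder from the question, replace with a real page URL
first = fetch_links(url)
for i in range(5):
    ### Hit the same page again immediately and compare against the first response
    if set(fetch_links(url)) != set(first):
        print "Request %d returned a different set of links" % (i + 1)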
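
For the CSV side, a minimal sketch of the two bullet points above, assuming the tuple layout returned by worker_crawl and the DictWriter wr from mp_handler in the question; csv_lock and write_page_rows are hypothetical names introduced here only for illustration. Note that in the question's current layout only the main thread writes, so the lock mainly matters if you ever move the writing into the workers themselves:

import threading

csv_lock = threading.Lock()  # hypothetical: one shared lock per output file

def write_page_rows(wr, pageindex):
    ### Turn one page's tuples into dicts and write them in a single writerows() call,
    ### holding the lock so rows from different threads never interleave
    rows = [{"title": x[0], "link": x[1], "abstract": x[2],
             "footer": x[3], "date": x[4], "pagenum": x[5]} for x in pageindex]
    with csv_lock:
        wr.writerows(rows)

### In mp_handler the inner loop would then become:
###     for result in results:
###         write_page_rows(wr, result)

Batching one page per writerows() call keeps the per-row overhead down and keeps each page's rows together in the output.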
