Multiprocessing with text scraping



I want to scrape the <p> elements from pages, and since there will be several thousand of them, I want to use multiprocessing. However, when I try to append the results to a variable, it doesn't work.

I want to append the scraped results to data = []

I made a url_common for the base website, because some pages don't start with HTTP etc.

from tqdm import tqdm
import faster_than_requests as requests  # 20% faster on average in my case than urllib.request
import bs4 as bs

def scrape(link, data):
    for i in tqdm(link):
        if i[:3] != 'htt':
            url_common = 'https://www.common_url.com/'
        else:
            url_common = ''
        try:
            ht = requests.get2str(url_common + str(i))
        except:
            pass
        parsed = bs.BeautifulSoup(ht, 'lxml')
        paragraphs = parsed.find_all('p')
        for p in paragraphs:
            data.append(p.text)

The above doesn't work, because map() doesn't accept a function like the one above.

I tried using it another way:

def scrape(link):
    for i in tqdm(link):
        if i[:3] != 'htt':
            url_common = 'https://www.common_url.com/'
        else:
            url_common = ''
        try:
            ht = requests.get2str(url_common + str(i))
        except:
            pass
        parsed = bs.BeautifulSoup(ht, 'lxml')
        paragraphs = parsed.find_all('p')
        for p in paragraphs:
            print(p.text)

from multiprocessing import Pool
p = Pool(10)
links = ['link', 'other_link', 'another_link']
data = p.map(scrape, links)

When using the function above, I get this error:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\process.py", line 297, in _bootstrap
    self.run()
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 110, in worker
    task = get()
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\queues.py", line 354, in get
    return _ForkingPickler.loads(res)
AttributeError: Can't get attribute 'scrape' on <module '__main__' (built-in)>

I haven't yet found a way to do this so that it uses Pool while also appending the scraped results to a given variable.

EDIT

I changed it a bit to see where it stops:

def scrape(link):
    for i in tqdm(link):
        if i[:3] != 'htt':
            url_common = 'https://www.investing.com/'
        else:
            url_common = ''
        try:  # tries are always helpful with urls as you never know
            ht = requests.get2str(url_common + str(i))
        except:
            pass
        print('works1')
        parsed = bs.BeautifulSoup(ht, 'lxml')
        paragraphs = parsed.find_all('p')
        print('works2')
        for p in paragraphs:
            print(p.text)

links = ['link', 'other_link', 'another_link']
scrape(links)
# WORKS PROPERLY AND PRINTS EVERYTHING

if __name__ == '__main__':
    p = Pool(5)
    print(p.map(scrape, links))
    # DOESN'T WORK, NOTHING PRINTS. Error like above

You are using the map function incorrectly.

It iterates over each element of the iterable and calls the function on each one.

You can think of the map function as doing something like this:

to_be_mapped = [1, 2, 3]
mapped = []

def mapping(x):  # <-- note that the mapping accepts a single value
    return x**2

for item in to_be_mapped:
    res = mapping(item)
    mapped.append(res)
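The loop above is exactly what the built-in map does for you; a small illustration with the same mapping function:

```python
to_be_mapped = [1, 2, 3]

def mapping(x):  # accepts a single value, just like above
    return x**2

# map() calls mapping() on each element; list() collects the results
mapped = list(map(mapping, to_be_mapped))
print(mapped)  # [1, 4, 9]
```

Pool.map works the same way, except the individual calls are distributed across the worker processes.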

So to solve your problem, remove the outermost for loop, since the iteration is handled by the map function:

def scrape(link):
    if link[:3] != 'htt':
        url_common = 'https://www.common_url.com/'
    else:
        url_common = ''
    try:
        ht = requests.get2str(url_common + str(link))
    except:
        pass
    parsed = bs.BeautifulSoup(ht, 'lxml')
    paragraphs = parsed.find_all('p')
    for p in paragraphs:
        print(p.text)
