Multiprocessing with text scraping



I want to scrape the <p> elements from pages, and since there will be several thousand of them, I want to use multiprocessing. However, when I try to append the results to a variable, it doesn't work.

I want to append the scraped results to data = []

I made a url_common for the base website, because some pages don't start with HTTP etc.

from tqdm import tqdm
import faster_than_requests as requests  # 20% faster on average in my case than urllib.request
import bs4 as bs

def scrape(link, data):
    for i in tqdm(link):
        if i[:3] != 'htt':
            url_common = 'https://www.common_url.com/'
        else:
            url_common = ''
        try:
            ht = requests.get2str(url_common + str(i))
        except:
            pass
        parsed = bs.BeautifulSoup(ht, 'lxml')
        paragraphs = parsed.find_all('p')
        for p in paragraphs:
            data.append(p.text)

The above doesn't work, because map() doesn't accept a function like the one above.

I tried using it another way:

def scrape(link):
    for i in tqdm(link):
        if i[:3] != 'htt':
            url_common = 'https://www.common_url.com/'
        else:
            url_common = ''
        try:
            ht = requests.get2str(url_common + str(i))
        except:
            pass
        parsed = bs.BeautifulSoup(ht, 'lxml')
        paragraphs = parsed.find_all('p')
        for p in paragraphs:
            print(p.text)

from multiprocessing import Pool
p = Pool(10)
links = ['link', 'other_link', 'another_link']
data = p.map(scrape, links)

When using the function above, I get this error:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\process.py", line 297, in _bootstrap
    self.run()
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 110, in worker
    task = get()
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\queues.py", line 354, in get
    return _ForkingPickler.loads(res)
AttributeError: Can't get attribute 'scrape' on <module '__main__' (built-in)>

I haven't yet found a way to do this so that it uses Pool while also appending the scraped results to a given variable.

EDIT

I changed it a bit to see where it stops:

def scrape(link):
    for i in tqdm(link):
        if i[:3] != 'htt':
            url_common = 'https://www.investing.com/'
        else:
            url_common = ''
        try:  # tries are always helpful with urls as you never know
            ht = requests.get2str(url_common + str(i))
        except:
            pass
        print('works1')
        parsed = bs.BeautifulSoup(ht, 'lxml')
        paragraphs = parsed.find_all('p')
        print('works2')
        for p in paragraphs:
            print(p.text)

links = ['link', 'other_link', 'another_link']
scrape(links)
# WORKS PROPERLY AND PRINTS EVERYTHING

if __name__ == '__main__':
    p = Pool(5)
    print(p.map(scrape, links))
    # DOESN'T WORK, NOTHING PRINTS. Error like above

You are using the map function incorrectly.

It iterates over each element of the iterable and calls the function on each one.

You can think of the map function as doing something like this:

to_be_mapped = [1, 2, 3]
mapped = []

def mapping(x):  # <-- note that the mapping accepts a single value
    return x**2

for item in to_be_mapped:
    res = mapping(item)
    mapped.append(res)
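The loop above is exactly what the built-in map does for you; a small illustration with the same mapping function:

```python
to_be_mapped = [1, 2, 3]

def mapping(x):  # accepts a single value, just like above
    return x**2

# map() calls mapping() on each element; list() collects the results
mapped = list(map(mapping, to_be_mapped))
print(mapped)  # [1, 4, 9]
```

Pool.map works the same way, except the individual calls are distributed across the worker processes.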

So to solve your problem, remove the outermost for loop, since the iteration is handled by the map function:

def scrape(link):
    if link[:3] != 'htt':
        url_common = 'https://www.common_url.com/'
    else:
        url_common = ''
    try:
        ht = requests.get2str(url_common + str(link))
    except:
        pass
    parsed = bs.BeautifulSoup(ht, 'lxml')
    paragraphs = parsed.find_all('p')
    for p in paragraphs:
        print(p.text)
