I've written a script in Python with BeautifulSoup that extracts the book titles Amazon shows after I feed some ISBN numbers into its search box. I supply those ISBNs from an excel file named amazon.xlsx. When I run the following script, it parses the titles accordingly and writes them back to the excel file as expected. This is the search URL in which I place the ISBN numbers to populate the results.
import requests
from bs4 import BeautifulSoup
from openpyxl import load_workbook

wb = load_workbook('amazon.xlsx')
ws = wb['content']

def get_info(num):
    params = {
        'url': 'search-alias=aps',
        'field-keywords': num
    }
    res = requests.get("https://www.amazon.com/s/ref=nb_sb_noss?", params=params)
    soup = BeautifulSoup(res.text, "lxml")
    itemlink = soup.select_one("a.s-access-detail-page")
    if itemlink:
        get_data(itemlink['href'])

def get_data(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    try:
        itmtitle = soup.select_one("#productTitle").get_text(strip=True)
    except AttributeError:
        itmtitle = "NA"
    print(itmtitle)
    ws.cell(row=row, column=2).value = itmtitle
    wb.save("amazon.xlsx")

if __name__ == '__main__':
    for row in range(2, ws.max_row + 1):
        if ws.cell(row=row, column=1).value == None:
            break
        val = ws["A" + str(row)].value
        get_info(val)
However, when I try to do the same thing using multiprocessing, I get the following error:

ws.cell(row=row, column=2).value = itmtitle
NameError: name 'row' is not defined

The changes I made to the script for multiprocessing are:
from multiprocessing import Pool

if __name__ == '__main__':
    isbnlist = []
    for row in range(2, ws.max_row + 1):
        if ws.cell(row=row, column=1).value == None:
            break
        val = ws["A" + str(row)].value
        isbnlist.append(val)

    with Pool(10) as p:
        p.map(get_info, isbnlist)
        p.terminate()
        p.join()
A few of the ISBNs I've tried:
9781584806844
9780917360664
9780134715308
9781285858265
9780986615108
9780393646399
9780134612966
9781285857589
9781453385982
9780134683461
How can I get rid of that error and achieve the desired result using multiprocessing?
Referencing the global variable row in get_data() makes no sense, because:

- it is a global, and it won't be shared between the "threads" in the multiprocessing pool, since they are actually separate Python processes that do not share globals.
- even if they did share it, the entire ISBN list is built before get_info() is ever executed, so the value of row would always be ws.max_row + 1 by the time it's read, because the loop has already completed.
So you would need to supply the row value as part of the data passed in the second argument to p.map(). But even if you were to do that, writing to and saving the spreadsheet from multiple processes is a bad idea because of file locking on Windows, race conditions, and so on. You're better off just building the list of titles via multiprocessing, then writing them out once when that's done, like this:
import requests
from bs4 import BeautifulSoup
from openpyxl import load_workbook
from multiprocessing import Pool

def get_info(isbn):
    params = {
        'url': 'search-alias=aps',
        'field-keywords': isbn
    }
    res = requests.get("https://www.amazon.com/s/ref=nb_sb_noss?", params=params)
    soup = BeautifulSoup(res.text, "lxml")
    itemlink = soup.select_one("a.s-access-detail-page")
    if itemlink:
        return get_data(itemlink['href'])

def get_data(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    try:
        itmtitle = soup.select_one("#productTitle").get_text(strip=True)
    except AttributeError:
        itmtitle = "NA"
    return itmtitle

def main():
    wb = load_workbook('amazon.xlsx')
    ws = wb['content']

    isbnlist = []
    for row in range(2, ws.max_row + 1):
        if ws.cell(row=row, column=1).value is None:
            break
        val = ws["A" + str(row)].value
        isbnlist.append(val)

    with Pool(10) as p:
        titles = p.map(get_info, isbnlist)
        p.terminate()
        p.join()

    # p.map() preserves input order, so titles[i] corresponds to isbnlist[i];
    # enumerating titles keeps the write-back aligned even when the ISBN loop
    # stopped early at a blank cell.
    for i, title in enumerate(titles):
        ws.cell(row=i + 2, column=2).value = title

    wb.save("amazon.xlsx")

if __name__ == '__main__':
    main()