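I have a database containing the ISBNs of various books. I collected them with Python and BeautifulSoup. Next, I want to add categories to the books. There is a standard for book categories, and a website, https://www.bol.com/nl/, lists all books with their categories according to that standard.
Start URL: https://www.bol.com/nl/
ISBN: 9780062457738
URL after searching: https://www.bol.com/nl/p/the-subtle-art-of-not-giving-a-f-ck/9200000053655943/
HTML class of the categories: <li class="breadcrumbs__item"
Does anyone know how to (1) enter the ISBN into the search bar, and (2) submit the search query and use the resulting page for scraping?
Step (3), scraping all the categories, is something I can do. But I don't know how to do the first two steps.
The code I have so far for step (2):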
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
webpage = "https://www.bol.com/nl/" # edit me
searchterm = "9780062457738" # edit me
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(webpage)
sbox = driver.find_element_by_class_name("appliedSearchContextId")
sbox.send_keys(searchterm)
submit = driver.find_element_by_class_name("wsp-search__btn tst_headerSearchButton")
submit.click()
The code I have so far for step (3):
import requests
from bs4 import BeautifulSoup
data = requests.get('https://www.bol.com/nl/p/the-subtle-art-of-not-giving-a-f-ck/9200000053655943/')
soup = BeautifulSoup(data.text, 'html.parser')
categoryBar = soup.find('ul',{'class':'breadcrumbs breadcrumbs--show-last-item-small'})
for category in categoryBar.find_all('span',{'class':'breadcrumbs__link-label'}):
    print(category.text)
You can use selenium to locate the input box and iterate over your ISBNs, entering each one:
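from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
# Path to your chromedriver binary (newer Selenium releases take a Service object instead)
d = webdriver.Chrome('/path/to/chromedriver')
books = ['9780062457738']
for book in books:
    d.get('https://www.bol.com/nl/')
    e = d.find_element(By.ID, 'searchfor')  # the search input box
    e.send_keys(book)
    e.send_keys(Keys.ENTER)  # submit the search
    # scrape page here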
Now, for each book ISBN in books, the solution enters the value into the search box and loads the desired page.
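The "scrape page here" step could be filled in roughly as follows. This is only a sketch: it assumes the result page still marks each category with a span of class breadcrumbs__link-label, as in the question, and it swaps BeautifulSoup for the standard-library HTMLParser so that it runs without extra installs. The names BreadcrumbParser and extract_categories, and the sample HTML, are made up for illustration.

```python
from html.parser import HTMLParser


class BreadcrumbParser(HTMLParser):
    """Collect the text of every <span class="breadcrumbs__link-label">."""

    def __init__(self):
        super().__init__()
        self.labels = []
        self._depth = 0  # > 0 while we are inside a label span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth or "breadcrumbs__link-label" in classes:
            if not self._depth:
                self.labels.append("")  # start a new label
            self._depth += 1  # also tracks spans nested inside a label

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.labels[-1] += data


def extract_categories(html):
    """Return the breadcrumb labels found in an HTML string, in page order."""
    parser = BreadcrumbParser()
    parser.feed(html)
    parser.close()
    return [label.strip() for label in parser.labels if label.strip()]


# Made-up sample markup, shaped like the breadcrumb list in the question
SAMPLE_HTML = """
<ul class="breadcrumbs">
  <li class="breadcrumbs__item"><span class="breadcrumbs__link-label">Category A</span></li>
  <li class="breadcrumbs__item"><span class="breadcrumbs__link-label"> Category B </span></li>
</ul>
"""

print(extract_categories(SAMPLE_HTML))
```

Inside the loop you would pass the browser's current HTML, for example extract_categories(d.page_source). The page may need a moment to load after pressing ENTER, so an explicit wait before reading page_source is probably a good idea.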
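You could write a function that returns the category. The search page merely collates its inputs into URL parameters, so you can skip the browser and send a plain GET request with the ISBN in the query string.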
import requests
from bs4 import BeautifulSoup as bs
def get_category(isbn):
    # The search boils down to a GET request with the ISBN in the Ntt parameter
    r = requests.get(f'https://www.bol.com/nl/rnwy/search.html?Ntt={isbn}&searchContext=books_all')
    soup = bs(r.content, 'lxml')
    # Label of the last <li> directly under #option_block_4
    category = soup.select_one('#option_block_4 > li:last-child .breadcrumbs__link-label')
    if category is None:
        return 'Not found'
    return category.text
isbns = ['9780141311357', '9780062457738', '9780141199078']
for isbn in isbns:
    print(get_category(isbn))
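If you want to store the results rather than print them, something like the sketch below might help. It builds the same search URL with urllib.parse.urlencode (so the ISBN is properly escaped) and collects the results in a dict. The helper names build_search_url and categories_for are hypothetical, and the lookup argument stands in for a function such as get_category above.

```python
from urllib.parse import urlencode

# Search endpoint and parameter names taken from the URL used above
SEARCH_URL = "https://www.bol.com/nl/rnwy/search.html"


def build_search_url(isbn, context="books_all"):
    """Return the search URL for one ISBN, with the query string escaped."""
    query = urlencode({"Ntt": isbn, "searchContext": context})
    return f"{SEARCH_URL}?{query}"


def categories_for(isbns, lookup):
    """Map each ISBN to whatever lookup(isbn) returns, e.g. its category."""
    return {isbn: lookup(isbn) for isbn in isbns}


print(build_search_url("9780062457738"))
```

With the function above this would be called as categories_for(isbns, get_category). requests can also build the query string for you if you pass the same dict as params= to requests.get.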