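I am trying to scrape the "product details" section (a table) and the "please select a size" section (JavaScript buttons) from this JS-based web page: https://www.breuninger.com/de/damen/luxus/bekleidung-jacken-maentel/
I am using scrapy-selenium to scrape the page. The code below scrapes everything except these two sections. When I tested with Selenium alone I got the results, but not with scrapy-selenium. I also tried scrapy-splash, but it does not even render the whole page. I have looked through previous questions and could not find an answer. What exactly am I doing wrong?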
import scrapy
from scrapy_selenium import SeleniumRequest
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time


class ProductsSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['www.breuninger.com']

    def start_requests(self):
        options = webdriver.ChromeOptions()
        options.headless = True
        driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
        driver.set_window_size(1920, 1080)
        driver.get("https://www.breuninger.com/de/damen/luxus/bekleidung-jacken-maentel/")
        time.sleep(5)
        banner_btn = driver.find_element(By.XPATH, "//div[@class='banner-actions-container']/button")
        banner_btn.click()
        time.sleep(3)
        links = driver.find_elements(By.XPATH, "//suchen-produktliste[@id='produktliste']/section/div/suchen-produkt/div/a")
        for link in links:
            href = link.get_attribute('href')
            yield SeleniumRequest(
                url=href,
                callback=self.parse,
                wait_time=1
            )
        driver.quit()
        return super().start_requests()

    def parse(self, response):
        yield {
            'Bold-title': response.xpath("(//span[@itemprop='name'])[1]/text()").get(),
            'Price': response.xpath("//div[@itemprop='offers']/span/text()").get(),
            'Beschreibung': response.xpath("//div[@class='bewerten-textformat--produktdetails-detail']/div/ul/li/text()").getall()
        }
You really don't need the heavy guns of selenium here to get product details such as price, description, and brand.
You can try:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

url = "https://www.breuninger.com/de/damen/luxus/bekleidung-jacken-maentel/"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36",
}

# Fetch the listing page and collect the <a> element of every product tile.
product_links = (
    BeautifulSoup(
        requests.get(url, headers=headers).text,
        "lxml",
    ).select(".suchen-produkt a")
)

# Pull brand, name, and price text out of each tile.
products = [
    [
        i.select_one(".suchen-produkt__marke").getText(),
        i.select_one(".suchen-produkt__name").getText(),
        i.select_one(".suchen-produkt__preis").getText(),
    ] for i in product_links
]

# Save the rows to a CSV file and print them as a grid table.
df = pd.DataFrame(products, columns=["Brand", "Description", "Price"])
df.to_csv("products.csv", index=False)
print(tabulate(df, headers="keys", tablefmt="grid"))
This should give you a table like this (along with a .csv file):
+----+-------------------------+--------------------------------------------------------------+------------------+
| | Brand | Description | Price |
+====+=========================+==============================================================+==================+
| 0 | BURBERRY | Jacke BINHAM | 1.549,99 € |
+----+-------------------------+--------------------------------------------------------------+------------------+
| 1 | BURBERRY | Trenchcoat KENSINGTON | 1.849,99 € |
+----+-------------------------+--------------------------------------------------------------+------------------+
| 2 | RALPH LAUREN Collection | Blouson mit Schmucksteinen | 2.050 € |
+----+-------------------------+--------------------------------------------------------------+------------------+
| 3 | BURBERRY | Trenchcoat KENSINGTON | 1.849,99 € |
+----+-------------------------+--------------------------------------------------------------+------------------+
| 4 | BURBERRY | Trenchcoat WATERLOO | 1.889,99 € |
+----+-------------------------+--------------------------------------------------------------+------------------+
| 5 | BURBERRY | Trenchcoat ISLINGTON | 1.849,99 € |
+----+-------------------------+--------------------------------------------------------------+------------------+
| 6 | BURBERRY | Trenchcoat WATERLOO | 1.889,99 € |
+----+-------------------------+--------------------------------------------------------------+------------------+
| 7 | MONCLER | Daunenweste LIANE | 495 € |
+----+-------------------------+--------------------------------------------------------------+------------------+
| 8 | BURBERRY | Trenchcoat | 1.849,99 € |
+----+-------------------------+--------------------------------------------------------------+------------------+
| 9 | MONCLER | Jacke im Materialmix | 650 € |
+----+-------------------------+--------------------------------------------------------------+------------------+
| 10 | MONCLER | Jacke AGDE | 695 € |
+----+-------------------------+--------------------------------------------------------------+------------------+
| 11 | MONCLER | Jacke CECILE | 520 € |
+----+-------------------------+--------------------------------------------------------------+------------------+
| 12 | MONCLER | Jacke TIYA | 695 € |
+----+-------------------------+--------------------------------------------------------------+------------------+
| 13 | MONCLER | Daunenweste LIANE | 495 € |
+----+-------------------------+--------------------------------------------------------------+------------------+
| 14 | MONCLER | Daunenparka HERMANVILLE | 1.250 € |
+----+-------------------------+--------------------------------------------------------------+------------------+
| 15 | BURBERRY | Trenchcoat KENSINGTON | 999,99 € |
+----+-------------------------+--------------------------------------------------------------+------------------+
| 16 | MONCLER | Jacke AGDE | 695 € |
+----+-------------------------+--------------------------------------------------------------+------------------+
| 17 | MONCLER | Daunenweste ALPISTE | 750 € |
+----+-------------------------+--------------------------------------------------------------+------------------+
| 18 | MONCLER | Regenmantel HIENGU | 735 € |
+----+-------------------------+--------------------------------------------------------------+------------------+
| 19 | MONCLER | Jacke TIYA | 695 € |
+----+-------------------------+--------------------------------------------------------------+------------------+
| 20 | MONCLER | Jacke HOULGATE | 780 € |
+----+-------------------------+--------------------------------------------------------------+------------------+
and more ...
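
If you want to sort or filter on the Price column, the strings need converting to numbers first. Here is a minimal sketch, assuming every price string looks like the ones in the table above (German notation, a single amount followed by €). The helper name parse_euro_price is my own and is not part of the script above.

```python
from decimal import Decimal


def parse_euro_price(text: str) -> Decimal:
    """Convert a German-formatted price such as '1.549,99 €' to a Decimal.

    Assumes a single amount per string; a tile that also shows a
    struck-through old price would need extra handling.
    """
    # Drop the currency sign and any (non-breaking) spaces around the number.
    cleaned = text.replace("€", "").replace("\xa0", " ").strip()
    # German notation: '.' groups thousands, ',' marks the decimals.
    cleaned = cleaned.replace(".", "").replace(",", ".")
    return Decimal(cleaned)


sample_prices = [parse_euro_price(p) for p in ["1.549,99 €", "2.050 €", "495 €"]]
print(sample_prices)
```

With the DataFrame from the script, something like df["Price"].map(parse_euro_price) should then give a numeric column.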
P.S. The last XPath doesn't work on that page, which is why you get an empty list.
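
I have not checked why that XPath misses on the real page. One common cause is that a predicate like [@class='...'] is an exact string comparison, so it matches nothing when the element carries a different or additional class. The sketch below shows that effect offline with the standard library; the markup and class names are invented for illustration and are not the site's actual HTML.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-in for a product detail page (NOT the real markup).
SAMPLE = """
<html><body>
  <div class="produktdetails extra">
    <div><ul><li>Wolle</li><li>Baumwolle</li></ul></div>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE)


def li_texts(path: str) -> list:
    """Return the text of every element matched by an ElementTree path."""
    return [el.text for el in root.findall(path)]


# Exact class comparison: 'produktdetails' != 'produktdetails extra', so no match.
exact = li_texts(".//div[@class='produktdetails']/div/ul/li")
# Structural path without the class predicate still finds the list items.
loose = li_texts(".//div/div/ul/li")

print(exact)
print(loose)
```

Running the failing expression against a saved copy of the product page in the same way is a quick check of whether the selector or the rendering is at fault.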