How to extract product info (title, price, reviews, ASIN) from all Amazon product pages? (python, web-scraping)



I wrote a scraper that should iterate over all Amazon product pages (each page holds up to 24 products; the template is https://www.amazon.com/s/ref=sr_pg_1?fst=as%3Aoff&rh=n%3A1055398%2Cn%3A284507%2Cn%3A510202%2K%3Aas&keywords=as&ie=UTF8&qid=1532414215). I ran the program, but it only processes the first page. Where should I modify the code? Do I need to move the line driver.find_element_by_id("pagnNextString").click()? I've attached the code below. I would appreciate any help. Thank you very much.

The program:

from time import sleep
from urllib.parse import urljoin
import csv
import requests
from lxml import html
from selenium import webdriver
import io
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, sdch, br",
    "Accept-Language": "en-US,en;q=0.8",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
}
proxies = {
    'http': 'http://198.1.122.29:80',
    'https': 'http://204.52.206.65:8080'
}
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server="%s"' % ';'.join(['%s=%s' % (k, v) for k, v in proxies.items()]))
driver = webdriver.Chrome(executable_path=r"C:\Users\Andrei-PC\Downloads\webdriver\chromedriver.exe",
                          chrome_options=chrome_options)
header = ['Product title', 'Product price', 'Review', 'ASIN']
links = []
url = 'https://www.amazon.com/s/ref=sr_pg_1?fst=as%3Aoff&rh=n%3A1055398%2Cn%3A284507%2Cn%3A510202%2Ck%3Aas&keywords=as&ie=UTF8&qid=1532414215'
while True:
    try:
        print('Fetching url [%s]...' % url)
        response = requests.get(url, headers=headers, proxies=proxies, stream=True)
        if response.status_code == 200:
            try:
                products = driver.find_elements_by_xpath('//li[starts-with(@id, "result_")]')
                for product in products:
                    title = product.find_element_by_tag_name('h2').text
                    price = ([item.text for item in
                              product.find_elements_by_xpath('.//a/span[contains(@class, "a-color-base")]')] +
                             ["No price"])[0]
                    review = ([item.get_attribute('textContent') for item in
                               product.find_elements_by_css_selector('i.a-icon-star>span.a-icon-alt')] +
                              ["No review"])[0]
                    asin = product.get_attribute('data-asin') or "No asin"
                    try:
                        data = [title, price, review, asin]
                    except:
                        print('no items')
                    with io.open('csv/furniture.csv', "a", newline="", encoding="utf-8") as output:
                        writer = csv.writer(output)
                        writer.writerow(data)
                driver.find_element_by_id("pagnNextString").click()
            except IndexError:
                break
    except Exception:
        print("Connection refused by the server..")
        print("Let me sleep for 5 seconds")
        print("ZZzzzz...")
        sleep(5)
        print("Was a nice sleep, now let me continue...")

    url = urljoin('https://www.amazon.com', next_url)
    for i in range(len(url)):
        driver.get(url[i])

These lines do the following:

  1. url = urljoin('https://www.amazon.com', next_url) takes the URL as a string, e.g. https://www.amazon.com/some_source, and assigns it to the url variable
  2. for i in range(len(url)) iterates over the integers 0, 1, 2, ..., len(url) - 1 and assigns each one to the i variable
  3. driver.get(url[i]) navigates to a single character, e.g. driver.get("h"), driver.get("t")
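Point 3 is easy to verify in plain Python: indexing a string with an integer yields single characters, which is exactly what driver.get() would receive on each iteration (the URL here is just an illustrative example):

```python
# Indexing a URL string character by character, as the questioner's
# loop does, yields single characters rather than URLs.
url = 'https://www.amazon.com/some_source'

chars = [url[i] for i in range(len(url))]
print(chars[:5])  # the first five "pages" the loop would try to open
# → ['h', 't', 't', 'p', 's']
```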

I don't know what exactly you are trying to do, but I guess you need

url = urljoin('https://www.amazon.com', next_url)
driver.get(url)

Update

If you need to check all the pages, try adding

driver.find_element_by_xpath('//a/span[@id="pagnNextString"]').click()

after scraping each page.

Also note that for product in products can never raise an IndexError, so you can drop the try/except around that loop.
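The overall loop shape being suggested — scrape the current page, then try to click "Next", and stop when the link is gone — can be sketched without a browser. Here NoSuchElementException is a stand-in for Selenium's exception of the same name, and scrape_page/next_page are hypothetical stubs for the real driver calls:

```python
# Minimal sketch of the pagination loop, with the "click Next" step
# stubbed out so it runs without Selenium. In the real script,
# next_page() would be driver.find_element_by_xpath(...).click(),
# which raises NoSuchElementException (not IndexError) on the last page.

class NoSuchElementException(Exception):
    """Stand-in for selenium.common.exceptions.NoSuchElementException."""

pages = [['A1', 'A2'], ['B1'], ['C1', 'C2', 'C3']]  # fake paginated results
current = 0

def scrape_page():
    return pages[current]

def next_page():
    global current
    if current + 1 >= len(pages):
        raise NoSuchElementException("no Next link on the last page")
    current += 1

collected = []
while True:
    collected.extend(scrape_page())  # scrape the page we are on
    try:
        next_page()                  # then advance; stop when Next is gone
    except NoSuchElementException:
        break

print(collected)  # → ['A1', 'A2', 'B1', 'C1', 'C2', 'C3']
```

The key design point is that the loop terminates on the missing "Next" element, not on an IndexError from the product loop.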

Latest update