I made a scraper that goes through all the Amazon product pages (each page holds at most 24 products; the template is https://www.amazon.com/s/ref=sr_pg_1?fst=as%3Aoff&rh=n%3A1055398%2Cn%3A284507%2Cn%3A510202%2Ck%3Aas&keywords=as&ie=UTF8&qid=1532414215). I ran the program, but it only gets through the first page. Where should I change the code? Do I have to move the line driver.find_element_by_id("pagnNextString").click()? I have attached the code. Any help is appreciated. Thank you very much.
The program:
    from time import sleep
    from urllib.parse import urljoin
    import csv
    import requests
    from lxml import html
    from selenium import webdriver
    import io

    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch, br",
        "Accept-Language": "en-US,en;q=0.8",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
    }
    proxies = {
        'http': 'http://198.1.122.29:80',
        'https': 'http://204.52.206.65:8080'
    }
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--proxy-server="%s"' % ';'.join(['%s=%s' % (k, v) for k, v in proxies.items()]))
    driver = webdriver.Chrome(executable_path=r"C:\Users\Andrei-PC\Downloads\webdriver\chromedriver.exe",
                              chrome_options=chrome_options)
    header = ['Product title', 'Product price', 'Review', 'ASIN']
    links = []
    url = 'https://www.amazon.com/s/ref=sr_pg_1?fst=as%3Aoff&rh=n%3A1055398%2Cn%3A284507%2Cn%3A510202%2Ck%3Aas&keywords=as&ie=UTF8&qid=1532414215'
    while True:
        try:
            print('Fetching url [%s]...' % url)
            response = requests.get(url, headers=headers, proxies=proxies, stream=True)
            if response.status_code == 200:
                try:
                    products = driver.find_elements_by_xpath('//li[starts-with(@id, "result_")]')
                    for product in products:
                        title = product.find_element_by_tag_name('h2').text
                        price = ([item.text for item in
                                  product.find_elements_by_xpath('.//a/span[contains(@class, "a-color-base")]')] +
                                 ["No price"])[0]
                        review = ([item.get_attribute('textContent') for item in
                                   product.find_elements_by_css_selector('i.a-icon-star>span.a-icon-alt')] +
                                  ["No review"])[0]
                        asin = product.get_attribute('data-asin') or "No asin"
                        try:
                            data = [title, price, review, asin]
                        except:
                            print('no items')
                        with io.open('csv/furniture.csv', "a", newline="", encoding="utf-8") as output:
                            writer = csv.writer(output)
                            writer.writerow(data)
                    driver.find_element_by_id("pagnNextString").click()
                except IndexError:
                    break
        except Exception:
            print("Connection refused by the server..")
            print("Let me sleep for 5 seconds")
            print("ZZzzzz...")
            sleep(5)
            print("Was a nice sleep, now let me continue...")
        url = urljoin('https://www.amazon.com', next_url)
        for i in range(len(url)):
            driver.get(url[i])
These lines do the following:

    url = urljoin('https://www.amazon.com', next_url)

builds a URL string, e.g. https://www.amazon.com/some_source, and assigns it to the url variable;

    for i in range(len(url))

iterates over the integers 0, 1, 2, 3, ... len(url) - 1, assigning each one to i; and

    driver.get(url[i])

navigates to a single character, e.g. driver.get("h"), driver.get("t"), and so on.
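You can see the problem in plain Python, without Selenium at all (the /some_source path here is just a placeholder):

```python
from urllib.parse import urljoin

# urljoin returns a single string, not a list of URLs
url = urljoin('https://www.amazon.com', '/some_source')
print(url)     # https://www.amazon.com/some_source

# Indexing a string yields individual characters, so url[i] is never a URL
print(url[0])  # h
print(url[1])  # t
```

So the loop ends up calling driver.get() once per character of the URL, which is why navigation never works.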
I don't know what exactly you're trying to do, but I guess you need

    url = urljoin('https://www.amazon.com', next_url)
    driver.get(url)
UPDATE
If you need to check all the pages, try adding

    driver.find_element_by_xpath('//a/span[@id="pagnNextString"]').click()

after scraping each page.
Also note that for product in products can never raise an IndexError, so you can drop the try/except around that loop.
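Putting those points together, the paging loop could be structured like this. This is only a sketch: scrape_all_pages is a name I made up, and scrape_page stands in for your per-product extraction and CSV writing; find_elements_by_id is the plural form of the lookup you already use.

```python
def scrape_all_pages(driver, scrape_page):
    """Scrape the current page, then click 'Next' until it disappears."""
    pages = 0
    while True:
        scrape_page(driver)  # your per-page extraction and CSV writing
        pages += 1
        # find_elements (plural) returns an empty list on the last page
        # instead of raising an exception, so no try/except is needed
        next_links = driver.find_elements_by_id("pagnNextString")
        if not next_links:
            break
        next_links[0].click()
    return pages
```

In newer Selenium releases the same lookup would be spelled driver.find_elements(By.ID, "pagnNextString"); the control flow stays the same.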