Scraping links from all of Amazon's pages with Beautiful Soup results in an error



I am trying to scrape product URLs from the Amazon online store by going through each page of the search results.

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
           "Accept-Encoding": "gzip, deflate",
           "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
           "DNT": "1",
           "Connection": "close",
           "Upgrade-Insecure-Requests": "1"}

products = set()
for i in range(1, 21):
    url = 'https://www.amazon.fr/s?k=phone%2Bcase&page=' + str(i)
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    print(soup)  # prints the HTML content saying Error on Amazon's side
    links = soup.select('a.a-link-normal.a-text-normal')
    for tag in links:
        url_product = 'https://www.amazon.fr' + tag.attrs['href']
        products.add(url_product)

Instead of getting the page content, I get a "Sorry, something went wrong on our end" HTML error page. What is the reason behind this, and how can I successfully bypass the error and scrape the products?

In response to your question:

Note that Amazon does not allow automated access to its data, and you can double-check that this is what is happening by inspecting the response via r.status_code. The block leads to this error message:

To discuss automated access to Amazon data please contact api-services-support@amazon.com
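The status-code check mentioned above can be sketched with a small helper; `looks_blocked` is a hypothetical function written for this answer, not part of the `requests` API:

```python
def looks_blocked(status_code, body):
    """Return True if the response looks like Amazon's automated-access block.

    Amazon typically answers blocked scrapers with a 503 status and/or an
    error page containing the contact message quoted above.
    """
    if status_code == 503:
        return True
    return "To discuss automated access to Amazon data" in body


# Typical usage (network call, shown for context only):
# r = requests.get(url, headers=headers)
# if looks_blocked(r.status_code, r.text):
#     ...back off, or switch to the API / proxies...
```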

So you can either use the Amazon API, or pass a list of proxies to the GET request via the proxies parameter.

Here is the correct way to pass headers to Amazon without getting blocked, and it works:

import requests
from bs4 import BeautifulSoup

headers = {
    'Host': 'www.amazon.fr',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'TE': 'Trailers'
}

for page in range(1, 21):
    r = requests.get(
        f'https://www.amazon.fr/s?k=phone+case&page={page}&ref=sr_pg_{page}',
        headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    for link in soup.find_all('a', attrs={'class': 'a-link-normal a-text-normal'}):
        print(f"https://www.amazon.fr{link.get('href')}")

