Amazon scraping - only works sometimes



I am scraping data from Amazon for educational purposes, and I am having some issues with cookies and the antibot. I manage to scrape the data, but sometimes the cookies are not in the response, or the antibot flags me.

I have already tried using a list of random headers like this:

headers_list = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-User": "?1",
        "TE": "trailers"
    },
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "fr-FR,fr;q=0.7",
        "cache-control": "max-age=0",
        "content-type": "application/x-www-form-urlencoded",
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1"
    },
]

and putting the following in my code:

import random
import requests

headers = random.choice(headers_list)
with requests.Session() as s:
    res = s.get(url, headers=headers)
    if not res.cookies:
        print("Error getting cookies")
        raise SystemExit(1)

But this does not solve the problem: I still sometimes get no cookies in the response, and I still get detected.

This is how I scrape the data:

from bs4 import BeautifulSoup

post = s.post(url, data=login_data, headers=headers, cookies=cookies, allow_redirects=True)
soup = BeautifulSoup(post.text, 'html.parser')
# Parentheses make the multi-line condition valid Python, and checking the tag
# itself (instead of tag['value']) avoids a TypeError when an input is missing.
if (soup.find('input', {'name': 'appActionToken'}) is not None
        and soup.find('input', {'name': 'appAction'}) is not None
        and soup.find('input', {'name': 'subPageType'}) is not None
        and soup.find('input', {'name': 'openid.return_to'}) is not None
        and soup.find('input', {'name': 'prevRID'}) is not None
        and soup.find('input', {'name': 'workflowState'}) is not None
        and soup.find('input', {'name': 'email'}) is not None):
    print("found")
else:
    print("not found")
    raise SystemExit(1)

But when the antibot detects me, this content is not available, which throws an error. Is there any way to prevent that? Thanks!

You can call time.sleep(10) before each scraping request. That makes it harder for Amazon to catch you, but if you send too many requests at perfectly regular intervals, they may still notice and block them.
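
A minimal sketch of that idea, using a random delay instead of a fixed sleep(10); the polite_get helper, the 5-12 second range, and the single header dict are just examples:

import random
import time

import requests

# Example product URL and header; reuse your own list in practice.
url = "https://www.amazon.com/Storage-Stackable-Organizer-Foldable-Containers/dp/B097PVKRYM/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"}

def polite_get(session, url, headers):
    # Wait a random interval before each request so the traffic pattern
    # is less regular than a fixed time.sleep(10).
    time.sleep(random.uniform(5, 12))
    return session.get(url, headers=headers)

with requests.Session() as s:
    res = polite_get(s, url, headers)
    print(res.status_code)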

  • Rotate the request headers with a random user agent (update your headers list with more user agents)

  • Strip everything after /dp/ASIN/ (the tracking parameters) from the product URL

    For example, after removing the tracking parameters your URL would look like this: https://www.amazon.com/Storage-Stackable-Organizer-Foldable-Containers/dp/B097PVKRYM/

  • Add a small random sleep between requests (using time.sleep())

  • Use proxies with your requests (you can use Tor proxies, or another paid proxy service if they block Tor); see the sketch after this list
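
Putting those suggestions together, a rough sketch (assuming requests is installed with the requests[socks] extra so it can use a SOCKS proxy; the local Tor address and the clean_product_url helper are only examples, not a definitive setup):

import random
import re

import requests

headers_list = [
    {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"},
    {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"},
]

def clean_product_url(url):
    # Keep only the part up to and including /dp/<ASIN>, dropping tracking parameters.
    m = re.match(r"(https://www\.amazon\.[^/]+/[^?]*?/dp/[A-Z0-9]{10})", url)
    return m.group(1) + "/" if m else url

proxies = {
    # Local Tor SOCKS proxy as an example; swap in a paid proxy service if Tor is blocked.
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

url = clean_product_url(
    "https://www.amazon.com/Storage-Stackable-Organizer-Foldable-Containers/dp/B097PVKRYM/"
)

with requests.Session() as s:
    res = s.get(url, headers=random.choice(headers_list), proxies=proxies, timeout=30)
    print(res.status_code, bool(res.cookies))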
