Using BeautifulSoup to open product pages in separate tabs for search results entered into Amazon



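I'm very new to Python and to web scraping. I'm currently working through Al Sweigart's book Automate the Boring Stuff with Python, and one of the suggested practice tasks is basically to write a program that does the following:

  • Takes as input a product to search for on Amazon
  • Uses requests.get() and .text to get the HTML of that search page
  • Uses BeautifulSoup to search the HTML for the CSS selector that denotes links to product pages
  • Opens the first five products from the search results, each in a separate tab

Here is my code: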

#! python3
# Searches amazon for the inputted product (either through command line or input) and opens 5 tabs with the top 
# items for that search. 
import requests, sys, bs4, webbrowser
if len(sys.argv) > 1: # if there are system arguments
    res = requests.get('https://www.amazon.com/s?k=' + ''.join(sys.argv))
    res.raise_for_status
else: # take input
    print('what product would you like to search Amazon for?')
    product = str(input())
    res = requests.get('https://www.amazon.com/s?k=' + ''.join(product))
    res.raise_for_status

# retrieve top search links:
soup = bs4.BeautifulSoup(res.text, 'html.parser')

print(res.text) # TO CHECK HTML OF SITE, GET RID OF DURING ACTUAL PROGRAM
# open a new tab for the top 5 items, and get the css selector for links 
# a list of all things on the downloaded page that are within the css selector 'a-link-normal a-text-normal'
linkElems = soup.select('a-link-normal a-text-normal') 

numOpen = min(5, len(linkElems))
for i in range(numOpen):
    urlToOpen = 'https://www.amazon.com/' + linkElems[i].get('href')
    print('Opening', urlToOpen)
    webbrowser.open(urlToOpen)

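I think I've chosen the right CSS selector ("a-link-normal a-text-normal"), so I think the problem lies with res.text: when I print it to see what it looks like, the HTML seems incomplete, or at least it doesn't contain the actual HTML I see when I use Inspect Element on the same page in Chrome. Also, nothing in that HTML contains anything like "a-link-normal a-text-normal".

As an example, this is what res.text looks like for a search for "big pencil":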

what product would you like to search Amazon for?
big pencil
<!--
To discuss automated access to Amazon data please contact api-services-support@amazon.com.
For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
-->
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="x-ua-compatible" content="ie=edge">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<title>Sorry! Something went wrong!</title>
<style>
html, body {
padding: 0;
margin: 0
}
img {
border: 0
}
#a {
background: #232f3e;
padding: 11px 11px 11px 192px
}
#b {
position: absolute;
left: 22px;
top: 12px
}
#c {
position: relative;
max-width: 800px;
padding: 0 40px 0 0
}
#e, #f {
height: 35px;
border: 0;
font-size: 1em
}
#e {
width: 100%;
margin: 0;
padding: 0 10px;
border-radius: 4px 0 0 4px
}
#f {
cursor: pointer;
background: #febd69;
font-weight: bold;
border-radius: 0 4px 4px 0;
-webkit-appearance: none;
position: absolute;
top: 0;
right: 0;
padding: 0 12px
}
@media (max-width: 500px) {
#a {
padding: 55px 10px 10px
}
#b {
left: 6px
}
}
#g {
text-align: center;
margin: 30px 0
}
#g img {
max-width: 90%
}
#d {
display: none
}
#d[src] {
display: inline
}
</style>
</head>
<body>
<a href="/ref=cs_503_logo"><img id="b" src="https://images-na.ssl-images-amazon.com/images/G/01/error/logo._TTD_.png" alt="Amazon.com"></a>
<form id="a" accept-charset="utf-8" action="/s" method="GET" role="search">
<div id="c">
<input id="e" name="field-keywords" placeholder="Search">
<input name="ref" type="hidden" value="cs_503_search">
<input id="f" type="submit" value="Go">
</div>
</form>
<div id="g">
<div><a href="/ref=cs_503_link"><img src="https://images-na.ssl-images-amazon.com/images/G/01/error/500_503.png"
alt="Sorry! Something went wrong on our end. Please go back and try again or go to Amazon's home page."></a>
</div>
<a href="/dogsofamazon/ref=cs_503_d" target="_blank" rel="noopener noreferrer"><img id="d" alt="Dogs of Amazon"></a>
<script>document.getElementById("d").src = "https://images-na.ssl-images-amazon.com/images/G/01/error/" + (Math.floor(Math.random() * 43) + 1) + "._TTD_.jpg";</script>
</div>
</body>
</html>

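Thank you very much for your patience.

This is a classic case: if you try to scrape the site directly with a parser such as BeautifulSoup, you won't find anything.

The way the site works is that an initial block of code is first downloaded to the browser, the same block you posted for big pencil, and the rest of the elements on the page are then loaded through JavaScript.

You need to use Selenium WebDriver to load the page first and then grab the code from the browser. In ordinary terms, this is equivalent to opening the browser's console, going to the Elements tab, and looking for the classes you mentioned.

To see the difference, I suggest viewing the page's source and comparing it with the code in the Elements tab.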
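The same comparison can be made from Python. The following is only a minimal sketch using the standard library's html.parser, not something from the question or the book: PageSummary and summarize are hypothetical names, and the two class names are the ones quoted in the question, which Amazon may have changed since. It reports the page title and how many <a> tags carry both classes, so it can be run once on res.text and once on the HTML taken from the browser.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collect the <title> text and count <a> tags that carry all wanted classes."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.title = ''
        self.matching_links = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True
        elif tag == 'a':
            # The class attribute is a space-separated list; it may be missing.
            classes = set((dict(attrs).get('class') or '').split())
            if self.wanted <= classes:
                self.matching_links += 1

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(html, wanted_classes=('a-link-normal', 'a-text-normal')):
    """Return (title, number of matching links) for an HTML string."""
    parser = PageSummary(wanted_classes)
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.matching_links
```

On the HTML pasted in the question, this would report the title "Sorry! Something went wrong!" and zero matching links. With BeautifulSoup itself, classes are written with leading dots in a selector, for example soup.select('a.a-link-normal.a-text-normal'); the string 'a-link-normal a-text-normal' is read as two tag names rather than two classes. Note also that res.raise_for_status without parentheses only references the method; res.raise_for_status() is what actually raises on an error status.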

Here, you need to use BS4 on the data that has been loaded in the browser:

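import bs4
from selenium import webdriver
url = 'https://www.amazon.com/s?k=big+pencil'  # example value (an assumption): the search page from the question
# Chromedriver opens a new browser instance for you. Passing the path like this is the older Selenium call style; newer releases may differ. More info in the docs
browser = webdriver.Chrome("path_to_chromedriver")
browser.get(url)  # Fetch the URL in the browser
# Now load the rendered page into BS4 and go on with extracting the elements and so on
soup = bs4.BeautifulSoup(browser.page_source, 'html.parser')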

This is very basic code for getting to grips with Selenium; in a production use case, however, you may want to use a headless browser such as PhantomJS.
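Separately from the rendering issue, the URL handling in the question's script could be made a little more robust. ''.join(sys.argv) glues the script name and every argument together with no separator, and prefixing 'https://www.amazon.com/' to an href that already starts with a slash (as the hrefs in the pasted page do) produces a double slash. A minimal sketch with the standard library follows; build_search_url and absolute_links are hypothetical helper names, and the s?k= URL shape is taken from the question rather than from any documented Amazon API.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = 'https://www.amazon.com/'


def build_search_url(words):
    """Join the search words with spaces and URL-encode them as the k parameter."""
    return urljoin(BASE_URL, 's') + '?' + urlencode({'k': ' '.join(words)})


def absolute_links(hrefs, limit=5):
    """Resolve at most `limit` hrefs against the site root."""
    return [urljoin(BASE_URL, href) for href in hrefs[:limit]]
```

In the script, this would be called as build_search_url(sys.argv[1:]) for command-line input or build_search_url(product.split()) for typed input, and the list from absolute_links could be passed one item at a time to webbrowser.open.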

References:

  • ChromeDriver
  • Selenium with Python
