Error scraping Amazon data with Beautiful Soup: object returns None



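No matter what I do, the lookup for the Amazon element id returns None. As an experiment, I tried this exact code with an eBay element id and it worked. What is different about Amazon? I have also tried changing html.parser to lxml, but it still fails with: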

AttributeError: 'NoneType' object has no attribute 'get_text'

The problem is in the getPrice() function:

import requests
from bs4 import BeautifulSoup
import time
import smtplib

URL = 'https://www.lego.com/en-us/product/darth-vader-s-castle-75251'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36'}
wanted = 80
email = "help@gmail.com"
password = 'password'
Server_name = 'mail.gmail.com'

MAIL_USE_SSL=True
def sendMail():
    subject = 'Ebay Price has Dropped!!'
    # A blank line must separate the Subject header from the message body
    mailtext = "Subject:" + subject + "\n\n" + URL
    server = smtplib.SMTP(host='smtp.gmail.com', port=587)
    server.ehlo()
    server.starttls()
    server.login(email, password)
    server.sendmail(email, email, mailtext)
    print("Sent Email")


def trackPrice():
    price = getPrice()
    if price > wanted:
        diff = (price - wanted)
        diff = round(diff, 5)
        print(f"it's still ${diff} over-priced")
    else:
        print('cheaper')
        sendMail()

def getPrice():
    page = requests.get(URL, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    # This line raises the AttributeError when find() returns None
    price = soup.find(id="priceblock_ourprice").get_text().strip()[4:]
    price = float(price)
    print(price)
    return price



if __name__ == "__main__":
    while True:
        trackPrice()
        time.sleep(100)

Assuming your actual URL is something like:

URL = "https://www.amazon.com/LEGO-Vaders-Castle-Building-Pieces/dp/B07J6F8H3M"

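Then, if you print the soup variable, you will see that Amazon detected that you were trying to scrape their page and served an error page instead, since the content begins with: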

<!--
To discuss automated access to Amazon data please contact api-services-support@amazon.com.
For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
-->
...
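
One way to notice this situation in code is to look for the notice quoted above before parsing anything. The following is only a minimal sketch, not an official detection method: the helper name looks_blocked and the marker string (copied from the comment above) are assumptions, and Amazon may change the wording at any time.

```python
# Hypothetical helper: the marker text is taken from the error page quoted above.
BLOCK_MARKER = "To discuss automated access to Amazon data"


def looks_blocked(html):
    """Return True if the HTML looks like Amazon's automated-access notice."""
    if isinstance(html, bytes):
        # page.content from requests is bytes; decode leniently for the check
        html = html.decode("utf-8", errors="replace")
    return BLOCK_MARKER in html
```

For example, looks_blocked(page.content) could be called right after requests.get(...), before the response is handed to BeautifulSoup.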

The error page explains why no HTML tag with id="priceblock_ourprice" is found: find(...) returns None, and the subsequent call to get_text() fails.
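
A defensive version of the lookup can check the result of find(...) before calling get_text(). The sketch below is one possible approach rather than the only one: extract_price is a hypothetical helper, it expects an already-parsed BeautifulSoup object, and the regular expression assumes the price text looks like "$79.99".

```python
import re


def extract_price(soup, element_id="priceblock_ourprice"):
    """Return the price as a float, or None if the element or a number is missing.

    `soup` is expected to be a parsed BeautifulSoup object (assumption).
    """
    tag = soup.find(id=element_id)
    if tag is None:
        # Element not present, e.g. because an error page was served instead
        return None
    text = tag.get_text().strip()
    # Assumed price format: digits with optional thousands separators and decimals
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))
```

With a helper like this, getPrice() could return None for a failed lookup, and the caller could skip the comparison for that round instead of crashing.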
