python3: bs4 fails on some websites



I am learning Python and bs4.

Following some suggestions and a number of websites, I wrote this script:

import requests as rq
from bs4 import BeautifulSoup

header = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0'}

def get_price(site):
    # Download the page and parse it with Python's built-in parser
    html = rq.get(site, headers=header).text
    soup = BeautifulSoup(html, 'html.parser')
    try:
        # find() returns None when the id is missing, so get_text() raises
        price = soup.find(id="priceblock_ourprice").get_text()
        print(site)
        print(price)
    except:
        print(site)
        print("failed")

sites = ["https://www.amazon.in/Apple-iPhone-11-64GB-Green/dp/B07XVKBY68/ref=sr_1_7?keywords=iphone+11&qid=1573668357&sr=8-7",
         "https://www.amazon.it/Apple-iPhone-64GB-Verde-Ricondizionato/dp/B082DN72G3/ref=sr_1_19?__mk_it_IT=%C3%85M%C3%85%C5%BD%C3%95%C3%91&dchild=1&keywords=iphone+11&qid=1601755114&sr=8-19",
         "https://www.amazon.it/Apple-iPhone-11-128GB-Verde/dp/B07XS5MSW4/ref=sr_1_1_sspa?__mk_it_IT=%C3%85M%C3%85%C5%BD%C3%95%C3%91&dchild=1&keywords=iphone+11&qid=1601755114&sr=8-1-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUExNlhGMElFNUhJMTBJJmVuY3J5cHRlZElkPUEwMTI2OTMxMVpXWEtHQ1o5S0ZENCZlbmNyeXB0ZWRBZElkPUEwOTMyMTczMVdMMzlQOTRPTUE3SCZ3aWRnZXROYW1lPXNwX2F0ZiZhY3Rpb249Y2xpY2tSZWRpcmVjdCZkb05vdExvZ0NsaWNrPXRydWU="]

for site in sites:
    get_price(site)
    print("\n")

I run it and get this:

https://www.amazon.in/Apple-iPhone-11-64GB-Green/dp/B07XVKBY68/ref=sr_1_7?keywords=iphone+11&qid=1573668357&sr=8-7
₹ 64,499.00
https://www.amazon.it/Apple-iPhone-64GB-Verde-Ricondizionato/dp/B082DN72G3/ref=sr_1_19?__mk_it_IT=%C3%85M%C3%85%C5%BD%C3%95%C3%91&dchild=1&keywords=iphone+11&qid=1601755114&sr=8-19
failed
https://www.amazon.it/Apple-iPhone-11-128GB-Verde/dp/B07XS5MSW4/ref=sr_1_1_sspa?__mk_it_IT=%C3%85M%C3%85%C5%BD%C3%95%C3%91&dchild=1&keywords=iphone+11&qid=1601755114&sr=8-1-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUExNlhGMElFNUhJMTBJJmVuY3J5cHRlZElkPUEwMTI2OTMxMVpXWEtHQ1o5S0ZENCZlbmNyeXB0ZWRBZElkPUEwOTMyMTczMVdMMzlQOTRPTUE3SCZ3aWRnZXROYW1lPXNwX2F0ZiZhY3Rpb249Y2xpY2tSZWRpcmVjdCZkb05vdExvZ0NsaWNrPXRydWU=
749,00 €

I don't understand why the second site doesn't work.

The string priceblock_ourprice is present in the page:

$ wget -q -O - 'https://www.amazon.it/Apple-iPhone-64GB-Verde-Ricondizionato/dp/B082DN72G3/ref=sr_1_19?__mk_it_IT=%C3%85M%C3%85%C5%BD%C3%95%C3%91&dchild=1&keywords=iphone+11&qid=1601755114&sr=8-19' 2>&1 | grep "priceblock_ourprice"
<span id="priceblock_ourprice" class="a-size-medium a-color-price priceBlockBuyingPriceString">629,00 €</span>

The problem is that html.parser does not correctly parse the HTML that Amazon serves. The solution is to use the lxml or html5lib parser instead:

import requests as rq
from bs4 import BeautifulSoup

header = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0'}

def get_price(site):
    html = rq.get(site, headers=header).text
    soup = BeautifulSoup(html, 'lxml')      # <--- use 'lxml' or 'html5lib' parser
    try:
        # find() returns None when the id is missing, so get_text() raises
        price = soup.find(id="priceblock_ourprice").get_text()
        print(site)
        print(price)
    except:
        print(site)
        print("failed")

sites = ["https://www.amazon.in/Apple-iPhone-11-64GB-Green/dp/B07XVKBY68/ref=sr_1_7?keywords=iphone+11&qid=1573668357&sr=8-7",
         "https://www.amazon.it/Apple-iPhone-64GB-Verde-Ricondizionato/dp/B082DN72G3/ref=sr_1_19?__mk_it_IT=%C3%85M%C3%85%C5%BD%C3%95%C3%91&dchild=1&keywords=iphone+11&qid=1601755114&sr=8-19",
         "https://www.amazon.it/Apple-iPhone-11-128GB-Verde/dp/B07XS5MSW4/ref=sr_1_1_sspa?__mk_it_IT=%C3%85M%C3%85%C5%BD%C3%95%C3%91&dchild=1&keywords=iphone+11&qid=1601755114&sr=8-1-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUExNlhGMElFNUhJMTBJJmVuY3J5cHRlZElkPUEwMTI2OTMxMVpXWEtHQ1o5S0ZENCZlbmNyeXB0ZWRBZElkPUEwOTMyMTczMVdMMzlQOTRPTUE3SCZ3aWRnZXROYW1lPXNwX2F0ZiZhY3Rpb249Y2xpY2tSZWRpcmVjdCZkb05vdExvZ0NsaWNrPXRydWU="]

for site in sites:
    get_price(site)
    print("\n")

Prints:

https://www.amazon.in/Apple-iPhone-11-64GB-Green/dp/B07XVKBY68/ref=sr_1_7?keywords=iphone+11&qid=1573668357&sr=8-7
₹ 64,499.00

https://www.amazon.it/Apple-iPhone-64GB-Verde-Ricondizionato/dp/B082DN72G3/ref=sr_1_19?__mk_it_IT=%C3%85M%C3%85%C5%BD%C3%95%C3%91&dchild=1&keywords=iphone+11&qid=1601755114&sr=8-19
744,89 €

https://www.amazon.it/Apple-iPhone-11-128GB-Verde/dp/B07XS5MSW4/ref=sr_1_1_sspa?__mk_it_IT=%C3%85M%C3%85%C5%BD%C3%95%C3%91&dchild=1&keywords=iphone+11&qid=1601755114&sr=8-1-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUExNlhGMElFNUhJMTBJJmVuY3J5cHRlZElkPUEwMTI2OTMxMVpXWEtHQ1o5S0ZENCZlbmNyeXB0ZWRBZElkPUEwOTMyMTczMVdMMzlQOTRPTUE3SCZ3aWRnZXROYW1lPXNwX2F0ZiZhY3Rpb249Y2xpY2tSZWRpcmVjdCZkb05vdExvZ0NsaWNrPXRydWU=
749,00 €
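
Because the result depends on which parser bs4 uses, it can help to try several parsers on the same HTML and report which one actually finds the element. The following is only a minimal sketch of that idea, not part of the original script: the helper name find_price, the parser order, and the inline sample HTML are assumptions made for illustration. It parses a string rather than fetching a page, so it can be run without network access, and it skips any parser that is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Assumed preference order: the two parsers suggested above, then the built-in one.
PARSERS = ("lxml", "html5lib", "html.parser")


def find_price(html, element_id="priceblock_ourprice", parsers=PARSERS):
    """Return (parser_name, text) for the first parser that finds element_id.

    Returns (None, None) if no available parser finds the element.
    """
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser is not installed.
            continue
        tag = soup.find(id=element_id)
        if tag is not None:
            return name, tag.get_text(strip=True)
    return None, None


# Hypothetical sample modelled on the <span> shown in the grep output above.
sample = ('<html><body><span id="priceblock_ourprice" '
          'class="a-size-medium a-color-price">629,00 €</span></body></html>')

parser_used, price = find_price(sample)
print(parser_used, price)
```

In get_price, the same helper could be called with the downloaded html in place of sample. Returning the parser name alongside the price makes it visible when a fallback parser was the one that succeeded.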
