BeautifulSoup can't find a div nested inside another div



I've just started using BeautifulSoup.

This is my current code:

import requests, json
from bs4 import BeautifulSoup

headers = {'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'}
s = requests.Session()
res = s.get("https://www.myntra.com/jordan", headers=headers, verify=False)
src = res.content
soup = BeautifulSoup(src, 'lxml')
links = soup.find_all("a")
urls = []
for div in soup.find_all("div", attrs={'id': "mountRoot"}):
    print(div)
    print("\n")
    for div_tag in div.find_all('div'):
        print(div_tag)
        embedded_div = div_tag.find('div')
        print(embedded_div)

Output of this code:

<div id="mountRoot" style="min-height:750px;margin-top:-2px"><div class="loader-container"><div class="spinner-spinner"></div></div></div>
<div class="loader-container"><div class="spinner-spinner"></div></div>
<div class="spinner-spinner"></div>
<div class="spinner-spinner"></div>

Here is the inspect-element view of the site I'm looking at: https://i.stack.imgur.com/zui3R.png

It looks to me as though the nested divs are being ignored.

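The HTML that requests receives contains only the loading spinner inside mountRoot, which suggests the product grid is filled in later by JavaScript. It seems the first rows are cached in the page itself, in a script tag with the attribute type="application/ld+json", like this: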

<script type="application/ld+json">{ some big json here }</script>

You can get the data by selecting the JSON object whose @type key is "ItemList", then reading its items:

import json

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'}
s = requests.Session()
res = s.get("https://www.myntra.com/jordan", headers=headers)
soup = BeautifulSoup(res.content, 'html.parser')

# Parse every JSON-LD <script> block embedded in the page
data_json = [
    json.loads(t.string)
    for t in soup.find_all("script", {"type": "application/ld+json"})
]
# Keep only the block that describes the product list
data = [
    t
    for t in data_json
    if t.get("@type") == "ItemList"
]
print(data[0]["itemListElement"])
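
The filtering step above assumes every JSON-LD script holds a single well-formed object. As a rough sketch of a more defensive version, the helper below (find_item_lists is a hypothetical name, and the sample strings are made up for illustration, not taken from the site) skips malformed blocks and also accepts a top-level list of objects:

```python
import json


def find_item_lists(raw_blocks):
    """Return every JSON-LD object whose @type is "ItemList".

    raw_blocks is an iterable of script bodies (strings, possibly None).
    """
    found = []
    for raw in raw_blocks:
        try:
            payload = json.loads(raw)
        except (TypeError, ValueError):
            # Skip empty (None) or malformed script bodies
            continue
        # A JSON-LD script may hold one object or a list of objects
        candidates = payload if isinstance(payload, list) else [payload]
        for obj in candidates:
            if isinstance(obj, dict) and obj.get("@type") == "ItemList":
                found.append(obj)
    return found


# Made-up sample data standing in for the script bodies on a real page
sample = [
    '{"@type": "Organization", "name": "Example"}',
    '{"@type": "ItemList", "itemListElement": [{"position": 1, "url": "https://example.com/a"}]}',
    'not json',
]
item_lists = find_item_lists(sample)
print(item_lists[0]["itemListElement"])
```

With the soup from the answer, the input could be built as (t.string for t in soup.find_all("script", {"type": "application/ld+json"})).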

However, this only prints a few rows. To get the paginated data, there is an API:

GET https://www.myntra.com/gateway/v2/search/jordan

The code below uses the API to fetch the first page:

import requests

headers = {'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'}
s = requests.Session()
# Load the listing page first (presumably so the session picks up the site's cookies)
s.get("https://www.myntra.com/jordan", headers=headers)

# first page
r = s.get(
    "https://www.myntra.com/gateway/v2/search/jordan",
    params={
        "p": "1",
        "rows": 50,
        "o": 0,
        "plaEnabled": "false",
    },
    headers=headers,
)
print(r.json())

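You need to increment p to move to the next page. In addition, o is the offset index; increase it by per_page - 1 each time. For example, if you set "rows": 50, the second page will have "o": 49.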
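
The pagination rule can be sketched as a small helper that builds the query parameters for a given page. page_params is a hypothetical name, and the offset formula (page - 1) * (rows - 1) is extrapolated from the rule stated above; it has not been checked against the API beyond the second-page example.

```python
def page_params(page, rows=50):
    """Build query parameters for a 1-based page number.

    Assumes the offset grows by rows - 1 per page, as described above.
    """
    if page < 1:
        raise ValueError("page must be 1 or greater")
    return {
        "p": str(page),
        "rows": rows,
        "o": (page - 1) * (rows - 1),  # 0 for page 1, 49 for page 2 when rows is 50
        "plaEnabled": "false",
    }


for page in (1, 2, 3):
    print(page_params(page))
```

Each dictionary could then be passed as params to s.get("https://www.myntra.com/gateway/v2/search/jordan", ...) in place of the hard-coded first-page values.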
