I have just started using BeautifulSoup.
Here is my current code:
import requests, json
from bs4 import BeautifulSoup
headers = {'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'}
s = requests.Session()
res = s.get("https://www.myntra.com/jordan", headers=headers, verify=False)
src = res.content
soup = BeautifulSoup(src, 'lxml')
links = soup.find_all("a")
urls = []
for div in soup.find_all("div", attrs={'id':"mountRoot"}):
    print(div)
    print("\n")
    for div_tag in div.find_all('div'):
        print(div_tag)
        embedded_div = div_tag.find('div')
        print(embedded_div)
The output of this code:
<div id="mountRoot" style="min-height:750px;margin-top:-2px"><div class="loader-container"><div class="spinner-spinner"></div></div></div>
<div class="loader-container"><div class="spinner-spinner"></div></div>
<div class="spinner-spinner"></div>
<div class="spinner-spinner"></div>
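Here is the inspect-element view of the site I am looking at: https://i.stack.imgur.com/zui3R.png
It looks to me as though BeautifulSoup is ignoring the elements that appear in the inspector.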
It seems the first rows are embedded in the page in a script tag with the attribute type="application/ld+json", like this:
<script type="application/ld+json">{ some big json here }</script>
You can get the data by selecting the JSON block whose "@type" key is "ItemList", and then read the items from it:
import requests, json
from bs4 import BeautifulSoup
headers = {'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'}
s = requests.Session()
res = s.get("https://www.myntra.com/jordan", headers=headers)
soup = BeautifulSoup(res.content, 'html.parser')
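# Parse every JSON-LD block embedded in the page
# (.string reads the raw script contents)
data_json = [
    json.loads(t.string)
    for t in soup.find_all("script", {"type": "application/ld+json"})
]
# Keep only the block that describes the product list
data = [
    t
    for t in data_json
    if isinstance(t, dict) and t.get("@type") == "ItemList"
]
print(data[0]["itemListElement"])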
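To make the filtering step easier to try without hitting the site, here is a minimal offline sketch. The sample HTML, the example.com URLs, and the helper name extract_item_urls are all made up for illustration, and it assumes each list entry carries a "url" field as schema.org ListItem entries usually do; the real page may be laid out differently.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical sample page; the real markup and field names may differ.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@type": "WebSite", "url": "https://example.com"}</script>
<script type="application/ld+json">
{"@type": "ItemList", "itemListElement": [
  {"@type": "ListItem", "position": 1, "url": "https://example.com/a"},
  {"@type": "ListItem", "position": 2, "url": "https://example.com/b"}
]}
</script>
</head><body><div id="mountRoot"></div></body></html>
"""

def extract_item_urls(html):
    """Return the "url" of every entry in the first ItemList JSON-LD block."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("script", {"type": "application/ld+json"}):
        try:
            data = json.loads(tag.string or "")
        except json.JSONDecodeError:
            continue  # skip blocks that are empty or not valid JSON
        if isinstance(data, dict) and data.get("@type") == "ItemList":
            return [item.get("url") for item in data.get("itemListElement", [])]
    return []  # no ItemList block found

print(extract_item_urls(SAMPLE_HTML))
```

Returning an empty list when no ItemList block is present avoids the IndexError that data[0] would raise on a page without one.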
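However, the embedded JSON only contains the first few rows. To get the paginated data, there is an API:
GET https://www.myntra.com/gateway/v2/search/jordan
The following code uses the API to fetch the first page: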
import requests
headers = {'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'}
s = requests.Session()
# Load the listing page first (presumably so the session collects the site's cookies)
s.get("https://www.myntra.com/jordan", headers=headers)
# First page
r = s.get(
    "https://www.myntra.com/gateway/v2/search/jordan",
    params={
        "p": "1",
        "rows": 50,
        "o": 0,
        "plaEnabled": "false",
    },
    headers=headers,
)
print(r.json())
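You need to increment p to move to the next page. In addition, o is the offset index; increase it by per_page - 1 each time. For example, if you have set "rows": 50, the second page uses "p": 2 and "o": 49.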
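The paging rule above can be sketched as a small helper. The name page_params is hypothetical, and the offset formula simply applies the "add per_page - 1 for each page" rule described here; I have not checked it against every page of the API.

```python
def page_params(page, rows=50):
    """Build the query parameters for a 1-based page number."""
    return {
        "p": page,
        "rows": rows,
        # The offset starts at 0 and grows by rows - 1 for every further page
        "o": (page - 1) * (rows - 1),
        "plaEnabled": "false",
    }

# Show the parameters for the first three pages
for page in range(1, 4):
    print(page_params(page))
```

Each dictionary can then be passed as params= to the same s.get(...) call used for the first page.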