How to find the script tag of a JS-loaded page with BeautifulSoup



I am using this code to get the contents of this page https://www.walmart.com/cp/976759

import requests, json
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36'}
res = requests.get("https://www.walmart.com/cp/976759", headers=headers)
soup = BeautifulSoup(res.text, "html.parser")
script = soup.find("script", {"id":"category"})
data = json.loads(script.get_text(strip=True))
with open("data.json", "w") as f:
    json.dump(data, f)

The full data is stored in a script tag whose id is category, as described in this answer: lxml webscraping returning empty values.

I have more pages to fetch, and they also appear to be loaded via JavaScript. What is the way to determine the id of the script tag that stores a site's data? For example, how would I determine the script tag ids for these links

https://www.walmart.com/cp/coffee/1086446?povid=976759+%7C+2018-12-26+%7C+Food%20Coffee%20Shop%20by%20Category%20Tile%201

and this one

https://www.walmart.com/browse/food/coffee/976759_1086446_1229654?povid=1086446+%7C++%7C+Coffee%20Bottle%20Coffee%20Featured%20Categories%20Collapsible

You can use a regular expression to match attributes, and you can also use one to exclude attributes. I noticed that the script tags you are looking for are all of type application/json, so that was the first filter I built: soup.find_all('script', {'type': 'application/json'}). Next, there are a few tags whose ids start with tb-djs-wlm, which refer to several images. I exclude those with the regular expression re.compile(r'^((?!tb-djs).)*$').
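
To see what that negative-lookahead pattern does on its own, here is a small standalone check (the tb-djs-wlm id below is made up for illustration):

import re
pattern = re.compile(r'^((?!tb-djs).)*$')  # matches only ids that do NOT contain "tb-djs"
for sample_id in ['category', 'searchContent', 'tb-djs-wlm-hero']:
    print(sample_id, bool(pattern.match(sample_id)))
# category True
# searchContent True
# tb-djs-wlm-hero False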

So, now we have:

from bs4 import BeautifulSoup
import requests
import re
session = requests.Session()
# your test urls
url1 = 'https://www.walmart.com/cp/coffee/1086446?povid=976759+%7C+2018-12-26+%7C+Food%20Coffee%20Shop%20by%20Category%20Tile%201'
url2 = 'https://www.walmart.com/browse/food/coffee/976759_1086446_1229654?povid=1086446+%7C++%7C+Coffee%20Bottle%20Coffee%20Featured%20Categories%20Collapsible'
url3 = 'https://www.walmart.com/cp/976759'
urls = [url1, url2, url3]
def find_tag(soup):
    script = soup.find('script', {'type': 'application/json', 'id': re.compile(r'^((?!tb-djs).)*$')})
    return script['id']

for url in urls:
    soup = BeautifulSoup(session.get(url).text, 'html.parser')
    print(find_tag(soup))
# category
# searchContent
# category

To get the contents of the script, you can use the json library with the bs4 tag element; just load it with json.loads(script_soup.text).
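
Putting the pieces together, a minimal sketch of the whole flow might look like this (find_data_script is just a variant of find_tag above that returns the tag itself instead of its id, and the output file names are placeholders):

import json
import re
import requests
from bs4 import BeautifulSoup

session = requests.Session()
urls = ['https://www.walmart.com/cp/976759']  # or any of the URLs above
# the User-Agent header from the question may be needed if the default one gets blocked

def find_data_script(soup):
    # same filter as above, but return the tag itself so its text can be parsed
    return soup.find('script', {'type': 'application/json', 'id': re.compile(r'^((?!tb-djs).)*$')})

for url in urls:
    soup = BeautifulSoup(session.get(url).text, 'html.parser')
    script = find_data_script(soup)
    data = json.loads(script.get_text(strip=True))  # parse the embedded JSON
    with open(script['id'] + '.json', 'w') as f:    # e.g. category.json
        json.dump(data, f)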
