How to get a var value from a script tag in Python



In a given .html page, I have a script tag like this:

<script>
some data 
</script>
<body>
some data
</body>
<script>
var breadcrumbData = {"level":0,"currentCategoryName":"Kebutuhan Dapur","currentCategoryId":"5b85712ca3834cdebbbc4363","parentCategoryId":"","parentCategoryName":null}; 
var pageList = {"totalData":549,"totalPage":12,"pageSize":48,"currentPage":1}; 
var pageSize = 48;
</script>

I'm trying to retrieve the total page count with Beautiful Soup.

My code is as follows:

pattern = re.compile(r'"totalPage":(\d+);', re.MULTILINE | re.DOTALL)
scripts = soup.find_all('script', text=pattern)
if scripts:
    match = pattern.search(scripts.text)
    print(match)

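The code above returns an empty list, whereas I just need the number 12 returned as a number. Please help.

There are several ways to extract the number:

1. Using plain re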

import re
from bs4 import BeautifulSoup

html_doc = """
<script>
some data 
</script>
<body>
some data
</body>
<script>
var breadcrumbData = {"level":0,"currentCategoryName":"Kebutuhan Dapur","currentCategoryId":"5b85712ca3834cdebbbc4363","parentCategoryId":"","parentCategoryName":null}; 
var pageList = {"totalData":549,"totalPage":12,"pageSize":48,"currentPage":1}; 
var pageSize = 48;
</script>"""
soup = BeautifulSoup(html_doc, "html.parser")
script = soup.find("script", text=lambda t: t and "totalPage" in t)
print(re.search(r"totalPage\D+(\d+)", script.text).group(1))

Prints:

12
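
As a side note on why the pattern in the question finds nothing: it requires a semicolon immediately after the digits, but in the page the number is followed by a comma, so find_all returns an empty list. (Separately, find_all returns a list, so .text has to be read from a single element rather than from the result itself.) A minimal sketch using only the re module; the script_text string is one line copied from the question, and the int() conversion is my addition to get a number rather than a string:

```python
import re

# One line of the script body, copied from the question.
script_text = 'var pageList = {"totalData":549,"totalPage":12,"pageSize":48,"currentPage":1};'

# Pattern from the question: expects ';' right after the digits, so it never matches.
strict = re.compile(r'"totalPage":(\d+);')
# Relaxed pattern: stop at the digits.
relaxed = re.compile(r'"totalPage":(\d+)')

no_match = strict.search(script_text)
match = relaxed.search(script_text)
total_pages = int(match.group(1))  # convert the captured string to an int

print(no_match)     # None
print(total_pages)  # 12
```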

2. Using js2py

import js2py
script = soup.find("script", text=lambda t: t and "totalPage" in t)
s = "function $() {" + script.text + " return pageList;}"
print(js2py.eval_js(s)()["totalPage"])

Prints:

12

3. Using re/json

import re
import json
script = soup.find("script", text=lambda t: t and "totalPage" in t)
n = json.loads(re.search(r"pageList = (.*);", script.text).group(1))[
    "totalPage"
]
print(n)

Prints:

12
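
If you need more than one value, the re/json idea can be wrapped in a small helper. This is only a sketch: the name extract_js_object is hypothetical, and it assumes the variable is assigned a JSON-compatible object literal ending in "};" with no "};" inside any string value, as in the page above.

```python
import json
import re


def extract_js_object(script_text, var_name):
    """Return the object assigned to `var <var_name> = {...};` as a dict, or None."""
    # Non-greedy match up to the first '}' followed by ';', so later
    # statements in the same script are not captured.
    pattern = re.compile(
        r"var\s+" + re.escape(var_name) + r"\s*=\s*(\{.*?\})\s*;", re.DOTALL
    )
    match = pattern.search(script_text)
    if match is None:
        return None
    return json.loads(match.group(1))


# Script body copied from the question.
script_text = """
var breadcrumbData = {"level":0,"currentCategoryName":"Kebutuhan Dapur","currentCategoryId":"5b85712ca3834cdebbbc4363","parentCategoryId":"","parentCategoryName":null};
var pageList = {"totalData":549,"totalPage":12,"pageSize":48,"currentPage":1};
var pageSize = 48;
"""

page_list = extract_js_object(script_text, "pageList")
print(page_list["totalPage"])  # 12
```

With Beautiful Soup, script_text would be script.text from the soup.find(...) call used in the approaches above.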
