In a given .html page, I have script tags like the following:
<script>
some data
</script>
<body>
some data
</body>
<script>
var breadcrumbData = {"level":0,"currentCategoryName":"Kebutuhan Dapur","currentCategoryId":"5b85712ca3834cdebbbc4363","parentCategoryId":"","parentCategoryName":null};
var pageList = {"totalData":549,"totalPage":12,"pageSize":48,"currentPage":1};
var pageSize = 48;
</script>
I am trying to retrieve the total page count using BeautifulSoup.
My code so far is as follows:
pattern = re.compile(r'"totalPage":(\d+);', re.MULTILINE | re.DOTALL)
scripts = soup.find_all('script', text=pattern)
if scripts:
    match = pattern.search(scripts.text)
    print(match)
The code above returns an empty list, whereas I just need the number 12 returned as a number. Please help.
There are several ways to extract the number:
1. Using plain re
import re
from bs4 import BeautifulSoup
html_doc = """
<script>
some data
</script>
<body>
some data
</body>
<script>
var breadcrumbData = {"level":0,"currentCategoryName":"Kebutuhan Dapur","currentCategoryId":"5b85712ca3834cdebbbc4363","parentCategoryId":"","parentCategoryName":null};
var pageList = {"totalData":549,"totalPage":12,"pageSize":48,"currentPage":1};
var pageSize = 48;
</script>"""
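soup = BeautifulSoup(html_doc, "html.parser")
# Pick the <script> tag whose text contains "totalPage".
script = soup.find("script", text=lambda t: t and "totalPage" in t)
# \D+ skips the non-digit characters after the key; (\d+) captures the number.
print(re.search(r"totalPage\D+(\d+)", script.text).group(1))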
Prints:
12
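As a side note on why the pattern in the question finds nothing: it requires a semicolon directly after the digits, but in the page the number is followed by a comma. The sketch below uses only `re`. The `script_text` variable is a stand-in for `script.text` and is an assumption for illustration. It compares the two patterns and converts the match to an integer:

```python
import re

# Stand-in for script.text from the snippet above (assumption for illustration).
script_text = 'var pageList = {"totalData":549,"totalPage":12,"pageSize":48,"currentPage":1};'

# The question's pattern expects ';' directly after the digits, so it never matches.
strict = re.compile(r'"totalPage":(\d+);')
# Dropping the ';' lets the pattern match the digits that follow the key.
relaxed = re.compile(r'"totalPage":(\d+)')

strict_match = strict.search(script_text)
relaxed_match = relaxed.search(script_text)

# group(1) is a string; int() turns it into a number.
total_pages = int(relaxed_match.group(1))
print(strict_match, total_pages)
```

Also note that `find_all` returns a list-like result, so `.text` has to be read from an individual element (for example `scripts[0]`), not from the list itself.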
2. Using js2py
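import js2py
# Reuses `soup` from the first example.
script = soup.find("script", text=lambda t: t and "totalPage" in t)
# Wrap the script body in a JavaScript function that returns pageList.
s = "function $() {" + script.text + " return pageList;}"
print(js2py.eval_js(s)()["totalPage"])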
Prints:
12
3. Using re/json
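import re
import json
# Reuses `soup` from the first example.
script = soup.find("script", text=lambda t: t and "totalPage" in t)
# Capture the object literal assigned to pageList and parse it as JSON.
n = json.loads(re.search(r"pageList = (.*);", script.text).group(1))["totalPage"]
print(n)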
Prints:
12
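The re/json approach can also be wrapped in a small helper that works on any `var name = ...;` assignment in the script text. This is only a sketch: `extract_js_var` is a hypothetical name, not part of any library, and it assumes the assigned value is valid JSON and that the statement ends with `;` at the end of a line.

```python
import json
import re


def extract_js_var(script_text, name):
    """Return the JSON-compatible value assigned to `var <name> = ...;`, or None."""
    # Non-greedy capture up to the ';' that closes the statement at end of line.
    pattern = re.compile(
        r"var\s+" + re.escape(name) + r"\s*=\s*(.*?);\s*$", re.MULTILINE
    )
    match = pattern.search(script_text)
    if match is None:
        return None
    return json.loads(match.group(1))


# Sample text shaped like the script tag in the question (shortened).
sample = """
var breadcrumbData = {"level":0,"parentCategoryName":null};
var pageList = {"totalData":549,"totalPage":12,"pageSize":48,"currentPage":1};
var pageSize = 48;
"""

page_list = extract_js_var(sample, "pageList")
print(page_list["totalPage"])
```

With BeautifulSoup, the `sample` string would be replaced by `script.text`. Values that are JavaScript but not JSON (single-quoted strings, trailing commas, function calls) would make `json.loads` fail, in which case the js2py route is the more suitable one.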