How to find the unique article ID / page ID of an HTML page with Beautiful Soup



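I am trying to find the unique article ID / page ID associated with each HTML page. The problem is that the ID is named differently from page to page: for example articleId, value, netID, and so on. For most pages the article ID can be found inside a script tag. This is what the text inside the script tags looks like: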

<script type="text/javascript">var lf_config = [{"collectionMeta":"eyJ0eXAiOiJqd3QiLCJhbGciOiJIUzI1NiJ9.eyJ0aXRsZSI6IkN1dCBGYXQgRmFzdCYjODIzMDthbmQgSGF2ZSBGdW4gRG9pbmcgSXQhIiwidXJsIjoiaHR0cHM6XC9cL2Jsb29kcHJlc3N1cmVzb2x1dGlvbi5jb21cL2N1dC1mYXQtZmFzdC1mdW5cLyIsInRhZ3MiOiIiLCJjaGVja3N1bSI6IjIxODcxZjdmYTVkZTcwNjQ2NDAyNzk2YjFjMDFiZTE2IiwiYXJ0aWNsZUlkIjoxMTMzfQ.A4dXaOb2eIKk2OiANm0USozRiof21OKzQUjvy6fymgg",
"checksum":"21871f7fa5de70646402796b1c01be16",
"siteId":"339299",
"articleId":1133,"strings":"","el":"livefyre-comments"}];var conv = fyre.conv.load({}, lf_config);</script>
<script type="text/javascript">
/* <![CDATA[ */
var wpcf7 = {"apiSettings":{"root":"https://bloodpressuresolution.com/wp-json/contact-form-7/v1","namespace":"contact-form-7/v1"},"recaptcha":{"messages":{"empty":"Please verify that you are not a robot."}}};
/* ]]> */
</script>

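This is the code I have tried, but so far it produces no output. The list new_link contains the URL of each HTML page. I think the regular expression is fine, but I cannot work out how to get at the text inside the tags and search it with the regex. Ultimately I want to store the article ID key and its value as my output. Please help me figure out how to find the unique article ID in each HTML page.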

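```python
import re
import requests
from bs4 import BeautifulSoup

# new_link, hdr and pattern are defined elsewhere (not shown in the question)
for i in new_link:
    new_req = requests.get(i, headers=hdr)
    soup = BeautifulSoup(new_req.text, "html.parser")
    scripts = soup.find_all("script", attrs={"type": "text/javascript"})
    for j in scripts:
        temp = re.findall(pattern, str(j))
        print(temp)
```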

If you want to find an element or object by its id:

div = soup.find(id="articlebody")
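
For the case in the question, where the ID sits inside an inline script rather than in an element's id attribute, something along these lines might work. This is only a minimal sketch: the `ID_KEYS` list, the `ID_PATTERN` regex and the `find_article_ids` helper are my own hypothetical names, and the list of key names is an assumption that you would need to extend to match your pages.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical list of key names; extend it with whatever your pages use.
ID_KEYS = ["articleId", "netID"]

# Matches forms such as "articleId":1133, "articleId": "1133" or articleId = 1133,
# ignoring case so that spelling variants of the same key are also caught.
KEY_ALTERNATION = "|".join(re.escape(k) for k in ID_KEYS)
ID_PATTERN = re.compile(
    r"""["']?(""" + KEY_ALTERNATION + r""")["']?\s*[:=]\s*["']?([\w-]+)""",
    re.IGNORECASE,
)


def find_article_ids(html):
    """Return a list of (key, value) pairs found in inline <script> tags."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for script in soup.find_all("script"):
        text = script.string  # None when the tag has no inline body (e.g. src=...)
        if not text:
            continue
        for match in ID_PATTERN.finditer(str(text)):
            found.append((match.group(1), match.group(2)))
    return found


# Shortened version of the script tag from the question.
sample = '''<script type="text/javascript">var lf_config = [{"siteId":"339299",
"articleId":1133,"strings":"","el":"livefyre-comments"}];</script>'''
print(find_article_ids(sample))
```

Inside the loop from the question you would call `find_article_ids(new_req.text)` for each URL. I search `script.string` rather than `str(script)` so that only the script body is matched and not the tag's own attributes, and I did not add `value` to the key list because a key that generic would probably match many unrelated settings.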
