BeautifulSoup: get_text() returns an empty string from a bs4 tag

I am trying to extract information from this news page.

First, I parse the page:

import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.theguardian.com/politics/2019/oct/20/boris-johnson-could-be-held-in-contempt-of-court-over-brexit-letter")
soup = BeautifulSoup(page.content, 'html.parser')

Then I start with the title:

title = soup.find('meta', property="og:title")

If I print it, I get:

<meta content="Boris Johnson could be held in contempt of court over Brexit letter" property="og:title"/>

However, when I run title.get_text(), the result is an empty string: ''

Where is my mistake?

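This is because the tag does not actually contain any text. get_text() only returns text nested inside a tag, and a <meta> tag has none. In this case, the "text" you are after is stored in the content attribute of the <meta> tag, so you need to pull out the value of content: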

import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.theguardian.com/politics/2019/oct/20/boris-johnson-could-be-held-in-contempt-of-court-over-brexit-letter")
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('meta', property="og:title")['content']

Output:

print (title)
Boris Johnson could be held in contempt of court over Brexit letter
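
To see the difference without making a network request, here is a minimal sketch that parses a small inline HTML string. The HTML snippet is an assumption standing in for the real page; only the og:title tag is taken from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for the downloaded page.
html = """
<html><head>
<meta content="Boris Johnson could be held in contempt of court over Brexit letter" property="og:title"/>
<title>Example page</title>
</head><body></body></html>
"""

soup = BeautifulSoup(html, "html.parser")
meta = soup.find("meta", property="og:title")

text_inside = meta.get_text()    # a <meta> tag has no child text nodes
content_value = meta["content"]  # the title lives in the attribute
# .get() returns None instead of raising KeyError when the attribute is missing
missing_value = meta.get("name")

print(repr(text_inside))
print(content_value)
print(missing_value)
```

Indexing with ['content'] raises KeyError if the attribute is absent, so .get('content') may be the safer choice when scraping pages whose markup you do not control.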

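You can use .attrs to get all attributes and their values. It returns a dictionary of the attributes and values (key: value pairs) of the given tag: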

title = soup.find('meta', property="og:title")
print (title.attrs)

Output:

print (title.attrs)
{'property': 'og:title', 'content': 'Boris Johnson could be held in contempt of court over Brexit letter'}
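
Since .attrs is a plain dictionary, it can be combined with a check for a missing tag. The following is a rough sketch only: meta_content is a hypothetical helper name, not part of BeautifulSoup, and the inline HTML is an assumption used for illustration.

```python
from bs4 import BeautifulSoup


def meta_content(soup, prop):
    """Hypothetical helper: content of the first <meta property=prop>, or None."""
    tag = soup.find("meta", property=prop)
    if tag is None:
        # find() returns None when nothing matches, so guard before using .attrs
        return None
    return tag.attrs.get("content")


# Hypothetical inline HTML used only for this illustration.
html = '<head><meta content="Some title" property="og:title"/></head>'
soup = BeautifulSoup(html, "html.parser")

print(meta_content(soup, "og:title"))
print(meta_content(soup, "og:description"))
```

Returning None for both a missing tag and a missing attribute keeps the calling code to a single check, at the cost of not distinguishing the two cases.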
