How to extract text from an element selected by class



I am trying to extract data from a website using BeautifulSoup.

The page's HTML contains:

<div content-43 class="item-name">This is the text I want to grab</div>

I am currently using:

item_store = soup.find_all("div",{"class":"item-name"}) 

However, it returns the whole line of HTML, div tag included, rather than just the text I want.

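You have to call .get_text() to extract the text; find() and find_all() give you the element(s), not the text. Note that find_all() returns a ResultSet, so you must iterate over it and call the method on each element.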
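To make the ResultSet point concrete, here is a minimal sketch. The two sample divs and the variable names are made up for illustration; it assumes bs4 is installed and uses the built-in "html.parser". Calling get_text() on the ResultSet itself raises AttributeError, while calling it per element works.

```python
from bs4 import BeautifulSoup

# Illustrative HTML, not the asker's real page
html = '''
<div class="item-name">First item</div>
<div class="item-name">Second item</div>
'''
soup = BeautifulSoup(html, "html.parser")
item_store = soup.find_all("div", {"class": "item-name"})

# get_text() belongs to the individual elements, not to the collection,
# so calling it on the ResultSet raises AttributeError.
try:
    item_store.get_text()
    called_on_result_set = True
except AttributeError:
    called_on_result_set = False

# Iterating first and calling get_text() on each element works.
texts = [item.get_text() for item in item_store]
print(called_on_result_set)
print(texts)
```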

Using find() to get a single element:

soup.find("div",{"class":"item-name"}).get_text()

Using find_all(), which returns a ResultSet:

[e.get_text() for e in soup.find_all("div",{"class":"item-name"})]

Using select() with CSS selectors, which likewise returns a collection to iterate over:

[e.get_text() for e in soup.select('div.item-name')]
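One difference between the three approaches is worth keeping in mind: find() returns None when nothing matches, so chaining .get_text() directly onto it fails, whereas find_all() and select() simply return empty collections. A small sketch, where "missing-class" is a made-up class name that does not occur in the sample HTML:

```python
from bs4 import BeautifulSoup

html = '<div class="item-name">First item</div>'
soup = BeautifulSoup(html, "html.parser")

# find() gives None for no match, so check before calling get_text()
element = soup.find("div", {"class": "missing-class"})
missing_text = element.get_text() if element is not None else None

# find_all() and select() give empty collections, so the comprehensions are safe
found_all = [e.get_text() for e in soup.find_all("div", {"class": "missing-class"})]
found_select = [e.get_text() for e in soup.select("div.missing-class")]

# A class that does exist still works as before
present = [e.get_text() for e in soup.select("div.item-name")]
print(missing_text, found_all, found_select, present)
```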

Example

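from bs4 import BeautifulSoup

html = '''
<div content-43 class="item-name">This is the text I grab with find() and also with find_all()</div>
<div content-43 class="item-name">This is the text I want to grab with find_all() </div>
'''
# Name the parser explicitly; without it bs4 warns and uses whichever parser is installed
soup = BeautifulSoup(html, "html.parser")
# find() returns the first matching element
print(soup.find("div",{"class":"item-name"}).get_text())
# find_all() returns a ResultSet, so call get_text() on each element
print([e.get_text() for e in soup.find_all("div",{"class":"item-name"})])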

Output

This is the text I grab with find() and also with find_all()

['This is the text I grab with find() and also with find_all()',
'This is the text I want to grab with find_all() ']
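The second string in that output keeps its trailing space. If that matters, get_text() accepts strip=True, which trims leading and trailing whitespace. A short sketch reusing the second sample div; the variable names are only illustrative:

```python
from bs4 import BeautifulSoup

html = '<div content-43 class="item-name">This is the text I want to grab with find_all() </div>'
soup = BeautifulSoup(html, "html.parser")
element = soup.find("div", {"class": "item-name"})

raw = element.get_text()              # keeps the trailing space from the HTML
clean = element.get_text(strip=True)  # whitespace at both ends removed
print(repr(raw))
print(repr(clean))
```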

You should use the .get_text() method or the .text attribute.
You can print them like this:

for item in item_store:
    print(item.text)
    # print(item.get_text())
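For completeness, a self-contained version of that loop. The sample HTML and the collected list are assumptions added for illustration; item_store is built the same way as in the question. With no arguments, .text and .get_text() return the same string.

```python
from bs4 import BeautifulSoup

# Illustrative HTML standing in for the real page
html = '''
<div class="item-name">Apple</div>
<div class="item-name">Banana</div>
'''
soup = BeautifulSoup(html, "html.parser")
item_store = soup.find_all("div", {"class": "item-name"})

collected = []
for item in item_store:
    # .text and get_text() agree when get_text() is called without arguments
    assert item.text == item.get_text()
    collected.append(item.text)
    print(item.text)
```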
