I'm trying to extract data from a website using BeautifulSoup.
The site's HTML looks like this:
<div content-43 class="item-name">This is the text I want to grab</div>
I'm currently using:
item_store = soup.find_all("div",{"class":"item-name"})
However, it returns the whole line of HTML, div tag included, rather than just the text I want.
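You have to use .get_text() to extract the text rather than the element. Note that find_all() returns a ResultSet, so you must iterate over it and call the method on each element.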
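To illustrate why the iteration is needed, here is a minimal sketch. The two-element HTML snippet and the variable names are made up for illustration, and "html.parser" is just one possible parser choice. It shows that calling get_text() on the ResultSet itself raises an AttributeError, while calling it on each element works.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only for this illustration
html = '<div class="item-name">first</div><div class="item-name">second</div>'
soup = BeautifulSoup(html, "html.parser")

items = soup.find_all("div", {"class": "item-name"})

# A ResultSet is a list of elements, so it has no get_text() of its own
try:
    items.get_text()
    failed = False
except AttributeError:
    failed = True

# Calling get_text() on each element works
texts = [item.get_text() for item in items]
print(failed, texts)
```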
Using find() to get a single element:
soup.find("div",{"class":"item-name"}).get_text()
Using find_all() and iterating over the ResultSet:
[e.get_text() for e in soup.find_all("div",{"class":"item-name"})]
Using select() with CSS selectors and iterating over the results in the same way:
[e.get_text() for e in soup.select('div.item-name')]
Example
from bs4 import BeautifulSoup
html = '''
<div content-43 class="item-name">This is the text I grab with find() and also with find_all()</div>
<div content-43 class="item-name">This is the text I want to grab with find_all() </div>
'''
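# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, "html.parser")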
print(soup.find("div",{"class":"item-name"}).get_text())
print([e.get_text() for e in soup.find_all("div",{"class":"item-name"})])
Output
This is the text I grab with find() and also with find_all()
and
['This is the text I grab with find() and also with find_all()',
'This is the text I want to grab with find_all() ']
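Note the trailing space in the second string of that output. As a small sketch (not part of the original answer), get_text() accepts a strip=True argument that trims leading and trailing whitespace from the extracted text. The snippet below reuses the example HTML, uses the select() variant, and assumes the built-in "html.parser".

```python
from bs4 import BeautifulSoup

html = '''
<div content-43 class="item-name">This is the text I grab with find() and also with find_all()</div>
<div content-43 class="item-name">This is the text I want to grab with find_all() </div>
'''

soup = BeautifulSoup(html, "html.parser")

# strip=True removes surrounding whitespace from each extracted string
names = [e.get_text(strip=True) for e in soup.select("div.item-name")]
print(names)
```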
You should use the .get_text() method or the text attribute.
You can print them like this:
for item in item_store:
    print(item.text)
    # print(item.get_text())
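For a version that runs on its own, here is a minimal sketch. It assumes the HTML from the question, trimmed to a single div, and the "html.parser" parser. It rebuilds item_store and checks that the text attribute and get_text() give the same string.

```python
from bs4 import BeautifulSoup

html = '<div class="item-name">This is the text I want to grab</div>'
soup = BeautifulSoup(html, "html.parser")

item_store = soup.find_all("div", {"class": "item-name"})

for item in item_store:
    # Both spellings return the same string for an element
    same = item.text == item.get_text()
    print(item.text, same)
```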