Hi, I have a local HTML file that contains chat messages:
<div class="body">
<div class="pull_right date details" title="01.01.2022 01:01:01">
01:01
</div>
<div class="from_name">
XYZ
</div>
<div class="reply_to details">
In reply to <a href="#go_to_message23" onclick="return GoToMessage(747)">this message</a>
</div>
<div class="text">
Eat some chocolate
</div>
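Now I want to create a df that shows specific information for each message. For example, I extract the name of the user who wrote the message with:
# doc.select('div[id]')[2].select_one('.from_name').text.strip()
messages = doc.select('div[id]')
for message in messages:
    print('---')
    try:
        print([message.select_one('.from_name').text.strip()])
    except AttributeError:
        # select_one() returned None, so there is no .text to read
        print("Couldn't find a name")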
However, I don't know how to extract the date on which the message was sent. Can anyone help? Thanks
Simply select the element and read its title attribute to extract the value:
select_one('div[title]').get('title')
from bs4 import BeautifulSoup

html='''
<div class="body">
<div class="pull_right date details" title="01.01.2022 01:01:01">
01:01
</div>
<div class="from_name">
XYZ
</div>
<div class="reply_to details">
In reply to <a href="#go_to_message23" onclick="return GoToMessage(747)">this message</a>
</div>
<div class="text">
Eat some chocolate
</div>'''

# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')

data = []
for e in soup.select('div.body'):
    data.append({
        # the full timestamp lives in the title attribute, not in the text
        'date':e.select_one('div[title]').get('title'),
        'name':e.select_one('div.from_name').contents[0].strip(),
        'text':e.select_one('div.text').text.strip(),
    })
data
Output:
[{'date': '01.01.2022 01:01:01', 'name': 'XYZ', 'text': 'Eat some chocolate'}]
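If some messages lack a date or a name, select_one() returns None and the chained .get() or .text call raises AttributeError, which is what the try/except in the question was catching. A minimal sketch of one way around that, assuming it is acceptable to store None for missing fields. The helper names text_or_none and attr_or_none and the second sample message are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second message has no date and no name.
html = '''
<div class="body">
  <div class="pull_right date details" title="01.01.2022 01:01:01">01:01</div>
  <div class="from_name">XYZ</div>
  <div class="text">Eat some chocolate</div>
</div>
<div class="body">
  <div class="text">Me too</div>
</div>
'''


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


def attr_or_none(parent, selector, attr):
    """Return an attribute of the first match, or None if nothing matches."""
    node = parent.select_one(selector)
    return node.get(attr) if node is not None else None


soup = BeautifulSoup(html, 'html.parser')
rows = [
    {
        'date': attr_or_none(e, 'div.date[title]', 'title'),
        'name': text_or_none(e, 'div.from_name'),
        'text': text_or_none(e, 'div.text'),
    }
    for e in soup.select('div.body')
]
print(rows)
```

Messages with a missing field then still produce a row, so the lengths stay consistent when the list is later turned into a DataFrame.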
You can simply turn your result into a DataFrame:
import pandas as pd
pd.DataFrame(data)
Output:
                  date  name                text
0  01.01.2022 01:01:01   XYZ  Eat some chocolate
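The date column above is still a plain string. If you want to sort or filter by time, you could parse it with pd.to_datetime. This is a sketch that assumes the title is in day.month.year order; the sample cannot confirm that, since day and month are both 01, so check it against a message with an unambiguous date first:

```python
import pandas as pd

# Same record as produced by the loop above
data = [{'date': '01.01.2022 01:01:01', 'name': 'XYZ', 'text': 'Eat some chocolate'}]
df = pd.DataFrame(data)

# Assumed format: DD.MM.YYYY HH:MM:SS (an explicit format avoids guessing)
df['date'] = pd.to_datetime(df['date'], format='%d.%m.%Y %H:%M:%S')

print(df.dtypes)
```

With an explicit format, a value that does not match raises an error instead of being silently misread.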