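I have a CSV table in which one column contains text from chat logs. Each text row follows the same format: the name of the person who sent the message and the time (padded with extra whitespace on both sides), followed by the message content. A single-row example of the text column: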
' Siri (3:15pm) Hello how can I help you? John Wayne (3:17pm) what day of the week is today Siri (3:18pm) it is Monday.'
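I want to turn this single string column into multiple columns (the number of columns depends on the number of messages), with one column per individual message, like this: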
Siri (3:15pm) Hello how can I help you
John Wayne (3:17pm) what day of the week is today
Siri (3:18pm) it is Monday
How can I parse the text in a pandas DataFrame column to split the chat log into separate message columns?
If you have this DataFrame:
Messages
0 Siri (3:15pm) Hello how can I help you? John Wayne (3:17pm) what day of the week is today Siri (3:18pm) it is Monday.
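Then you can do:
# Split on runs of two or more whitespace characters, then give each piece its own row
x = df["Messages"].str.split(r"\s{2,}").explode()
# Even positions hold "Name (time)", odd positions hold the message text
out = (x[::2] + " " + x[1::2]).to_frame()
print(out)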
Prints:
Messages
0 Siri (3:15pm) Hello how can I help you?
0 John Wayne (3:17pm) what day of the week is today
0 Siri (3:18pm) it is Monday.
Note: this only works if there are 2 or more spaces between the name and the text.
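Since the question asks for one column per message rather than one row per message, here is a minimal sketch of a wide layout. It relies on the same assumption as the note above (at least two spaces around each "Name (time)" block). The sample data, the helper `split_messages`, and the `message_N` column names are hypothetical and only serve as illustration.

```python
import re

import pandas as pd

# Hypothetical sample data; two-space padding around each "Name (time)" block is assumed.
df = pd.DataFrame({
    "Messages": [
        "  Siri (3:15pm)  Hello how can I help you?  "
        "John Wayne (3:17pm)  what day of the week is today  "
        "Siri (3:18pm)  it is Monday.",
        "  Siri (9:00am)  Good morning.",
    ]
})


def split_messages(text):
    """Return a list of 'Name (time) message' strings from one chat log."""
    # Strip the outer padding first so the split yields no empty leading piece.
    parts = re.split(r"\s{2,}", text.strip())
    # Pieces alternate: header, body, header, body, ...
    return [f"{head} {body}" for head, body in zip(parts[::2], parts[1::2])]


# One row per chat log, one column per message; shorter logs get missing values.
wide = pd.DataFrame(df["Messages"].map(split_messages).tolist(), index=df.index)
wide.columns = [f"message_{i + 1}" for i in wide.columns]
print(wide)
```

Rows with fewer messages than the longest log end up with missing values in the trailing columns, which is probably what you want when the number of messages varies per row.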
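This is how I did it. It took me a while, but we got there!
import pandas as pd

# Two-space padding around each "Name (time)" block is assumed here; the exact padding was lost in the post's formatting
s = pd.Series(['  Siri (3:15pm)  Hello how can I help you?  John Wayne (3:17pm)  what day of the week is today  Siri (3:18pm)  it is Monday.'])
# Split on runs of 2+ whitespace characters, one piece per column
s = s.str.split(r"\s{2,}", expand=True)
# Column 0 is the empty string in front of the leading padding
s = s.drop(labels=[0], axis=1)
s = s.transpose()
list_1 = list(s[0])
odd_i = []
even_i = []
# Pieces alternate: "Name (time)" first, then the message text
for i in range(len(list_1)):
    if i % 2:
        even_i.append(list_1[i])
    else:
        odd_i.append(list_1[i])
d = {'Name': odd_i, 'Message': even_i}
df = pd.DataFrame(data=d)
df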
Output:
                  Name                        Message
0        Siri (3:15pm)      Hello how can I help you?
1  John Wayne (3:17pm)  what day of the week is today
2        Siri (3:18pm)                  it is Monday.
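For what it's worth, the same Name/Message table can probably be built more compactly with list slicing instead of the explicit loop. This is only a sketch under the same assumption of 2+ spaces of padding. The extra `Speaker` and `Time` columns are not part of the answer above; they are hypothetical additions showing how the header could be split further with `str.extract`.

```python
import re

import pandas as pd

# Sample text from the question; the two-space padding is an assumption.
text = (
    "  Siri (3:15pm)  Hello how can I help you?  "
    "John Wayne (3:17pm)  what day of the week is today  "
    "Siri (3:18pm)  it is Monday."
)

# Strip the outer padding, then split on runs of 2+ whitespace characters.
parts = re.split(r"\s{2,}", text.strip())

# Pieces alternate between "Name (time)" headers and message bodies.
df = pd.DataFrame({"Name": parts[::2], "Message": parts[1::2]})

# Optionally pull the speaker and the time out of each header.
header = df["Name"].str.extract(r"^(?P<Speaker>.*?)\s*\((?P<Time>[^)]*)\)$")
print(df.join(header))
```

Slicing with `parts[::2]` and `parts[1::2]` does the same odd/even bookkeeping as the loop, so the two approaches should give identical `Name` and `Message` columns on this input.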