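How can I combine the words in a DataFrame into sentences, starting a new sentence after each period or question mark?
The original DataFrame looks like this: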
start_time  end_time  words
0.1         0.2       I
0.3         0.4       AM
0.5         0.6       GOOD.
0.7         0.8       HOW
0.9         1.0       ABOUT
1.1         1.2       YOU?
1.3         1.4       OK!
The result I want looks like this:
start_time  end_time  words
0.1         0.6       I AM GOOD.
0.7         1.2       HOW ABOUT YOU?
1.3         1.4       OK!
Here is my DataFrame:
import pandas as pd

data = {'start_time': [0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3],
        'end_time': [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4],
        'word': ['I', 'AM', 'OK.', 'HOW', 'ABOUT', 'YOU?', 'OK!']}
df = pd.DataFrame(data, columns=['start_time', 'end_time', 'word'])
Can anyone suggest an algorithm for this? Thank you very much!
Try:
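import re

# A word ends a sentence when its last character is ".", "!" or "?".
# Unescaped, "." matches any character and a bare "?" is a regex error.
pattern = re.compile(r"[.!?]$")
# Shift the flag down one row and cumsum to get one group id per sentence.
df_out = df.groupby(
    df.word.apply(lambda x: bool(pattern.search(x))).shift(fill_value=False).cumsum()
).agg({"start_time": "first", "end_time": "last", "word": " ".join})
print(df_out)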
Prints:
      start_time  end_time            word
word
0            0.1       0.6        I AM OK.
1            0.7       1.2  HOW ABOUT YOU?
2            1.3       1.4             OK!
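
To see why the grouping key works, it may help to build it step by step. The sketch below is one possible way to spell it out, not the only one: it assumes a word ends a sentence exactly when its last character is ".", "!" or "?", and it uses `str.contains` instead of a compiled pattern. The names `ends_sentence`, `sentence_id` and `result` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "start_time": [0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3],
    "end_time": [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4],
    "word": ["I", "AM", "OK.", "HOW", "ABOUT", "YOU?", "OK!"],
})

# Step 1: flag words whose last character is ".", "!" or "?" (assumed rule).
ends_sentence = df["word"].str.contains(r"[.!?]$")

# Step 2: shift the flag down one row so it sits on the first word of the
# next sentence; the first row has no predecessor, so fill it with False.
starts_new = ends_sentence.shift(fill_value=False)

# Step 3: a running sum of the flags gives each word its sentence id.
sentence_id = starts_new.cumsum()  # 0 0 0 1 1 1 2

# Aggregate per sentence, then drop the group ids to get a plain 0..n-1 index.
result = (
    df.groupby(sentence_id)
    .agg({"start_time": "first", "end_time": "last", "word": " ".join})
    .reset_index(drop=True)
)
print(result)
```

The `reset_index(drop=True)` at the end is optional; it only removes the extra `word` index level that shows up in the printed output above.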