I have an interview transcript that I have converted into a df (I only recently started learning pandas). The df is structured as follows:
Timestamp Content Speaker
00:00:00.000-00:00:01.100 Content Speaker 1
00:00:03.470-00:00:07.760 Content Speaker 1
00:00:09.340-00:00:13.690 Content Speaker 1
00:00:13.690-00:00:13.760 Content Speaker 2
00:00:14.550-00:00:14.880 Content Speaker 2
00:00:17.000-00:00:30.510 Content Speaker 1
00:00:30.740-00:00:41.930 Content Speaker 2
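What I want to do is collapse/merge the rows across speakers (sorry, I'm not sure which term fits best here) while keeping the start and end time of each speaker's segment. The output should look like this: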
**Timestamp Content Speaker**
00:00:00.000-00:00:13.690 ContentContentContent Speaker 1
00:00:13.690-00:00:14.880 ContentContent Speaker 2
00:00:17.000-00:00:30.510 Content Speaker 1
00:00:30.740-00:00:41.930 Content Speaker 2
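I don't think I can use groupby, because it would not preserve the order of the speakers: for example, it would put all of Speaker 1's rows at the top and all of Speaker 2's rows below them. As far as I understand, I can't do this with merge or concatenate either. I have a feeling this is going to be a much more complicated operation than I expected. Any help, advice, or pointers would be greatly appreciated. Thanks in advance.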
You can use GroupBy.agg, with the inequality between consecutive rows as the grouping key:
(df
 .groupby(df['Speaker'].ne(df['Speaker'].shift()).cumsum(),
          as_index=False
          )
 .agg({'Timestamp': 'max', 'Content': ''.join, 'Speaker': 'first'})
)
Output:
Timestamp Content Speaker
0 00:00:09.340-00:00:13.690 ContentContentContent Speaker 1
1 00:00:14.550-00:00:14.880 ContentContent Speaker 2
2 00:00:17.000-00:00:30.510 Content Speaker 1
3 00:00:30.740-00:00:41.930 Content Speaker 2
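To make the grouping key less opaque, here is a minimal sketch of what `ne(shift()).cumsum()` evaluates to on the question's sample data. The names `speaker`, `changed` and `run_id` are mine, chosen only for illustration.

```python
import pandas as pd

# Speaker column from the question's sample data
speaker = pd.Series(
    ["Speaker 1"] * 3 + ["Speaker 2"] * 2 + ["Speaker 1", "Speaker 2"],
    name="Speaker",
)

# True wherever the speaker differs from the previous row.
# Row 0 is compared with the NaN introduced by shift(), so it is True as well.
changed = speaker.ne(speaker.shift())

# Running total of the change flags: one integer label per consecutive run
run_id = changed.cumsum()

print(run_id.tolist())  # [1, 1, 1, 2, 2, 3, 4]
```

Because the label only increases when the speaker changes, the second "Speaker 1" run gets its own label (3) instead of being lumped together with the first one, which is what a plain groupby on the Speaker column would do.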
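Update: reworking the timestamps. Note that each segment's start is taken from the previous segment's end, so any gap between speakers is absorbed into the next segment (row 3 below starts at 00:00:14.880 rather than 00:00:17.000):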
df2 = (df
 .groupby(df['Speaker'].ne(df['Speaker'].shift()).cumsum())
 .agg({'Timestamp': 'max', 'Content': ''.join, 'Speaker': 'first'})
 .assign(Timestamp=lambda d: (s:=d['Timestamp'].str.extract('-(.*)', expand=False))
                              .shift(fill_value='00:00:00.000')
                              +'-'+s
         )
)
Output:
Timestamp Content Speaker
Speaker
1 00:00:00.000-00:00:13.690 ContentContentContent Speaker 1
2 00:00:13.690-00:00:14.880 ContentContent Speaker 2
3 00:00:14.880-00:00:30.510 Content Speaker 1
4 00:00:30.510-00:00:41.930 Content Speaker 2
You can compare adjacent cells and filter the df.
Update following the comments: here I build on @mozway's answer, because groupby is now needed (I think).
df[['start', 'end']] = df['Timestamp'].str.split('-',expand=True)
res = (df
       .groupby(df['Speaker'].ne(df['Speaker'].shift()).cumsum())
       .agg({'Timestamp': 'max', 'Content': ''.join, 'Speaker': 'first', 'start':'first', 'end':'last'})
       .assign(Timestamp= lambda x: x[['start', 'end']].apply('-'.join, axis=1))
       .drop(['start', 'end'],axis=1)
      )
print(res)
Timestamp Content Speaker
Speaker
1 00:00:00.000-00:00:13.690 ContentContentContent Speaker 1
2 00:00:13.690-00:00:14.880 ContentContent Speaker 2
3 00:00:17.000-00:00:30.510 Content Speaker 1
4 00:00:30.740-00:00:41.930 Content Speaker 2
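For anyone who wants to run this end to end, the following is a self-contained sketch of the same idea. It rebuilds the sample df from the question, and the names `bounds`, `run_id` and `merged` are my own; it also leaves the original df untouched and resets the index, which is a choice of mine rather than part of the answer above.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Timestamp": [
        "00:00:00.000-00:00:01.100",
        "00:00:03.470-00:00:07.760",
        "00:00:09.340-00:00:13.690",
        "00:00:13.690-00:00:13.760",
        "00:00:14.550-00:00:14.880",
        "00:00:17.000-00:00:30.510",
        "00:00:30.740-00:00:41.930",
    ],
    "Content": ["Content"] * 7,
    "Speaker": ["Speaker 1"] * 3 + ["Speaker 2"] * 2 + ["Speaker 1", "Speaker 2"],
})

# Split "start-end" into two helper columns without modifying df in place
bounds = df["Timestamp"].str.split("-", expand=True)
work = df.assign(start=bounds[0], end=bounds[1])

# One label per run of consecutive rows with the same speaker
run_id = work["Speaker"].ne(work["Speaker"].shift()).cumsum().rename("run")

merged = (
    work.groupby(run_id)
        .agg({"start": "first", "end": "last",
              "Content": "".join, "Speaker": "first"})
        .reset_index(drop=True)
)

# Rebuild the Timestamp from the first start and the last end of each run
merged["Timestamp"] = merged["start"] + "-" + merged["end"]
merged = merged[["Timestamp", "Content", "Speaker"]]
print(merged)
```

Taking the first start and the last end of each run keeps the gaps between speakers intact, which matches the output requested in the question.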
Old answer:
res = df[df['Speaker'].ne(df['Speaker'].shift(-1))]
print(res)
Timestamp Content Speaker
2 00:00:09.340-00:00:13.690 Content Speaker 1
4 00:00:14.550-00:00:14.880 Content Speaker 2
5 00:00:17.000-00:00:30.510 Content Speaker 1
6 00:00:30.740-00:00:41.930 Content Speaker 2
How it works:
mask = df['Speaker'].ne(df['Speaker'].shift(-1))
print(mask)
0 False
1 False
2 True
3 False
4 True
5 True
6 True
Name: Speaker, dtype: bool
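The mask is True only on rows where the next row has a different speaker (and on the final row), so you always get the last row of each speaker's run.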
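As a small illustration of why the direction of the shift matters, this sketch compares `shift(-1)` with `shift()` on the question's Speaker column. The variable names are mine and purely illustrative.

```python
import pandas as pd

# Speaker column from the question's sample data
speaker = pd.Series(
    ["Speaker 1"] * 3 + ["Speaker 2"] * 2 + ["Speaker 1", "Speaker 2"],
    name="Speaker",
)

# Compare with the NEXT row: True on the last row of each run
last_of_run = speaker.ne(speaker.shift(-1))

# Compare with the PREVIOUS row: True on the first row of each run
first_of_run = speaker.ne(speaker.shift())

print(speaker.index[last_of_run].tolist())   # [2, 4, 5, 6]
print(speaker.index[first_of_run].tolist())  # [0, 3, 5, 6]
```

Filtering alone can only keep one row per run, so it cannot join the Content or combine the first start with the last end; that is why the updated answer switches to groupby.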