Pandas - collapse rows by cell value, but preserve order



I have a transcript of an interview that I have converted into a df (I only recently started learning pandas). The df is structured as follows:

Timestamp                   Content Speaker
00:00:00.000-00:00:01.100   Content Speaker 1
00:00:03.470-00:00:07.760   Content Speaker 1
00:00:09.340-00:00:13.690   Content Speaker 1
00:00:13.690-00:00:13.760   Content Speaker 2
00:00:14.550-00:00:14.880   Content Speaker 2
00:00:17.000-00:00:30.510   Content Speaker 1
00:00:30.740-00:00:41.930   Content Speaker 2

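What I want to do is collapse/merge across speakers (sorry, I'm not sure which term fits best here) while keeping the start and end time of each speaker's section. The output should look like this: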

Timestamp                   Content                 Speaker
00:00:00.000-00:00:13.690   ContentContentContent   Speaker 1
00:00:13.690-00:00:14.880   ContentContent          Speaker 2
00:00:17.000-00:00:30.510   Content                 Speaker 1
00:00:30.740-00:00:41.930   Content                 Speaker 2

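I don't think I can use groupby, because it would not preserve the order of the speakers: it would put all of Speaker 1's rows at the top and all of Speaker 2's rows below them. As far as I understand, I can't use merge or concatenate either. I have a feeling this is going to be a much more complicated operation than I expected. Any help, advice, or pointers would be greatly appreciated. Thanks in advance.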

You can use GroupBy.agg, with the inequality between consecutive rows as the grouper:

(df
 .groupby(df['Speaker'].ne(df['Speaker'].shift()).cumsum(),
          as_index=False)
 .agg({'Timestamp': 'max', 'Content': ''.join, 'Speaker': 'first'})
)

Output:

                   Timestamp                Content    Speaker
0  00:00:09.340-00:00:13.690  ContentContentContent  Speaker 1
1  00:00:14.550-00:00:14.880         ContentContent  Speaker 2
2  00:00:17.000-00:00:30.510                Content  Speaker 1
3  00:00:30.740-00:00:41.930                Content  Speaker 2
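To see why this grouper keeps the runs in order, it can help to look at the intermediate key on its own. The following is a minimal sketch, assuming a Speaker column like the one in the question; the names `changed` and `run_id` are illustrative and do not appear in the answer.

```python
import pandas as pd

# Sample data mirroring the Speaker column from the question (assumed)
df = pd.DataFrame({
    "Speaker": ["Speaker 1", "Speaker 1", "Speaker 1",
                "Speaker 2", "Speaker 2", "Speaker 1", "Speaker 2"],
})

# True wherever a row's speaker differs from the previous row's.
# The first row is compared against NaN, so it is always True.
changed = df["Speaker"].ne(df["Speaker"].shift())

# Running count of changes: one id per run of consecutive equal speakers
run_id = changed.cumsum()

print(changed.tolist())  # [True, False, False, True, False, True, True]
print(run_id.tolist())   # [1, 1, 1, 2, 2, 3, 4]
```

Because the second Speaker 1 block gets its own id (3) instead of sharing id 1, grouping on this key does not pull it back to the top.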

Update: reworking the timestamps (the `:=` operator requires Python 3.8+):

df2 = (df
 .groupby(df['Speaker'].ne(df['Speaker'].shift()).cumsum())
 .agg({'Timestamp': 'max', 'Content': ''.join, 'Speaker': 'first'})
 .assign(Timestamp=lambda d: (s:=d['Timestamp'].str.extract('-(.*)', expand=False))
                             .shift(fill_value='00:00:00.000')
                             +'-'+s
        )
)

Output:

                         Timestamp                Content    Speaker
Speaker
1        00:00:00.000-00:00:13.690  ContentContentContent  Speaker 1
2        00:00:13.690-00:00:14.880         ContentContent  Speaker 2
3        00:00:14.880-00:00:30.510                Content  Speaker 1
4        00:00:30.510-00:00:41.930                Content  Speaker 2

You can compare adjacent cells and filter the df.

Updated because of the comments: here I build on @mozway's answer, since a groupby is now needed (I think).

df[['start', 'end']] = df['Timestamp'].str.split('-', expand=True)
res = (df
 .groupby(df['Speaker'].ne(df['Speaker'].shift()).cumsum())
 .agg({'Timestamp': 'max', 'Content': ''.join, 'Speaker': 'first', 'start': 'first', 'end': 'last'})
 .assign(Timestamp=lambda x: x[['start', 'end']].apply('-'.join, axis=1))
 .drop(['start', 'end'], axis=1)
)
print(res)
                         Timestamp                Content    Speaker
Speaker
1        00:00:00.000-00:00:13.690  ContentContentContent  Speaker 1
2        00:00:13.690-00:00:14.880         ContentContent  Speaker 2
3        00:00:17.000-00:00:30.510                Content  Speaker 1
4        00:00:30.740-00:00:41.930                Content  Speaker 2
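For anyone who wants to run this end to end, here is a self-contained sketch of the same first-start/last-end idea. It is a variant, not the code above: it assumes sample data reconstructed from the question, uses named aggregation instead of a dict, and works on a copy so that `df` is left untouched. The names `tmp`, `run_id`, and `collapsed` are illustrative.

```python
import pandas as pd

# Sample data reconstructed from the question (assumed)
df = pd.DataFrame({
    "Timestamp": [
        "00:00:00.000-00:00:01.100",
        "00:00:03.470-00:00:07.760",
        "00:00:09.340-00:00:13.690",
        "00:00:13.690-00:00:13.760",
        "00:00:14.550-00:00:14.880",
        "00:00:17.000-00:00:30.510",
        "00:00:30.740-00:00:41.930",
    ],
    "Content": ["Content"] * 7,
    "Speaker": ["Speaker 1"] * 3 + ["Speaker 2"] * 2 + ["Speaker 1", "Speaker 2"],
})

# Split "start-end" into two helper columns on a copy of df
parts = df["Timestamp"].str.split("-", expand=True)
tmp = df.assign(start=parts[0], end=parts[1])

# One id per run of consecutive rows with the same speaker
run_id = tmp["Speaker"].ne(tmp["Speaker"].shift()).cumsum().rename("run")

# First start and last end of each run, contents joined in order
collapsed = (
    tmp.groupby(run_id)
       .agg(start=("start", "first"),
            end=("end", "last"),
            Content=("Content", "".join),
            Speaker=("Speaker", "first"))
       .reset_index(drop=True)
)

# Rebuild the combined Timestamp column and drop the helpers
collapsed["Timestamp"] = collapsed["start"] + "-" + collapsed["end"]
collapsed = collapsed[["Timestamp", "Content", "Speaker"]]
print(collapsed)
```

The `reset_index(drop=True)` call replaces the run id index with a plain 0..n-1 index, which avoids the extra index header line in the printed result.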

Old answer:

res = df[df['Speaker'].ne(df['Speaker'].shift(-1))]
print(res)
                   Timestamp  Content    Speaker
2  00:00:09.340-00:00:13.690  Content  Speaker 1
4  00:00:14.550-00:00:14.880  Content  Speaker 2
5  00:00:17.000-00:00:30.510  Content  Speaker 1
6  00:00:30.740-00:00:41.930  Content  Speaker 2

How it works:

mask = df['Speaker'].ne(df['Speaker'].shift(-1))
print(mask)
0    False
1    False
2     True
3    False
4     True
5     True
6     True
Name: Speaker, dtype: bool

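The mask is True only on rows where the next row has a different speaker (or where there is no next row), so you always get the last row of each speaker's run.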
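The direction of the shift decides which end of each run the mask picks out. A small sketch, assuming the same Speaker sequence as in the question; the names `last_of_run` and `first_of_run` are illustrative.

```python
import pandas as pd

# Speaker sequence from the question (assumed)
speaker = pd.Series(
    ["Speaker 1"] * 3 + ["Speaker 2"] * 2 + ["Speaker 1", "Speaker 2"],
    name="Speaker",
)

# Compare with the NEXT row: True on the last row of each run
last_of_run = speaker.ne(speaker.shift(-1))

# Compare with the PREVIOUS row: True on the first row of each run
first_of_run = speaker.ne(speaker.shift())

print(last_of_run[last_of_run].index.tolist())    # [2, 4, 5, 6]
print(first_of_run[first_of_run].index.tolist())  # [0, 3, 5, 6]
```

The previous-row comparison is the one whose cumulative sum gives the run ids used in the groupby-based answers.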
