Python: How to align two lists using the start/end timestamps in their items



> I have two lists. Each is sorted by start_time, and no item's time range overlaps another item's in the same list:

# (word, start_time, end_time)
words = [('i', 5.12, 5.23),
         ('like', 5.24, 5.36),
         ('you', 5.37, 5.71),
         ('really', 7.21, 7.51),
         ('yes', 8.32, 8.54)]
# (speaker, start_time, end_time)
segments = [('spk1', 0.0, 1.25),
            ('spk2', 4.75, 6.25),
            ('spk1', 6.75, 7.75),
            ('spk2', 8.25, 9.25)]

I want to group the items in words that fall within the start_time to end_time range of each item in segments, producing something like this:

res = [('i', 'like', 'you'),
       ('really',),
       ('yes',)]

That way, each item in res contains all the items of words whose start_time and end_time fall between the start_time and end_time of the corresponding item in segments.

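I came up with this solution while typing the question. I guess Stack Overflow makes a good rubber duck. Still, I would love to hear whether there is a more time-efficient approach.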

res = []
cur = 0                                          # index of the first unassigned word
for speaker, start, end in segments:
    sent = []
    for i, (word, word_start, word_end) in enumerate(words[cur:]):
        if word_start >= end:                    # word belongs to a later segment
            cur = cur + i
            break
        sent.append(word)
    else:                                        # no break: every remaining word was consumed
        cur = len(words)
    res.append((speaker, start, end, round(end - start, 2), " ".join(sent)))

A single loop over words should be fast.

res = [                                         # initialize beforehand
    [
        seg[0],
        seg[1],
        seg[2],
        round(seg[2] - seg[1], 2),
        '',                                     # with empty speech
    ] for seg in segments
]
i = 0                                           # index of res
for word in words:                              # for each row of word
    while word[1] >= res[i][2]:                 # next speaker? (also skips empty segments)
        i = i + 1                               # next res index
    if res[i][4]:                               # not empty speech
        res[i][4] = res[i][4] + ' ' + word[0]   # space in between
    else:                                       # empty speech
        res[i][4] = word[0]                     # initialize it

Happy Sunday!
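
For reference, the single-pass idea can also be wrapped in a small function. This is only a sketch, not the answerer's code: the name `align_words` is hypothetical, it assumes both lists are sorted by start time with no overlaps inside each list, and it makes one extra assumption that the question does not settle, namely that a word lying in a gap between segments or straddling a segment boundary is simply skipped.

```python
def align_words(words, segments):
    """Group words under the segment whose time range contains them.

    Assumes both lists are sorted by start time and that items within
    each list do not overlap. `align_words` is a hypothetical name.
    """
    grouped = [[] for _ in segments]
    seg_idx = 0
    for word, word_start, word_end in words:
        # Advance past every segment that ended before this word starts.
        while seg_idx < len(segments) and word_start >= segments[seg_idx][2]:
            seg_idx += 1
        if seg_idx == len(segments):
            break  # no segments left; remaining words stay unassigned
        _, seg_start, seg_end = segments[seg_idx]
        # Assumption: keep only words fully inside the segment; words in a
        # gap or straddling a boundary are skipped.
        if seg_start <= word_start and word_end <= seg_end:
            grouped[seg_idx].append(word)
    return [
        (speaker, start, end, " ".join(ws))
        for (speaker, start, end), ws in zip(segments, grouped)
    ]


words = [('i', 5.12, 5.23), ('like', 5.24, 5.36), ('you', 5.37, 5.71),
         ('really', 7.21, 7.51), ('yes', 8.32, 8.54)]
segments = [('spk1', 0.0, 1.25), ('spk2', 4.75, 6.25),
            ('spk1', 6.75, 7.75), ('spk2', 8.25, 9.25)]

for row in align_words(words, segments):
    print(row)
```

Each word and each segment is visited at most once, so the running time is linear in `len(words) + len(segments)`, and no list slices are copied along the way.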
