I have two lists. Each is sorted by start_time, and no item's time range (start_time to end_time) overlaps another item's in the same list:
# (word, start_time, end_time)
words = [('i', 5.12, 5.23),
         ('like', 5.24, 5.36),
         ('you', 5.37, 5.71),
         ('really', 7.21, 7.51),
         ('yes', 8.32, 8.54)]
# (speaker, start_time, end_time)
segments = [('spk1', 0.0, 1.25),
            ('spk2', 4.75, 6.25),
            ('spk1', 6.75, 7.75),
            ('spk2', 8.25, 9.25)]
I want to group the items of words that fall within the start_time and end_time range of each item in segments, and produce something like this:
res = [('i', 'like', 'you'),
       ('really',),
       ('yes',)]
That is, each item in res contains all the items of words whose start_time and end_time fall between the start_time and end_time of the corresponding item in segments.
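I came up with this solution while typing the question. I guess Stack Overflow makes a good rubber duck. Still, I would love to hear whether there is a more time-efficient approach.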
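res = []
cur = 0  # index of the first word not yet assigned to a segment
for speaker, start, end in segments:
    sent = []
    for i, (word, word_start, word_end) in enumerate(words[cur:]):
        if word_start >= end:
            # this word belongs to a later segment; resume from it next time
            cur = cur + i
            break
        sent.append(word)
    else:
        # the loop ended without a break: every remaining word was consumed
        cur = len(words)
    res.append((speaker, start, end, round(end - start, 2), " ".join(sent)))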
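For comparison, here is a minimal sketch that returns the tuple-shaped res described earlier in the question rather than the joined strings. The function name group_words is hypothetical, and the sketch assumes that a word counts only when it lies entirely inside a segment, so words that fall in a gap or straddle a segment boundary are dropped. Each word is examined once, so the pass is linear in len(words) + len(segments).

```python
def group_words(words, segments):
    """Group words by the segment whose time range contains them.

    Assumes both lists are sorted by start time and that items within
    each list do not overlap (hypothetical helper, not from the question).
    """
    groups = []
    cur = 0  # index of the first word not yet examined
    for _speaker, seg_start, seg_end in segments:
        group = []
        # consume every word that starts before this segment ends
        while cur < len(words) and words[cur][1] < seg_end:
            word, word_start, word_end = words[cur]
            # keep only words lying entirely inside the segment (assumption)
            if word_start >= seg_start and word_end <= seg_end:
                group.append(word)
            cur += 1
        groups.append(tuple(group))
    return groups


words = [('i', 5.12, 5.23),
         ('like', 5.24, 5.36),
         ('you', 5.37, 5.71),
         ('really', 7.21, 7.51),
         ('yes', 8.32, 8.54)]
segments = [('spk1', 0.0, 1.25),
            ('spk2', 4.75, 6.25),
            ('spk1', 6.75, 7.75),
            ('spk2', 8.25, 9.25)]

# drop segments that contain no words to match the shape shown in the question
res = [g for g in group_words(words, segments) if g]
print(res)
```

The first segment contains no words, so group_words returns an empty tuple for it. The final list comprehension filters that out.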
A single loop should be fast.
res = [  # initialize beforehand
    [
        seg[0],
        seg[1],
        seg[2],
        round(seg[2] - seg[1], 2),
        '',  # with empty speech
    ] for seg in segments
]
i = 0  # index of res
for word in words:  # for each row of word
    while word[1] >= res[i][2]:  # next speaker? (while, so empty segments are skipped)
        i = i + 1  # next res index
    if res[i][4]:  # not empty speech
        res[i][4] = res[i][4] + ' ' + word[0]  # space in between
    else:  # empty speech
        res[i][4] = word[0]  # initialize it
Happy Sunday!
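As a possible alternative to walking both lists, here is a sketch based on binary search with the standard-library bisect module. The function name words_per_segment is hypothetical, and the sketch assumes a word belongs to a segment when its start_time lies in the half-open range [start_time, end_time) of that segment. It is not asymptotically faster than the single loop when every segment is processed, but it may help when only a few segments need to be looked up against a long word list.

```python
from bisect import bisect_left


def words_per_segment(words, segments):
    """Return (speaker, joined words) per segment using binary search.

    Assumes words is sorted by start time (hypothetical helper).
    """
    starts = [w[1] for w in words]  # sorted word start times
    out = []
    for speaker, seg_start, seg_end in segments:
        lo = bisect_left(starts, seg_start)  # first word starting at or after seg_start
        hi = bisect_left(starts, seg_end)    # first word starting at or after seg_end
        out.append((speaker, " ".join(w[0] for w in words[lo:hi])))
    return out


words = [('i', 5.12, 5.23),
         ('like', 5.24, 5.36),
         ('you', 5.37, 5.71),
         ('really', 7.21, 7.51),
         ('yes', 8.32, 8.54)]
segments = [('spk1', 0.0, 1.25),
            ('spk2', 4.75, 6.25),
            ('spk1', 6.75, 7.75),
            ('spk2', 8.25, 9.25)]

print(words_per_segment(words, segments))
```

Unlike the loop above, this variant ignores words that start in a gap between segments instead of attaching them to the next segment.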