Restructuring subtitles so that each one ends with a complete sentence



I have the following SRT (subtitle) file:

import pysrt
srt = """
01
00:02:14,000 --> 00:02:18,000
I understand how customers do their choice. So
02
00:02:19,000 --> 00:02:24,000
what is the choice of packaging that they prefer when they have to pick up something in a shelf?
03
00:02:24,000 --> 00:02:29,000
What is the choice of the store where they will go shopping? What specific
04
00:02:29,000 --> 00:02:34,000
product they will purchase and also what is the brand that they will
05
00:02:34,000 --> 00:02:39,000
prefer. And of course many of the choices that are relevant in the context of marketing.
"""

As you can see, the subtitles are split in odd places. I would like each subtitle to end with a complete sentence, for example:

srt = """
01
00:02:14,000 --> 00:02:18,000
I understand how customers do their choice. 
02
00:02:19,000 --> 00:02:24,000
So what is the choice of packaging that they prefer when they have to pick up something in a shelf?
03
00:02:24,000 --> 00:02:29,000
What is the choice of the store where they will go shopping? 
04
00:02:29,000 --> 00:02:34,000
What specific product they will purchase and also what is the brand that they will prefer. 
05
00:02:34,000 --> 00:02:39,000
And of course many of the choices that are relevant in the context of marketing.
"""

I would like to know how to achieve this with Python. The subtitle text can be opened with pysrt:

import pysrt
srt = """
01
00:02:14,000 --> 00:02:18,000
I understand how customers do their choice. So
02
00:02:19,000 --> 00:02:24,000
what is the choice of packaging that they prefer when they have to pick up something in a shelf?
03
00:02:24,000 --> 00:02:29,000
What is the choice of the store where they will go shopping? What specific
04
00:02:29,000 --> 00:02:34,000
product they will purchase and also what is the brand that they will
05
00:02:34,000 --> 00:02:39,000
prefer. And of course many of the choices that are relevant in the context of marketing."""

with open("test.srt", "w") as text_file:
    text_file.write(srt)
sub = pysrt.open("test.srt")
text = sub.text

**Edit:**

Based on @chris's answer, I tried:

import re
from operator import itemgetter
srt = """
    01
    00:02:14,000 --> 00:02:18,000
    understand how customers do their choice. So
    02
    00:02:19,000 --> 00:02:24,000
    what is the choice of packaging that they prefer when they have to pick up something in a shelf?
    03
    00:02:24,000 --> 00:02:29,000
    What is the choice of the store where they will go shopping? What specific
    04
    00:02:29,000 --> 00:02:34,000
    product they will purchase and also what is the brand that they will
    05
    00:02:34,000 --> 00:02:39,000
    prefer. And of course many of the choices that are relevant in the context of marketing.
    """

l = [s.split('\n') for s in srt.strip().split('\n\n')]
whole = ' '.join(map(itemgetter(2), l))
for i, sen in enumerate(re.findall(r'([A-Z][^.!?]*[.!?])', whole)):
    l[i][2] = sen
print('\n\n'.join('\n'.join(s) for s in l))

However, the output I got was exactly the same as the input:

01
    00:02:14,000 --> 00:02:18,000
    understand how customers do their choice. So
    02
    00:02:19,000 --> 00:02:24,000
    what is the choice of packaging that they prefer when they have to pick up something in a shelf?
    03
    00:02:24,000 --> 00:02:29,000
    What is the choice of the store where they will go shopping? What specific
    04
    00:02:29,000 --> 00:02:34,000
    product they will purchase and also what is the brand that they will
    05
    00:02:34,000 --> 00:02:39,000
    prefer. And of course many of the choices that are relevant in the context of marketing.

What am I doing wrong?

This is a bit messy and probably error-prone, but it works as expected:

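import re
from operator import itemgetter

# Split into entries (separated by blank lines), then each entry into its lines.
l = [s.split('\n') for s in srt.strip().split('\n\n')]
# Join the text line (third line) of every entry into one string.
whole = ' '.join(map(itemgetter(2), l))
# Write each sentence found back into the entry at the same position.
for i, sen in enumerate(re.findall(r'([A-Z][^.!?]*[.!?])', whole)):
    l[i][2] = sen
print('\n\n'.join('\n'.join(s) for s in l))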

Output:

01
00:02:14,000 --> 00:02:18,000
I understand how customers do their choice.
02
00:02:19,000 --> 00:02:24,000
So what is the choice of packaging that they prefer when they have to pick up something in a shelf?
03
00:02:24,000 --> 00:02:29,000
What is the choice of the store where they will go shopping?
04
00:02:29,000 --> 00:02:34,000
What specific product they will purchase and also what is the brand that they will prefer.
05
00:02:34,000 --> 00:02:39,000
And of course many of the choices that are relevant in the context of marketing.
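The approach above relies on a blank line between entries. The string in the question's edit appears to have none (and is indented), so the whole input would end up as a single entry and no complete sentence would be found in its third line, which would explain getting the input back unchanged. As a rough sketch, entries could instead be located by their timing line, so that blank lines and indentation no longer matter. `parse_cues` and `TIMING` are names made up for this example, and the sketch assumes every timing line is directly preceded by the entry index:

```python
import re

# Matches an SRT timing line such as "00:02:14,000 --> 00:02:18,000".
TIMING = re.compile(r'\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}')


def parse_cues(srt_text):
    """Return [index, timing, text] lists; blank lines and indentation are optional."""
    # Drop empty lines and surrounding whitespace.
    lines = [ln.strip() for ln in srt_text.splitlines() if ln.strip()]
    cues = []
    for i, ln in enumerate(lines):
        if TIMING.fullmatch(ln) and i > 0:
            # Assumption: the line right before a timing line is the entry index.
            cues.append([lines[i - 1], ln, []])
        elif cues and not (i + 1 < len(lines) and TIMING.fullmatch(lines[i + 1])):
            # Any other line is subtitle text, unless it is the index of the next entry.
            cues[-1][2].append(ln)
    return [[idx, timing, ' '.join(text)] for idx, timing, text in cues]
```

Each returned entry has the text at position 2, so the `itemgetter(2)` step works on the result as before. Note that a subtitle whose last text line sits directly above a timing line with no index in between would be misread as an index.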

Reference for the regex part: Regex to find all sentences of text?
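One reason this may be error-prone: it pairs the n-th sentence with the n-th subtitle, which only lines up when both counts happen to match, and it skips sentences that do not start with a capital letter. A possible alternative, sketched below, is to give each sentence to the subtitle that already contains most of its characters. `regroup` and `SENTENCE` are hypothetical names, and the sentence pattern is a simple heuristic (abbreviations or decimals such as "3.5" would be split incorrectly):

```python
import re

# A sentence: text up to and including its terminator(s),
# or a trailing fragment with no terminator at the very end.
SENTENCE = re.compile(r'\S[^.!?]*[.!?]+|\S[^.!?]*$')


def regroup(texts):
    """Reassign each sentence to the cue that holds most of its characters."""
    # Record the character span each cue occupies in the joined text.
    spans, pos = [], 0
    for t in texts:
        spans.append((pos, pos + len(t)))
        pos += len(t) + 1  # +1 for the joining space
    whole = ' '.join(texts)

    out = [[] for _ in texts]
    for m in SENTENCE.finditer(whole):
        start, end = m.span()
        # Pick the cue with the largest overlap; ties go to the earlier cue.
        best = max(
            range(len(texts)),
            key=lambda i: min(end, spans[i][1]) - max(start, spans[i][0]),
        )
        out[best].append(m.group().strip())
    # A cue that receives no sentence ends up with an empty string.
    return [' '.join(parts) for parts in out]
```

With pysrt, the input could be `[s.text for s in sub]` and each result assigned back to `s.text`. The timestamps are left untouched, so text that moves between subtitles may appear slightly earlier or later than it is spoken.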
