How to use regular expressions in Python to extract whole sentences from fragments



I have a VTT file like this:

WEBVTT
1
00:00:05.210 --> 00:00:07.710
In this lecture, we're
going to talk about
2
00:00:07.710 --> 00:00:10.815
pattern matching in strings
using regular expressions.
3
00:00:10.815 --> 00:00:13.139
Regular expressions or regexes
4
00:00:13.139 --> 00:00:15.825
are written in a condensed
formatting language.

I want to extract the fragments from the file and merge them into sentences. The output should look like this:

["In this lecture, we're going to talk about pattern matching in strings using regular expressions.", 'Regular expressions or regexes are written in a condensed formatting language.']

I was able to extract the fragments using this:

pattern = r"[A-z0-9 ,.*?='\";\n-\/%$#@!()]+"
content = [i for i in re.findall(pattern, text) if (re.search('[a-zA-Z]', i))]

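I don't know how to extract whole sentences rather than fragments.

Also note that this is only a sample of the VTT file. The whole file contains about 630 fragments, and some of them also contain integers and other special characters.

Thanks for your help.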

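Using re.sub, we can first try removing the unwanted repetitive text (the cue numbers and timestamps). Then we do a second replacement that turns the remaining newlines into single spaces: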

import re

inp = """1
00:00:05.210 --> 00:00:07.710
In this lecture, we're
going to talk about
2
00:00:07.710 --> 00:00:10.815
pattern matching in strings
using regular expressions.
3
00:00:10.815 --> 00:00:13.139
Regular expressions or regexes
4
00:00:13.139 --> 00:00:15.825
are written in a condensed
formatting language."""
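# Remove each cue number and timestamp line; re.M lets ^ match at every line start
output = re.sub(r'^\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}\r?\n', '', inp, flags=re.M)
# Replace the remaining newlines (including blank lines) with a single space
output = re.sub(r'(?:\r?\n)+', ' ', output)
# Collect each run of text up to and including the next period
sentences = re.findall(r'(.*?\.)\s*', output)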
print(sentences)

This prints:

["In this lecture, we're going to talk about pattern matching in strings using regular expressions.",
'Regular expressions or regexes are written in a condensed formatting language.']

I found @Tim Biegeleisen's solution a bit confusing, with its complex regex and multiple substitutions, so here is another option.

import re
_file = """1
00:00:05.210 --> 00:00:07.710
In this lecture, we're
going to talk about
2
00:00:07.710 --> 00:00:10.815
pattern matching in strings
using regular expressions.
3
00:00:10.815 --> 00:00:13.139
Regular expressions or regexes
4
00:00:13.139 --> 00:00:15.825
are written in a condensed
formatting language.
"""
non_fragments = re.compile(r'$|\d+($|:\d+.* --> \d+.*$)')
full_text = " ".join([line for line in _file.splitlines() if not non_fragments.match(line)])
sentences = full_text.split('. ')

This returns:

print(full_text)
In this lecture, we're going to talk about pattern matching in strings using regular expressions. Regular expressions or regexes are written in a condensed formatting language.
print(sentences)
["In this lecture, we're going to talk about pattern matching in strings using regular expressions", 'Regular expressions or regexes are written in a condensed formatting language.']
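Note that splitting on '. ' drops the period from every sentence except the last, and that a file starting with a WEBVTT header line would keep that header as text. The following is a minimal sketch of a variant that handles both, not part of the original answer: the helper name vtt_sentences and the NON_FRAGMENT pattern are hypothetical, and it assumes that every sentence ends with a period followed by whitespace.

```python
import re

# Lines to drop: blank lines, a "WEBVTT" header, cue numbers and timestamps.
# Assumption: no caption line starts with the word WEBVTT.
NON_FRAGMENT = re.compile(r'$|WEBVTT\b|\d+($|:\d+.* --> \d+.*$)')


def vtt_sentences(text):
    """Join the caption lines of a VTT string and split them into sentences."""
    kept = (line.strip() for line in text.splitlines()
            if not NON_FRAGMENT.match(line))
    full_text = " ".join(kept)
    # Split on whitespace that follows a period; the lookbehind keeps
    # the period attached to the sentence it ends.
    return [s for s in re.split(r"(?<=\.)\s+", full_text) if s]


sample = """WEBVTT
1
00:00:05.210 --> 00:00:07.710
In this lecture, we're
going to talk about
2
00:00:07.710 --> 00:00:10.815
pattern matching in strings
using regular expressions.
3
00:00:10.815 --> 00:00:13.139
Regular expressions or regexes
4
00:00:13.139 --> 00:00:15.825
are written in a condensed
formatting language.
"""

sentences = vtt_sentences(sample)
print(sentences)
```

Abbreviations such as "e.g. " would still be treated as sentence ends, so this only suits text where a period followed by whitespace reliably marks the end of a sentence.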

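As a (small) added bonus, this option is about twice as fast as the one using re.sub/re.findall. It is fastest when the regex is precompiled. It was not tested on a very large sample.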

%%timeit
_full_text = " ".join([line for line in _file.splitlines() if not non_fragments.match(line)])
_sentences = _full_text.split('. ')
6.75 µs ± 831 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

It is still faster even if we include the re.compile call in each iteration:

%%timeit
non_fragments = re.compile(r'$|\d+($|:\d+.* --> \d+.*$)')
_full_text = " ".join([line for line in _file.splitlines() if not non_fragments.match(line)])
_sentences = _full_text.split('. ')  
7.97 µs ± 1.13 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

The re.sub/re.findall version takes about twice as long:

%%timeit
output = re.sub(r'(?:^|\r?\n)\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}\r?\n', '', _file)
output = re.sub(r'\r?\n', ' ', output)
sentences = re.findall(r'(.*?\.)\s*', output)
15.2 µs ± 423 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
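The %%timeit lines above are IPython cell magic and only run in IPython or Jupyter. As a rough sketch of how a similar measurement could be made in a plain Python script with the standard timeit module (the function name join_lines, the shortened sample and the loop count are illustrative choices, and the numbers will differ between machines):

```python
import re
import timeit

_file = """1
00:00:05.210 --> 00:00:07.710
In this lecture, we're
going to talk about
2
00:00:07.710 --> 00:00:10.815
pattern matching in strings
using regular expressions.
"""

# Precompiled once, outside the timed function.
non_fragments = re.compile(r'$|\d+($|:\d+.* --> \d+.*$)')


def join_lines():
    # Keep only the lines that are not blank, cue numbers or timestamps.
    full_text = " ".join(
        line for line in _file.splitlines() if not non_fragments.match(line)
    )
    return full_text.split('. ')


# timeit.timeit returns the total seconds for `number` calls.
number = 10_000
total = timeit.timeit(join_lines, number=number)
print(f"{total / number * 1e6:.2f} µs per call")
```

Wrapping the other approach in a second function and timing it the same way gives a like-for-like comparison.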

You could also match the structure of the data in the file, to make sure it is actually there.

^\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3} -->.*\r?\n((?:(?!\d+\r?\n\d\d:).*(?:\r?\n|$))*)

Explanation

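  • ^ Start of the string (start of a line when re.M is used)
  • \d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3} -->.* Match 1+ digits, a newline and a time-like pattern
  • \r?\n Match a newline
  • ( Capture group 1
    • (?: Non-capture group
      • (?!\d+\r?\n\d\d:).*(?:\r?\n|$) Match the whole line if it is not a line of digits followed by a time-like pattern
    • )* Close the group and repeat 0+ times to match all lines
  • ) Close group 1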
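To make the capture group concrete, here is a small sketch that prints what re.findall returns for a shortened two-cue sample. The sample text and the variable name cue_texts are only for illustration.

```python
import re

text = """WEBVTT
1
00:00:05.210 --> 00:00:07.710
In this lecture, we're
going to talk about
2
00:00:07.710 --> 00:00:10.815
pattern matching in strings
using regular expressions.
"""

regex = r"^\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3} -->.*\r?\n((?:(?!\d+\r?\n\d\d:).*(?:\r?\n|$))*)"

# With a single capture group, re.findall returns one string per match:
# the text lines that follow each timestamp, newlines included.
cue_texts = re.findall(regex, text, re.M)
for cue in cue_texts:
    print(repr(cue))
```

The WEBVTT header is skipped because it does not fit the number-then-timestamp structure that every match has to start with.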


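re.findall returns a list of the capture group values, that is, all the text that follows each time pattern.

Then join all the parts with an empty string, replace the newlines with a space, and split on 1 or more whitespace characters that come after a dot.

Example code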

import re

# text holds the contents of the VTT file
regex = r"^\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3} -->.*\r?\n((?:(?!\d+\r?\n\d\d:).*(?:\r?\n|$))*)"
content = [i for i in re.split(r"(?<=\.)\s+", re.sub(r"[\r\n]+", " ", "".join(re.findall(regex, text, re.M)))) if i]
print(content)

["In this lecture, we're going to talk about pattern matching in strings using regular expressions.", 'Regular expressions or regexes are written in a condensed formatting language.']
