Python regex to match a complete sentence containing a keyword, without breaking on periods that do not end the sentence (.com, U.S., etc.)



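I am trying to create a regular expression that matches a complete sentence containing a keyword. Here is a sample paragraph:

"Cash taxes paid, net of refunds, were $412 million 2016. The U.S. Tax Act imposed a mandatory one-time tax on accumulated earnings of foreign subsidiaries and changed how foreign earnings are subject to U.S. tax."

I want to match the complete sentence that contains the keyword "subsidiaries". To do that, I have been using the following regular expression: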

[^.]*?subsidiaries[^.]*\.

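However, this only matches "Tax Act imposed a mandatory one-time tax on accumulated earnings of foreign subsidiaries and changed how foreign earnings are subject to U." because the match both starts and ends at a "." inside "U.S.". Is there a way to specify in the expression that it should not stop at particular strings such as "U.S." or ".com"?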
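To make the failure concrete, the short check below runs the pattern from the question against the sample paragraph. It assumes the final dot in the pattern is meant as an escaped literal period (`\.`); the variable names are illustrative only.

```python
import re

text = (
    "Cash taxes paid, net of refunds, were $412 million 2016. "
    "The U.S. Tax Act imposed a mandatory one-time tax on accumulated "
    "earnings of foreign subsidiaries and changed how foreign earnings "
    "are subject to U.S. tax."
)

# [^.] can never cross a period, so the match is confined to the text
# between the "." in the first "U.S." and the "." after the second "U".
pattern = r"[^.]*?subsidiaries[^.]*\."
match = re.search(pattern, text)
truncated = match.group(0).strip()
print(truncated)
# => Tax Act imposed a mandatory one-time tax on accumulated earnings of foreign subsidiaries and changed how foreign earnings are subject to U.
```

The negated class `[^.]` treats every period as a sentence boundary, which is why abbreviations cut the match short.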

I suggest tokenizing the text into sentences with NLTK and then checking each item for the presence of the string.

import nltk, re
text = "Cash taxes paid, net of refunds, were $412 million 2016. The U.S. Tax Act imposed a mandatory one-time tax on accumulated earnings of foreign subsidiaries and changed how foreign earnings are subject to U.S. tax."
sentences = nltk.sent_tokenize(text)
word = "subsidiaries"
print([sent for sent in sentences if word in sent])
# => ['The U.S. Tax Act imposed a mandatory one-time tax on accumulated earnings of foreign subsidiaries and changed how foreign earnings are subject to U.S. tax.']

To extract only declarative sentences (those ending with .), add an and sent.endswith('.') condition:

print([sent for sent in sentences if word in sent and sent.endswith('.')])

You can even check that the word you are filtering on appears as a whole word, using a regex search:

print([sent for sent in sentences if re.search(r'\b{}\b'.format(word), sent)])
# => ['The U.S. Tax Act imposed a mandatory one-time tax on accumulated earnings of foreign subsidiaries and changed how foreign earnings are subject to U.S. tax.']
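If a regex-only solution is still preferred, one possible sketch is to let the pattern step over each protected string as a single unit, so that its periods are not treated as sentence ends. The function name `sentences_with_keyword` and its `exceptions` parameter are hypothetical, introduced here for illustration; they are not part of `re` or NLTK.

```python
import re

def sentences_with_keyword(text, keyword, exceptions=("U.S.", ".com")):
    """Return period-terminated sentences containing `keyword` (hypothetical helper)."""
    # Each exception is tried first and consumed whole; otherwise a
    # single non-period character is consumed.
    protected = "|".join(re.escape(e) for e in exceptions)
    chunk = r"(?:{}|[^.])".format(protected)
    # Lazy run of chunks, the keyword as a whole word, a greedy run of
    # chunks, then the period that ends the sentence.
    pattern = r"{c}*?\b{k}\b{c}*\.".format(c=chunk, k=re.escape(keyword))
    return [m.group(0).strip() for m in re.finditer(pattern, text)]

text = (
    "Cash taxes paid, net of refunds, were $412 million 2016. "
    "The U.S. Tax Act imposed a mandatory one-time tax on accumulated "
    "earnings of foreign subsidiaries and changed how foreign earnings "
    "are subject to U.S. tax."
)
result = sentences_with_keyword(text, "subsidiaries")
print(result)
# => ['The U.S. Tax Act imposed a mandatory one-time tax on accumulated earnings of foreign subsidiaries and changed how foreign earnings are subject to U.S. tax.']
```

This only protects strings that are listed explicitly, and it will run past a real sentence boundary when a sentence happens to end with one of them (for example "... in the U.S. The next ..."), so the tokenizer approach above is likely to be more robust on real text.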
