Extract a sentence from a long string using the start and end index of a word in Python

Suppose I have a long text.

doc = """I was chasing a dog. I ran after it for a long time.
...
...
...
However, after running for about an hour, I caught the dog"""

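After some processing and computation, I know that the word "time" has start index i, i.e. doc[i:i+4] == "time". My question is: is there an efficient way to extract the sentence that contains this word from the doc variable? In this case, the sentence I should get is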

I ran after it for a long time.

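So, given the start and end index of a word in a long string, is it possible to extract the sentence that contains it? I don't want to sentence-tokenize the document and iterate over every sentence to check whether it contains the word. The main reason is that I may have many query words, so I don't want to iterate over every sentence for each query.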

You can use regex to solve this:

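import regex  # third-party module (pip install regex); supports variable-length lookbehind
from typing import List


def extract_sentences(doc: str, start_index: int, word_len: int) -> List[str]:
    word = doc[start_index:start_index + word_len]
    # A sentence starts at the beginning of the text or right after a period and
    # runs up to the next period or the end of the text. The capture group drops
    # the whitespace that follows the previous period.
    pattern = regex.compile(r"(?<=^|\.)\s*([^.]*%s[^.]*)(?=\.|$)" % regex.escape(word))
    return pattern.findall(doc)


if __name__ == '__main__':
    doc = """
I was chasing a dog. I ran after it for a long time.
...
...
...
However, after running for about an hour, I caught the dog
"""
    print(extract_sentences(doc, 48, 4))
    # ['I ran after it for a long time']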

The idea is to build a regular expression from the word and extract every sentence that contains it.

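Note that I used regex rather than re because it allows variable-length lookbehind. That way the case where the word appears in the first sentence is handled correctly, for example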

doc = """I ran after it for a long time.
...
...
...
However, after running for about an hour, I caught the dog
"""
print(extract_sentences(doc, 26, 4))
# ['I ran after it for a long time']
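
The regex above returns every sentence that contains the word, not only the one at the given index. If only that one sentence is wanted, a minimal sketch is to search outward from the index for the nearest periods. This is a plain-string alternative rather than the approach above; sentence_around is a hypothetical name, and like the regex it assumes that every sentence ends with a period.

```python
def sentence_around(doc: str, start_index: int, word_len: int) -> str:
    # Hypothetical helper: assumes sentences are delimited by "." only.
    # rfind returns -1 when there is no earlier period, so left becomes 0.
    left = doc.rfind(".", 0, start_index) + 1
    right = doc.find(".", start_index + word_len)
    # Include the closing period, or run to the end of the text if none follows.
    right = len(doc) if right == -1 else right + 1
    return doc[left:right].strip()


doc = "I was chasing a dog. I ran after it for a long time. However, I caught the dog"
i = doc.index("time")
print(sentence_around(doc, i, 4))
# I ran after it for a long time.
```

Each lookup only scans the text between the two neighbouring periods, so no sentence list has to be built or iterated.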

This is very easy in spaCy.

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(...) # use your raw text here
word = doc.char_span(i, i+4)
# word will be None if the char span isn't valid
if word is not None:
    sent = word.sent

This assumes your word aligns with spaCy's tokens, but that seems like a reasonable assumption.
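
If there are many query words, one possible sketch is to compute the sentence boundaries once and then find the sentence for each index with a binary search instead of iterating over sentences. The offsets below are hard-coded for illustration; with spaCy they could presumably be collected as (s.start_char, s.end_char) for each s in doc.sents. The names spans, starts and sentence_at are hypothetical.

```python
from bisect import bisect_right

text = "I was chasing a dog. I ran after it for a long time. I caught the dog."

# Hypothetical (start, end) character offsets of each sentence, computed once
# by whatever sentence segmenter is in use, sorted by start offset.
spans = [(0, 20), (21, 52), (53, 70)]
starts = [start for start, _ in spans]


def sentence_at(i: int):
    # Find the last sentence whose start offset is <= i.
    k = bisect_right(starts, i) - 1
    if k < 0:
        return None
    start, end = spans[k]
    # i may fall in the gap between two sentences.
    return text[start:end] if i < end else None


print(sentence_at(text.index("time")))
# I ran after it for a long time.
```

Each query then costs a binary search over the sentence starts, regardless of how many query words there are.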

Using regular expressions to tokenize English sentences doesn't work ("I said hello to Mr. Smith" is one sentence) and should be avoided.
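
A small illustration of the problem, using only the standard library: a naive split after every period breaks the sentence at the abbreviation. The sample text is made up for this example.

```python
import re

text = "I said hello to Mr. Smith. He waved back."

# Naive rule: a sentence ends at a period followed by whitespace.
pieces = re.split(r"(?<=\.)\s+", text)
print(pieces)
# ['I said hello to Mr.', 'Smith.', 'He waved back.']
```

The first sentence is cut in two at "Mr.", which is the kind of error a trained sentence segmenter is meant to avoid.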
