Suppose I have a long text.
doc = """I was chasing a dog. I ran after it for a long time.
...
...
...
However, after running for about an hour, I caught the dog"""
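After some processing and computation, I know that the word "time" has start index i, i.e. doc[i:i+4] == "time". My question is: is there an efficient way to extract the sentence that contains this word from the doc variable? In this case, the sentence I should get is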
I ran after it for a long time.
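So, given the start index and length of a word in a long string, is it possible to extract the sentence that contains the word? I don't want to sentence-tokenize the document and iterate over every sentence to check whether it contains the word. The main reason is that I may have many query words, and I don't want to iterate over every sentence for each query.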
You can use regex to solve this:
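import regex
from typing import List


def extract_sentences(doc: str, start_index: int, word_len: int) -> List[str]:
    # Recover the query word from its position in the document.
    word = doc[start_index:start_index + word_len]
    # A sentence starts at the beginning of the text or right after a period
    # and runs up to the next period or the end of the text. The
    # variable-length lookbehind (?<=^|\.) needs the third-party regex module.
    pattern = regex.compile(
        r"(?<=^|\.)\s*([^\s.][^.]*%s[^.]*)(?=\.|$)" % regex.escape(word)
    )
    # findall returns the contents of the single capture group.
    return pattern.findall(doc)


if __name__ == '__main__':
    doc = """
I was chasing a dog. I ran after it for a long time.
...
...
...
However, after running for about an hour, I caught the dog
"""
    print(extract_sentences(doc, 48, 4))
    # ['I ran after it for a long time']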
The idea is to build a regular expression from the word and extract every sentence that contains it.
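Because the pattern is built from the word alone, it returns every sentence that contains the word, not only the one at start_index. As a rough sketch of a variant that uses the index directly, assuming (naively) that every sentence ends with a period, the nearest periods on either side of the index can be located with str.rfind and str.find. The helper name sentence_at and the sample text are made up for illustration.

```python
def sentence_at(doc: str, start_index: int) -> str:
    """Return the period-delimited sentence that contains doc[start_index]."""
    # Last period before the word. rfind returns -1 when there is none,
    # so left becomes 0, the start of the document.
    left = doc.rfind(".", 0, start_index) + 1
    # First period at or after the word.
    right = doc.find(".", start_index)
    if right == -1:
        # No closing period: the sentence runs to the end of the text.
        return doc[left:].strip()
    return doc[left:right + 1].strip()


sample = (
    "I was chasing a dog. I ran after it for a long time.\n"
    "However, after running for about an hour, I caught the dog"
)
i = sample.index("time")
print(sentence_at(sample, i))
```

Each lookup only inspects the text around the given index, so many query words can be handled without tokenizing the document or scanning every sentence. The period-only rule is the weak point: abbreviations such as "Mr." will cut a sentence short.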
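Note that I used regex rather than re because it allows variable-length lookbehind. This way, the case where the word appears in the first sentence is handled correctly, for example: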
doc = """I ran after it for a long time.
...
...
...
However, after running for about an hour, I caught the dog
"""
print(extract_sentences(doc, 26, 4))
# ['I ran after it for a long time']
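To illustrate the remark about re, here is a small check, offered as a sketch, that the standard-library module rejects a lookbehind whose alternatives have different widths ("^" matches zero characters, "\." matches one). The helper compiles_with_re is a made-up name, and the second pattern is one possible standard-library workaround rather than part of the answer above.

```python
import re


def compiles_with_re(pattern: str) -> bool:
    """Report whether the standard-library re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Lookbehind with alternatives of width 0 ("^") and width 1 ("\."):
# re requires every alternative to have the same fixed width.
print(compiles_with_re(r"(?<=^|\.)\s*\w"))

# Possible workaround: keep "^" outside the lookbehind so that the
# lookbehind itself has a fixed width of one character.
print(compiles_with_re(r"(?:^|(?<=\.))\s*\w"))
```

The first call reports that the pattern is rejected and the second that it is accepted, which is why the answer reaches for regex when the anchor sits inside the lookbehind.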
This is very easy in spaCy.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(...) # use your raw text here
word = doc.char_span(i, i + 4)  # i is the character offset of the word
# word will be None if the char span isn't valid
if word is not None:
    sent = word.sent  # the sentence Span that contains the word
This assumes your word aligns with spaCy's token boundaries, but that seems like a reasonable assumption.
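Using regular expressions to split English text into sentences does not work reliably ("I said hello to Mr. Smith." is a single sentence) and should be avoided.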
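A quick way to see the problem, as a sketch only: the splitting rule below (a period followed by whitespace ends a sentence) is a deliberately naive assumption, and the second sentence of the sample text is made up for illustration.

```python
import re

text = "I said hello to Mr. Smith. He waved back."

# Naive rule: a sentence ends at a period that is followed by whitespace.
naive_sentences = re.split(r"(?<=\.)\s+", text)

for sentence in naive_sentences:
    print(repr(sentence))
```

The abbreviation "Mr." is treated as a sentence end, so the two sentences come back as three fragments. Patching the pattern for one abbreviation just moves the problem to the next one, which is the argument for a trained sentence segmenter.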