Preventing spaCy from splitting paragraph numbers into their own sentences



I'm using spaCy to do sentence segmentation on text that uses paragraph numbers, for example:

text = '3. English law takes a dim view of stealing stuff from the shops. Some may argue that this is a pity.'

I'm trying to force spaCy's sentence segmenter not to split "3." into its own sentence.

Currently, the following code returns three separate sentences:

import spacy

nlp = spacy.load("en_core_web_sm")
text = """3. English law takes a dim view of stealing stuff from the shops. Some may argue that this is a pity."""
doc = nlp(text)
for sent in doc.sents:
    print("****", sent.text)

This returns:

**** 3.
**** English law takes a dim view of stealing stuff from the shops.
**** Some may argue that this is a pity.

I've been trying to prevent this by adding a custom rule to the pipeline before the parser:

if token.text == r'\d.':
    doc[token.i+1].is_sent_start = False

This seems to have no effect. Has anyone run into this before?
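One likely reason the rule above has no effect is that `token.text == r'\d.'` compares against the literal two-character string `\d.` rather than matching a pattern. A pure-Python sketch of a boundary-setting component that uses a real regex instead (the function name `prevent_para_number_splits` is mine, and it assumes the component is registered before the parser, spaCy v2 style: `nlp.add_pipe(prevent_para_number_splits, before="parser")`):

```python
import re

# Matches a paragraph number kept as one token, e.g. "3." or "42."
PARA_NUM = re.compile(r"^\d+\.$")

def prevent_para_number_splits(doc):
    """Custom pipeline component: the token following a paragraph
    number must not be allowed to start a new sentence."""
    for token in doc[:-1]:
        # Handle both plausible tokenizations: "3." as a single token,
        # or "3" followed by a separate "." token.
        if PARA_NUM.match(token.text) or (
            token.text == "." and token.i > 0 and doc[token.i - 1].text.isdigit()
        ):
            doc[token.i + 1].is_sent_start = False
    return doc

# Registration (spaCy v2 API):
# nlp.add_pipe(prevent_para_number_splits, before="parser")
```

The component only touches tokens it positively identifies as paragraph numbers, so ordinary sentence-final periods (e.g. after "shops") are left for the parser to handle.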

Something like this?

text = ["""3. English law takes a dim view of stealing stuff from the shops. Some may argue that this is a pity. Are you upto something?""", 
"""4. It's hilarious and I think this can be more of a political moment. Don't you think so? Will Robots replace humans?"""]
for i in text:
    doc = nlp(i)
    span = doc[0:5]
    span.merge()
    for sent in doc.sents:
        print("****", sent.text)
    print("\n")

Output:

**** 3. English law takes a dim view of stealing stuff from the shops.
**** Some may argue that this is a pity.
**** Are you upto something?

**** 4. It's hilarious and I think this can be more of a political moment.
**** Don't you think so?
**** Will Robots replace humans?

Reference: span.merge()
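Note that span.merge() was deprecated in spaCy 2.1 in favor of doc.retokenize(). A minimal sketch of the same "merge the leading span" idea from the answer above using the retokenizer (the helper name `merge_leading_tokens` is mine; it expects a real spaCy Doc):

```python
def merge_leading_tokens(doc, n=5):
    """Merge the first n tokens of a spaCy Doc into a single token,
    so the paragraph number cannot be segmented off on its own."""
    with doc.retokenize() as retokenizer:
        retokenizer.merge(doc[0:n])
    return doc
```

As with span.merge(), merging a fixed doc[0:5] is a blunt instrument: it assumes every item starts with a paragraph number and glues the first five tokens together, so a rule-based component that only clears is_sent_start after detected paragraph numbers is usually the safer choice.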

Latest update