Text segmentation based on direct sentences



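Suppose I have a docx file like this:

When I was a young boy my father took me into the city to see a marching band. He said, "Son when you grow up would you be the savior of the broken?" My father sat beside me, hugging my shoulders with both of his arms. I said "I Would." My father replied "That is my boy!"

I want to segment the docx by direct sentences, like this:

sent1 : He said, "Son when you grow up would you be the savior of the broken?"

sent2 : I said "I Would."

sent3 : My father replied "That is my boy!"

I tried using a regular expression. The result looks like this: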

When I was a young boy my father took me into the city to see a marching band.
He said, "Son when you grow up would you be the savior of the broken?
".
My father sat beside me, hugging my shoulders with both of his arms.
I said "I Would.
".
My father replied "That is my boy!

Regex code:

import re

SENTENCE_REGEX = re.compile('[^!?.]+[!?.]')

# Reads the file as plain text; a real .docx is a zip archive and needs a docx library.
with open('text.docx', 'r') as f:
    text = f.read()

def parse_sentences(text):
    return [x.lstrip() for x in SENTENCE_REGEX.findall(text)]

def print_sentences(sentences):
    print("\n\n".join(sentences))

if __name__ == "__main__":
    print_sentences(parse_sentences(text))

import re

txt = '''When I was a young boy my father took me into the city to see a marching band. He said, "Son when you grow up would you be the savior of the broken?" My father sat beside me, hugging my shoulders with both of his arms. I said "I Would." My father replied "That is my boy!"'''

# Match ., ? or !, an optional closing quote, then one whitespace character
pttrn = re.compile(r'(\.|\?|\!)(\'|\")?\s')
# Keep the punctuation and the quote, and replace the whitespace with a blank line
new = re.sub(pttrn, r'\1\2\n\n', txt)
print(new)

Output:

When I was a young boy my father took me into the city to see a marching band.

He said, "Son when you grow up would you be the savior of the broken?"

My father sat beside me, hugging my shoulders with both of his arms.

I said "I Would."

My father replied "That is my boy!"

Side note: as far as I know, endings such as ?". or .". or !". are not allowed in English.
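
The question asks only for the sentences that contain direct speech (sent1 to sent3), so a filtering step may still be needed after splitting. The following is a minimal sketch, not a definitive implementation: the names `split_sentences` and `direct_sentences` are hypothetical, and it assumes straight double quotes, a closing quote placed right after the final punctuation, and no abbreviations such as "Mr." inside a sentence.

```python
import re

# A sentence is a run of non-terminators, then one or more of . ? !,
# then an optional closing quote that stays attached to the sentence.
SENTENCE = re.compile(r'[^.?!]+[.?!]+["\']?')

def split_sentences(text):
    """Split text into sentences, keeping a trailing closing quote."""
    return [s.strip() for s in SENTENCE.findall(text)]

def direct_sentences(text):
    """Keep only the sentences that contain a double quote (direct speech)."""
    return [s for s in split_sentences(text) if '"' in s]

txt = ('When I was a young boy my father took me into the city to see a '
       'marching band. He said, "Son when you grow up would you be the '
       'savior of the broken?" My father sat beside me, hugging my shoulders '
       'with both of his arms. I said "I Would." '
       'My father replied "That is my boy!"')

for i, sentence in enumerate(direct_sentences(txt), 1):
    print(f"sent{i} : {sentence}")
```

Text copied from Word often uses curly quotes (“ ”), in which case both the pattern and the filter would have to include those characters as well.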
