I am trying to match paragraphs using Python and the re module.
Example text:
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum.
<two or more line breaks here>
Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.
<two or more line breaks here>
Ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.
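This expression seems to almost do the job:
paragraphs = re.findall(r'(?s)((?:[^\n][\n]?)+)', textContent)
But I want to make sure it only splits where there are two or more line breaks. At the moment it matches too often.
Edit: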
ART. WEFWEFEW
1 SDVSDRG: **<at the moment it breaks here, but it shouldn't>**
a. wevvdfvdfd
b. sdfsdfsdfsdfsdfsdghtrhrth
Edit 2:
ART. WEFWEFEW
1 SDVSDRG:
**here are two line breaks, but don't split this paragraph**
**at the moment it breaks here, but it shouldn't**
a. wevvdfvdfd
b. sdfsdfsdfsdfsdfsdghtrhrth
See the regex (?m)(?:.+(?:\n.)?)+ on RegEx101, where you can also get an explanation of it.
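For readers without the RegEx101 page at hand, here is a rough breakdown of the pattern, written with re.VERBOSE so each part can carry a comment. The names PARAGRAPH and sample are only illustrative. The (?m) flag is left out of this sketch because the pattern has no ^ or $ anchors for it to affect.

```python
import re

# The same pattern, spelled out piece by piece.
PARAGRAPH = re.compile(r"""
    (?:            # one or more repetitions of:
        .+         #   the rest of the current line (a dot never matches a newline here)
        (?:\n.)?   #   optionally one newline plus the first character of the next line
    )+
""", re.VERBOSE)

# Illustrative input: two lines, a blank line, then another paragraph.
sample = "first line\nsecond line\n\nnext paragraph"
paragraphs = PARAGRAPH.findall(sample)
print(paragraphs)  # ['first line\nsecond line', 'next paragraph']
```

A blank line stops the match because after the first newline the next character is another newline, which the dot cannot consume.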
Sample Python code using this regex:
import re
import pprint

# Paragraphs are separated by blank lines; the last block is indented.
textContent = '''Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore
magna aliquyam erat, sed diam voluptua. At vero eos et
accusam et justo duo dolores et ea rebum.

Stet clita kasd gubergren, no sea takimata sanctus est Lorem
ipsum dolor sit amet.

Ipsum dolor sit amet, consetetur sadipscing elitr, sed diam
nonumy eirmod tempor invidunt ut labore et dolore magna
aliquyam erat, sed diam voluptua. At vero eos et accusam et
justo duo dolores et ea rebum. Stet clita kasd gubergren, no
sea takimata sanctus est Lorem ipsum dolor sit amet.

ART. WEFWEFEW
    1 SDVSDRG:
        a. wevvdfvdfd
        b. sdfsdfsdfsdfsdfsdghtrhrth'''

# Each match is a run of consecutive non-empty lines.
pprint.pprint(re.findall(r'(?m)(?:.+(?:\n.)?)+', textContent))
Output:
['Lorem ipsum dolor sit amet, consetetur sadipscing elitr,\n'
 'sed diam nonumy eirmod tempor invidunt ut labore et dolore\n'
 'magna aliquyam erat, sed diam voluptua. At vero eos et\n'
 'accusam et justo duo dolores et ea rebum.',
 'Stet clita kasd gubergren, no sea takimata sanctus est Lorem\n'
 'ipsum dolor sit amet.',
 'Ipsum dolor sit amet, consetetur sadipscing elitr, sed diam\n'
 'nonumy eirmod tempor invidunt ut labore et dolore magna\n'
 'aliquyam erat, sed diam voluptua. At vero eos et accusam et\n'
 'justo duo dolores et ea rebum. Stet clita kasd gubergren, no\n'
 'sea takimata sanctus est Lorem ipsum dolor sit amet.',
 'ART. WEFWEFEW\n'
 '    1 SDVSDRG:\n'
 '        a. wevvdfvdfd\n'
 '        b. sdfsdfsdfsdfsdfsdghtrhrth']
Demo on Rextester.
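One caveat, offered as a side note rather than as part of the answer above: because the optional (?:\n.) consumes the first character of the next line, a line that is exactly one character long appears to end the match early. A possible alternative is to split on blank lines instead of matching paragraphs. The sketch below assumes that a "blank" line may also contain spaces or tabs, and the helper name split_paragraphs is hypothetical.

```python
import re

def split_paragraphs(text):
    """Split text wherever two or more line breaks occur in a row."""
    # A newline followed by one or more (optionally whitespace-only) lines.
    # The separator ends on a newline, so the indentation of the next
    # paragraph's first line is preserved.
    parts = re.split(r'\n(?:[ \t]*\n)+', text.strip('\n'))
    return [p for p in parts if p.strip()]

# A one-character line in the middle of a paragraph.
tricky = "abc\nd\nefg"
print(re.findall(r'(?:.+(?:\n.)?)+', tricky))  # ['abc\nd', 'efg']
print(split_paragraphs(tricky))                # ['abc\nd\nefg']

# Blank lines, including one that holds a stray space, separate paragraphs.
print(split_paragraphs("a\nb\n\nc\n \n\n  d\ne"))  # ['a\nb', 'c', '  d\ne']
```

Neither approach handles the case from Edit 2, where a blank line inside an indented block should not start a new paragraph; that would need an extra rule based on indentation.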