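I have a large block of text in which some periods are missing the space that should follow them. However, the text also contains decimal numbers.
Here is my attempt so far to fix the problem with a regex (I am using Python):
re.sub(r"(?!\d\.\d)(?!\. )\.", '. ', my_string)
But the first group, (?!\d\.\d), does not seem to work. It still matches the period inside decimal numbers.
Here is sample text to check that any potential solution works: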
this is a.match
this should also match.1234
and this should 123.match
this should NOT match. Has space after period
this also should NOT match 1.23
You can use
re.sub(r'\.(?!(?<=\d\.)\d) ?', '. ', text)
The trailing space is matched optionally, so if it is present it is removed and then put back by the replacement.
Details
\. - a period
(?!(?<=\d\.)\d) - fail the match if the period just matched sits between two digits
" ?" - an optional space
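A lookahead is tested at the position where it appears in the pattern. In the attempt from the question, (?!\d\.\d) is therefore tested with the period itself as the next character, so it can never reject anything. The sketch below contrasts the two patterns on a made-up string; the names ATTEMPT, FIXED, and sample are assumptions used only for illustration.

```python
import re

# Pattern from the question: both lookaheads are tested AT the period,
# so (?!\d\.\d) always succeeds (the next character is '.', not a digit).
ATTEMPT = r"(?!\d\.\d)(?!\. )\."

# Pattern from the answer: once the period is consumed, look back for
# "digit + period" and ahead for a digit; reject only if both hold.
FIXED = r"\.(?!(?<=\d\.)\d) ?"

sample = "pi is 3.14.ok"  # hypothetical input

attempt_result = re.sub(ATTEMPT, ". ", sample)
fixed_result = re.sub(FIXED, ". ", sample)

print(attempt_result)  # pi is 3. 14. ok
print(fixed_result)    # pi is 3.14. ok
```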
Python demo:
import re
text = "this is a.match\nthis should also match.1234\nand this should 123.match\n\nthis should NOT match. Has space after period\nthis also should NOT match 1.23"
print(re.sub(r'\.(?!(?<=\d\.)\d) ?', '. ', text))
Output:
this is a. match
this should also match. 1234
and this should 123. match

this should NOT match. Has space after period
this also should NOT match 1.23
Alternatively, use the (?! ) lookahead, as in your attempt:
re.sub(r'\.(?!(?<=\d\.)\d)(?! )', '. ', text)
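To see how the two variants relate, the sketch below runs both on the sample text from the question. The first consumes an optional trailing space and writes it back; the second refuses to match when a space follows. The helper names fix_consuming and fix_lookahead are hypothetical.

```python
import re

def fix_consuming(text: str) -> str:
    # Variant 1: match the period plus an optional space, write back ". ".
    return re.sub(r"\.(?!(?<=\d\.)\d) ?", ". ", text)

def fix_lookahead(text: str) -> str:
    # Variant 2: match only periods that are NOT already followed by a space.
    return re.sub(r"\.(?!(?<=\d\.)\d)(?! )", ". ", text)

sample = (
    "this is a.match\n"
    "this should also match.1234\n"
    "and this should 123.match\n"
    "this should NOT match. Has space after period\n"
    "this also should NOT match 1.23"
)

# Both variants should agree on the sample text.
print(fix_consuming(sample) == fix_lookahead(sample))
print(fix_lookahead(sample))
```

Note that both variants also add a space after a period at the very end of the string ("end." becomes "end. "), so the result may need an rstrip() if that matters.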
Another way. I am not sure whether this performs better or worse than Wiktor's solution.
re.sub(r"(?!\d\.\d)(?!.\. )(.\.)(.)", r"\1 \2", my_string)
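The performance question can be checked empirically with timeit. The following is only a sketch: the function names, the repeat counts, and the enlarged input are assumptions, and no timing result is claimed here because it depends on the machine and the input.

```python
import re
import timeit

# Lookaround pattern from the first answer.
LOOKAROUND = re.compile(r"\.(?!(?<=\d\.)\d) ?")
# Capture-group pattern from this answer.
CAPTURE = re.compile(r"(?!\d\.\d)(?!.\. )(.\.)(.)")

def with_lookaround(text: str) -> str:
    return LOOKAROUND.sub(". ", text)

def with_capture(text: str) -> str:
    return CAPTURE.sub(r"\1 \2", text)

sample = (
    "this is a.match\n"
    "this should also match.1234\n"
    "and this should 123.match\n"
    "this should NOT match. Has space after period\n"
    "this also should NOT match 1.23"
)

def compare(number: int = 200) -> dict:
    # Time both functions on a larger (hypothetical) input.
    big = (sample + "\n") * 100
    return {
        "lookaround": timeit.timeit(lambda: with_lookaround(big), number=number),
        "capture": timeit.timeit(lambda: with_capture(big), number=number),
    }

if __name__ == "__main__":
    print(with_lookaround(sample) == with_capture(sample))
    for name, seconds in compare().items():
        print(f"{name}: {seconds:.4f}s")
```

The two patterns agree on the sample text but not everywhere: the capture-group version consumes one character on each side of the period, so in "a.b.c" it fixes only the first period, and it never touches a period at the end of a line.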
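import re

txt = "hello world.this is boise idaho.a this is twin falls."
# Find word.word pairs where the period has no space after it.
pattern = r"(\w+\s*\.\w+)+"
matches = re.findall(pattern, txt)
for item in matches:
    front, back = item.split('.')
    replace = front + '. ' + back
    # Replace the literal text; passing item to re.sub would treat '.' as a wildcard.
    txt = txt.replace(item, replace)
print(txt)
Output:
hello world. this is boise idaho. a this is twin falls.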
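One caveat with the loop approach: \w also matches digits, so the pattern picks up decimals such as 1.23, which the question wants left alone. Below is a minimal sketch of that behaviour, with the loop wrapped in a hypothetical helper add_space_loop (using str.replace for the literal substitution) and run on a made-up string, next to the lookaround pattern from the earlier answer.

```python
import re

def add_space_loop(txt: str) -> str:
    # Loop-based approach: find word.word pairs and rewrite each one.
    for item in re.findall(r"(\w+\s*\.\w+)+", txt):
        front, back = item.split(".")
        txt = txt.replace(item, front + ". " + back)
    return txt

def add_space_lookaround(txt: str) -> str:
    # Lookaround approach: skip periods that sit between two digits.
    return re.sub(r"\.(?!(?<=\d\.)\d) ?", ". ", txt)

sample = "it costs 1.23 today.ok"  # hypothetical input with a decimal

print(add_space_loop(sample))        # it costs 1. 23 today. ok
print(add_space_lookaround(sample))  # it costs 1.23 today. ok
```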