Lowercasing text with a regex pattern while keeping acronyms



I am using a regex pattern to leave acronyms untouched while lowercasing the rest of the text.

Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import codecs
import os
import re

text = "This sentence contains ADS, NASA and K.A. as acronymns."

# An acronym is either a word that starts and ends with a capital letter,
# or a run of "capital letter + period" pairs such as "K.A."
pattern = r'([A-Z][a-zA-Z]*[A-Z]|(?:[A-Z]\.)+)'
matches = re.findall(pattern, text)

def lowercase_ignore_matches(match):
    # Keep the word as-is if it was recognized as an acronym
    word = match.group()
    if word in matches:
        return word
    return word.lower()

# Run the callback on every word in the text
text2 = re.sub(r"\w+", lowercase_ignore_matches, text)
print(text)
print(text2)
matches = re.findall(pattern, text)
print(matches)

Output:
This sentence contains ADS, NASA and K.A. as acronymns.
this sentence contains ADS, NASA and k.a. as acronymns.
['ADS', 'NASA', 'K.A.']

The question is why K.A. gets lowercased to k.a. even though the pattern recognizes it as an acronym.

I want K.A. to stay K.A.

Please help.

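The r"[\w.]+" solution works in this case, but it breaks when an acronym sits at the end of a sentence and is followed by a period (e.g. "[…] or ASDF."), because the trailing period becomes part of the token. Instead, we use the pattern to find every acronym, lowercase the whole string, and then put each acronym back in its original form.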
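To make the tokenization issue concrete, here is a small sketch using the pattern from the question on a shortened example sentence (the sentence and the helper name lowercase_unless_acronym are made up for illustration). It shows that \w+ splits "K.A." into separate letters, while [\w.]+ keeps it whole but glues the sentence-final period onto "ASDF":

```python
import re

# Acronym pattern from the question
acronym_pattern = r'([A-Z][a-zA-Z]*[A-Z]|(?:[A-Z]\.)+)'
text = "This sentence contains K.A. and ends with ASDF."
acronyms = re.findall(acronym_pattern, text)

# \w never matches ".", so "K.A." is seen as the two tokens "K" and "A"
word_tokens = re.findall(r"\w+", text)

# [\w.]+ keeps "K.A." together, but also turns "ASDF" into "ASDF."
dotted_tokens = re.findall(r"[\w.]+", text)

def lowercase_unless_acronym(match):
    # Hypothetical helper: keep a token only if it equals a found acronym
    word = match.group()
    return word if word in acronyms else word.lower()

result = re.sub(r"[\w.]+", lowercase_unless_acronym, text)

print(acronyms)       # ['K.A.', 'ASDF']
print(word_tokens)    # [..., 'K', 'A', ..., 'ASDF']
print(dotted_tokens)  # [..., 'K.A.', ..., 'ASDF.']
print(result)         # this sentence contains K.A. and ends with asdf.
```

Neither token "K" nor token "ASDF." is in the list of acronyms, which is why each tokenization lowercases one of them.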

I changed the pattern slightly so that it also supports acronyms such as "eFUEL".

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import codecs
import os
import re

text = "This sentence contains ADS, NASA and K.A. as acronymns. eFUEL or ASDF."

# Optional leading letters, then at least two capitals that may be
# separated by single periods, then optional trailing letters
pattern = r'\b([A-z]*?[A-Z](?:\.?[A-Z])+[A-z]*)'

# Find all matches of the pattern in the text
matches = re.findall(pattern, text)
print(matches)

# Make everything lowercase
text2 = text.lower()

# Replace each lowercased match with its original version
for match in matches:
    text2 = text2.replace(match.lower(), match)
print(text2)

The result is:

['ADS', 'NASA', 'K.A', 'eFUEL', 'ASDF']
this sentence contains ADS, NASA and K.A. as acronymns. eFUEL or ASDF.
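One caveat with the lowercase-then-restore approach: str.replace substitutes every occurrence of the lowercased acronym, including occurrences inside longer words (for example "ads" inside "roads"). The following is a minimal sketch of a span-based variant, assuming the same pattern; the function name lowercase_except_acronyms and the second test sentence are made up for illustration. It lowercases only the text between matches, so nothing needs to be restored afterwards:

```python
import re

# Same acronym pattern as above, compiled once
ACRONYM = re.compile(r'\b([A-z]*?[A-Z](?:\.?[A-Z])+[A-z]*)')

def lowercase_except_acronyms(text):
    """Lowercase everything except the spans matched by ACRONYM."""
    pieces = []
    pos = 0
    for m in ACRONYM.finditer(text):
        pieces.append(text[pos:m.start()].lower())  # text before the acronym
        pieces.append(m.group())                    # acronym, unchanged
        pos = m.end()
    pieces.append(text[pos:].lower())               # remaining tail
    return "".join(pieces)

text = "This sentence contains ADS, NASA and K.A. as acronymns. eFUEL or ASDF."
print(lowercase_except_acronyms(text))
# this sentence contains ADS, NASA and K.A. as acronymns. eFUEL or ASDF.

print(lowercase_except_acronyms("Roads lead to ADS."))
# roads lead to ADS.
```

With the replace-based loop, the second sentence would come out as "roADS lead to ADS.", because "ads" also occurs inside "roads".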
