I am using a regex pattern to leave acronyms untouched while lowercasing the rest of the text.
Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import re

text = "This sentence contains ADS, NASA and K.A. as acronymns."
# Acronyms: letters that start and end with a capital (ADS, NASA), or dotted capitals (K.A.)
pattern = r'([A-Z][a-zA-Z]*[A-Z]|(?:[A-Z]\.)+)'
matches = re.findall(pattern, text)

def lowercase_ignore_matches(match):
    # Keep words found by the acronym pattern; lowercase everything else
    word = match.group()
    if word in matches:
        return word
    return word.lower()

text2 = re.sub(r"\w+", lowercase_ignore_matches, text)
print(text)
print(text2)
print(matches)
Output:
This sentence contains ADS, NASA and K.A. as acronymns.
this sentence contains ADS, NASA and k.a. as acronymns.
['ADS', 'NASA', 'K.A.']
The problem: why is K.A. lowercased to k.a. even though the pattern recognizes it as an acronym?
I want K.A. to stay as K.A.
Please help.
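The r"[\w.]+" solution works in this case, but it fails if the acronym is at the end of a line and followed by a dot (i.e. "[…] or ASDF."). Instead, we use the pattern to identify each acronym, then lowercase the whole string, and then replace the acronyms with their original values again.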
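To make the difference concrete, here is a small sketch that compares the two tokenizing patterns with the question's acronym pattern. The sample sentence and the names `acronyms` and `lowercase_unless_acronym` are made up for illustration. With `\w+`, "K.A." is split into "K" and "A", and neither token is in the list of matches; with `[\w.]+`, the sentence-final token becomes "ASDF." and no longer equals the match "ASDF".

```python
import re

# Hypothetical sample sentence ending in an acronym followed by a dot
text = "Known acronyms: NASA, K.A. or ASDF."
acronym_pattern = r'([A-Z][a-zA-Z]*[A-Z]|(?:[A-Z]\.)+)'
acronyms = re.findall(acronym_pattern, text)
# acronyms == ['NASA', 'K.A.', 'ASDF']

def lowercase_unless_acronym(match):
    # Keep a token only if it is exactly one of the detected acronyms
    word = match.group()
    return word if word in acronyms else word.lower()

# \w+ never matches a dot, so "K.A." arrives as two tokens: "K" and "A"
word_tokens = re.findall(r"\w+", text)
# [\w.]+ keeps "K.A." together but also glues the final dot onto "ASDF."
dotted_tokens = re.findall(r"[\w.]+", text)

with_word_tokens = re.sub(r"\w+", lowercase_unless_acronym, text)
with_dotted_tokens = re.sub(r"[\w.]+", lowercase_unless_acronym, text)

print(with_word_tokens)    # known acronyms: NASA, k.a. or ASDF.
print(with_dotted_tokens)  # known acronyms: NASA, K.A. or asdf.
```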
I changed the pattern slightly so that it also supports acronyms like "eFUEL".
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import re

text = "This sentence contains ADS, NASA and K.A. as acronymns. eFUEL or ASDF."
# Optional leading letters, then two or more capitals optionally separated by dots,
# then optional trailing letters
pattern = r'\b([A-Za-z]*?[A-Z](?:\.?[A-Z])+[A-Za-z]*)'
# Find all matches of the pattern in the text
matches = re.findall(pattern, text)
print(matches)
# Make everything lowercase
text2 = text.lower()
# Replace each match with its original uppercase version
for match in matches:
    text2 = text2.replace(match.lower(), match)
print(text2)
The result is:
['ADS', 'NASA', 'K.A', 'eFUEL', 'ASDF']
this sentence contains ADS, NASA and K.A. as acronymns. eFUEL or ASDF.
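One caveat with the replace step: `str.replace` restores an acronym wherever its lowercase form occurs, including inside longer words. The following is only a sketch of a single-pass variant, not part of the original answer. It assumes the acronym pattern is meant to be read with its backslashes (`\b`, `\.`) and with `[A-Za-z]` letter classes; `ACRONYM`, `TOKEN`, `lowercase_then_restore` and `lowercase_except_acronyms` are names introduced here for illustration.

```python
import re

# Assumed reading of the acronym pattern (no capture group needed here)
ACRONYM = r"\b[A-Za-z]*?[A-Z](?:\.?[A-Z])+[A-Za-z]*"
# Try the acronym first; otherwise consume one whole ordinary word
TOKEN = re.compile(r"(" + ACRONYM + r")|\w+")

def lowercase_then_restore(text):
    # Two-step approach: lowercase everything, then put the acronyms back
    lowered = text.lower()
    for acronym in re.findall(ACRONYM, text):
        lowered = lowered.replace(acronym.lower(), acronym)
    return lowered

def lowercase_except_acronyms(text):
    # Single pass: group 1 is set only when the acronym alternative matched
    def keep_or_lower(match):
        if match.group(1) is not None:
            return match.group(1)
        return match.group().lower()
    return TOKEN.sub(keep_or_lower, text)

print(lowercase_then_restore("Roads near ADS"))     # roADS near ADS
print(lowercase_except_acronyms("Roads near ADS"))  # roads near ADS
```

Because each word is decided where it stands, "ads" inside "roads" is never touched, and the trailing dot after "K.A" or "ASDF" is left as it is.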