Given the following string:
Which one of following pairs of gases is the major cause of greenhouse effect?
A. ( C O_{2} ) and ( O_{3} )
в. ( C O_{2} ) and ( C O )
c. ( C F C ) and ( S O_{2} )
D. ( C O_{2} ) and ( N_{2} O )
I would like to get something like this:
Which one of following pairs of gases is the major cause of greenhouse effect?
A. ( CO2 ) and ( O3 )
в. ( CO2 ) and ( CO )
c. ( CFC ) and ( SO2 )
D. ( CO2 ) and ( N2O )
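I used re.sub('[A-Z]_{[0-9]}', '<CHEM>', text)
as an experiment, to see whether I could join the two parts. How can I join the whole formula together? The elements are separated by a single space, and each element may or may not be capitalized and consists of one or more letters. It might look like: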
( Na Cl_{2} ) and ( Fe k_{3} cl )
->( NaCl2 ) and ( Fek3cl )
You can use capture groups with re.sub:
re.sub(r'([A-Z][a-z]?)(_\{([0-9]+)\})? *', r'\1\3', text)
If you want to keep the whitespace after the last element, you can use
re.sub(r'([A-Z][a-z]?)(_\{([0-9]+)\})?( *(?=[A-Z]))?', r'\1\3', text)
Explanation:
([A-Z][a-z]?)(_\{([0-9]+)\})? *
([A-Z][a-z]?)     # Matches an element symbol and captures it in group 1.
(_\{([0-9]+)\})?  # Matches an optional subscript and captures the number in group 3.
 *                # Matches the trailing spaces, which causes them to be removed.
( *(?=[A-Z]))?    # Alternatively, matches the spaces only if a capital letter follows, so they are removed only when another element comes next.
The replacement string r'\1\3' writes back only group 1 and group 3, which drops the underscore and the braces.
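To make the difference between the two patterns concrete, here is a minimal sketch that runs both on one of the strings from the question. The variable names (pattern_drop, pattern_keep, drop_all, keep_last) are only illustrative, and the sketch assumes Python 3.5 or later, where an unmatched group in the replacement string is replaced with an empty string.

```python
import re

text = "( C O_{2} ) and ( N_{2} O )"

# Variant 1: always remove the spaces that follow an element.
pattern_drop = r'([A-Z][a-z]?)(_\{([0-9]+)\})? *'
# Variant 2: remove the spaces only when another capital letter follows.
pattern_keep = r'([A-Z][a-z]?)(_\{([0-9]+)\})?( *(?=[A-Z]))?'

drop_all = re.sub(pattern_drop, r'\1\3', text)
keep_last = re.sub(pattern_keep, r'\1\3', text)

print(drop_all)   # ( CO2) and ( N2O)
print(keep_last)  # ( CO2 ) and ( N2O )
```

Note that both patterns require a symbol to start with a capital letter, so lowercase tokens such as k_{3} and cl in the second example from the question are left as they are.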
You could write
rgx = r'(?<!\()[ _{}](?=[ A-Z\d _{}]* \))'
re.sub(rgx, '', text)
The regular expression can be broken down as follows.
(?<! # begin a negative lookbehind
\( # match '('
) # end negative lookbehind
[ _{}] # match a character in the char class
(?= # begin a positive lookahead
[ A-Z\d _{}]* # match zero or more characters in the char class
[ ]\) # match ' )'
) # end positive lookahead
I put the space character in a character class ([ ]) only to make it visible.
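As a quick check, the lookaround pattern can be run on one of the strings from the question. The character class in the lookahead only allows capital letters, so the second pattern below, rgx_lower, is an assumed extension (not part of the answer) that also admits lowercase letters in order to cover symbols such as Na or cl. Treat it as a sketch rather than a general solution.

```python
import re

# Pattern from the answer: remove a space, underscore or brace that is not
# directly after "(" and is followed only by allowed characters up to " )".
rgx = r'(?<!\()[ _{}](?=[ A-Z\d_{}]* \))'

# Assumed extension (not in the answer): also allow lowercase letters.
rgx_lower = r'(?<!\()[ _{}](?=[ A-Za-z\d_{}]* \))'

first = re.sub(rgx, '', "( C O_{2} ) and ( O_{3} )")
second = re.sub(rgx_lower, '', "( Na Cl_{2} ) and ( Fe k_{3} cl )")

print(first)   # ( CO2 ) and ( O3 )
print(second)  # ( NaCl2 ) and ( Fek3cl )
```

One caveat of the lowercase variant: because ordinary words now fit the lookahead as well, it would also join plain words that are followed by " )" with no "(" in between.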
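You can use the following, which assumes the formulas are wrapped in the delimiters \( and \), as in the text variable below:
import re
text = r"\( Na Cl_{2} \) and \( Fe k_{3} cl \)"
print(re.sub(r'\\\(\s*([^()]*?)\s*\\\)', lambda x: f'\\( {"".join(c for c in x.group(1) if c.isalnum())} \\)', text))
Details:
\\ - a \ character
\( - a ( character
\s* - zero or more whitespace characters
([^()]*?) - Group 1: zero or more characters other than ) and (, as few as possible
\s*\\\) - zero or more whitespace characters, then the string \)
The replacement, lambda x: f'\\( {"".join(c for c in x.group(1) if c.isalnum())} \\)', replaces each match with \( and a space, then Group 1 with every non-alphanumeric character removed, then a space and \).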
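For input that uses bare parentheses, as in the strings shown in the question, the same callback idea might be sketched as follows. The helper name join_formula is hypothetical, and the pattern assumes that formulas never contain nested parentheses.

```python
import re

def join_formula(match):
    # Keep only the letters and digits found between the parentheses.
    inner = "".join(c for c in match.group(1) if c.isalnum())
    return f"( {inner} )"

text = "( Na Cl_{2} ) and ( Fe k_{3} cl )"
result = re.sub(r'\(\s*([^()]*?)\s*\)', join_formula, text)

print(result)  # ( NaCl2 ) and ( Fek3cl )
```

Because the callback filters characters instead of matching element symbols, it treats uppercase and lowercase tokens alike.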