How do I remove digits from a string but keep specific groups of digits?



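I want to use a Python regular expression to remove the digits from a string while keeping the numbers 754 and 1231, because they refer to tax section code 754 and securities code 1231. For example, I have the following text data: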

test="""Dividends 9672
Dividends 9680
Interest Income
Ordinary Dividends
Royalties
Capital Gain Distributions
Income from Blackstone
Ordinary Income
Rental Income
Long Term Capital Gain
Short Term Capital Gain
1231 Gain
Section 754 Stock Basis Adjustment - 2015
M-1 Section 754 Stock Basis Adjustment - 2015
Section 754 Stock Basis Adjustment - 2018
M-1 Section 754 Stock basis adjustment - 2018
"""

and I would like the output to be:

Dividends
Dividends
Interest Income
Ordinary Dividends
Royalties
Capital Gain Distributions
Income from Blackstone
Ordinary Income
Rental Income
Long Term Capital Gain
Short Term Capital Gain
1231 Gain
Section 754 Stock Basis Adjustment
M- Section 754 Stock Basis Adjustment
Section 754 Stock Basis Adjustment
M- Section 754 Stock basis adjustment

My attempt is:

import re

test = re.sub(r'[^(754)(1231)A-Za-z]', '', test)
print(test)

but it does not treat 754 or 1231 as whole groups; it only removes the digits 6, 8 and 9.
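To see why the attempt behaves this way, it may help to run it on two of the sample lines. Inside square brackets the parentheses are ordinary characters, so the class is simply a set of single characters: `(`, `)`, the digits 1, 2, 3, 4, 5 and 7, and the ASCII letters. The sketch below uses a shortened string, and the names `sample` and `attempt` are illustrative only.

```python
import re

# Shortened sample; the variable names are illustrative, not from the question.
sample = "Dividends 9672\nSection 754 Stock Basis Adjustment - 2015"

# The class is a set of single characters, not a list of groups:
# everything except ( ) 1 2 3 4 5 7 and ASCII letters is removed,
# which includes spaces, newlines and hyphens.
attempt = re.sub(r'[^(754)(1231)A-Za-z]', '', sample)
print(attempt)
```

The digits 7 and 2 of `9672` and the digits 2, 1 and 5 of `2015` survive, because each of them appears somewhere inside the brackets.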

You can use

re.sub(r'(754|1231)|[^A-Za-z\s]', r'\1', text)

See the regex demo.

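Here, (754|1231) matches the digit sequence 754 or 1231 and captures it into group 1, while |[^A-Za-z\s] matches any character other than an ASCII letter or Unicode whitespace. Each match is replaced with the value of group 1, so whatever was captured stays in the string and everything else that matched is dropped.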
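As a minimal sketch of this approach on a few of the sample lines: the helper name `strip_digits_keep_codes`, the constant `KEEP_CODES`, and the per-line `rstrip()` clean-up are assumptions added for illustration and are not part of the answer itself.

```python
import re

# Hypothetical names; the pattern itself is the one from the answer.
KEEP_CODES = re.compile(r'(754|1231)|[^A-Za-z\s]')

def strip_digits_keep_codes(text):
    # Group 1 is written back when it matched. When the second alternative
    # matched, group 1 is unset and expands to an empty string
    # (this is the behaviour of Python 3.5 and later).
    cleaned = KEEP_CODES.sub(r'\1', text)
    # Added clean-up: drop the spaces left behind at the end of each line.
    return "\n".join(line.rstrip() for line in cleaned.split("\n"))

sample = "Dividends 9672\n1231 Gain\nM-1 Section 754 Stock Basis Adjustment - 2015"
print(strip_digits_keep_codes(sample))
```

Because `-` is neither a letter nor whitespace, the hyphen in `M-1` is removed as well, so that line starts with `M` rather than the `M-` shown in the desired output.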

Note: if 754 and 1231 must match only as exact numbers, and not as part of a longer number, use digit boundaries:

re.sub(r'(?<!\d)(754|1231)(?!\d)|[^A-Za-z\s]', r'\1', text)
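The difference between the two patterns only shows up when a kept code is embedded in a longer number. The following sketch compares them on a made-up string; the names `loose`, `strict` and `sample` are illustrative.

```python
import re

loose = r'(754|1231)|[^A-Za-z\s]'
strict = r'(?<!\d)(754|1231)(?!\d)|[^A-Za-z\s]'

# Made-up input: 7541 merely contains 754, the second 754 stands alone.
sample = "Code 7541 and 754"

loose_result = re.sub(loose, r'\1', sample)    # keeps the 754 inside 7541
strict_result = re.sub(strict, r'\1', sample)  # keeps only the standalone 754

print(repr(loose_result))
print(repr(strict_result))
```

With the lookarounds in place, `7541` is removed digit by digit, because none of its digits starts a 754 that has no digit on either side.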

You could write it this way:

rgx = r' *-? *(?<!\d)(?!(?:754|1231)(?!\d))\d+'
re.sub(rgx, '', test)

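Note that this removes the unwanted spaces and hyphens along with the digits. A longer number such as '7541' is also matched in full and replaced with an empty string.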
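A quick way to check this behaviour is to run the pattern on a few of the sample lines plus one made-up line containing `7541`. This is only a sketch; `sample` and `result` are illustrative names.

```python
import re

rgx = r' *-? *(?<!\d)(?!(?:754|1231)(?!\d))\d+'

# Three lines from the question's data plus a made-up "Code 7541" line.
sample = (
    "Dividends 9672\n"
    "1231 Gain\n"
    "M-1 Section 754 Stock Basis Adjustment - 2015\n"
    "Code 7541"
)

# Each unwanted number is removed together with the spaces and the
# optional hyphen that precede it; standalone 754 and 1231 are skipped.
result = re.sub(rgx, '', sample)
print(result)
```

Since the optional hyphen is part of the match, `M-1` becomes `M` here, not the `M-` shown in the desired output.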

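The regular expression can be broken down as follows (I have replaced the initial space with a character class containing a space so that it is visible):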

[ ]*-? *        # match >= 0 spaces, optionally followed by a hyphen,
                # followed by >= 0 spaces
(?<!\d)         # negative lookbehind asserts that the preceding character
                # is not a digit
(?!             # begin negative lookahead
  (?:754|1231)  #   match '754' or '1231'
  (?!\d)        #   negative lookahead asserts that the next character
                #   is not a digit
)               # end negative lookahead
\d+             # match >= 1 digits
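The same breakdown can be kept inside the code by compiling the pattern with `re.VERBOSE`, as in the sketch below. In verbose mode unescaped whitespace in the pattern is ignored, so every literal space has to be written as `[ ]`; the names `rgx_verbose`, `compact` and `sample` are illustrative.

```python
import re

# Commented form of the pattern. re.VERBOSE ignores plain whitespace,
# so literal spaces are written as [ ].
rgx_verbose = re.compile(r"""
    [ ]*-?[ ]*        # >= 0 spaces, an optional hyphen, >= 0 spaces
    (?<!\d)           # the preceding character is not a digit
    (?!               # begin negative lookahead
      (?:754|1231)    #   754 or 1231 ...
      (?!\d)          #   ... not followed by another digit
    )                 # end negative lookahead
    \d+               # >= 1 digits
""", re.VERBOSE)

compact = r' *-? *(?<!\d)(?!(?:754|1231)(?!\d))\d+'
sample = "Section 754 Stock Basis Adjustment - 2018\nDividends 9680"

verbose_result = rgx_verbose.sub('', sample)
compact_result = re.sub(compact, '', sample)
print(verbose_result == compact_result)
```

Both forms should give the same result on this input; the verbose form is only easier to read and maintain.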
