Long regex pattern not working as intended



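My regex pattern doesn't seem to work in Python. This column comes from a spreadsheet and is comma-delimited; between the commas there are also pipes (|) separating things. I'm not worried about the pipes, though. I need to use re.split() to split the string on the commas, but as you'll notice in the sample, a user typed commas into the first item, before the first |, so I'm using a regex to establish the pattern to look for. However, it isn't working properly, and as a beginner I could use another pair of eyes. I built and ran the regex through Regex101 to help me, and the explanation seems right, but it still doesn't return the number of matches I expect.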

My regex pattern

".+\s\|\s\d\d\s\|\s\d\d\d\d\s\|\s\d\d\d\d\s\|\s.{2}\d\d\d\d\s\|\s\d+?\.\d+?,"gm

My sample test string

ICS: Basic Maintenance | 30 | 5877 | 0000 | IT0000 | 12000.0,ICS: E-Rate discount (85%) | 30 | 5877 | 0000 | IT0000 | -10200.0,ICS: Basic Maintenance | 40 | 5877 | 0000 | IT0000 | 9000.0,ICMS: E-Rate discount (85%) | 40 | 5877 | 0000 | IT0000 | -7650.0,ICS: Basic Maintenance | 20 | 5877 | 0000 | IT0000 | 13500.0,ICS: E-Rate discount (85%) | 20 | 5877 | 0000 | IT0000 | -11475.0,ICCMS: Basic Maintenance | 70 | 5877 | 0000 | IT0000 | 12000.0,ICCMS: E-Rate discount (85%) | 70 | 5877 | 0000 | IT0000 | -10200.0,ITSM: Laptops, Desktops, Computers | 30 | 4400 | IT0000 | 720400.0

Matches I expect: 9 matches

Matches I get: 1 match (0-443). My matches exported from Regex101:

"
.+\s\|\s\d\d\s\|\s\d\d\d\d\s\|\s\d\d\d\d\s\|\s.{2}\d\d\d\d\s\|\s\d+?\.\d+?,
"
gm
. matches any character (except for line terminators)
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
\s matches any whitespace character (equivalent to [\r\n\t\f\v ])
\| matches the character | literally (case sensitive)
\s matches any whitespace character (equivalent to [\r\n\t\f\v ])
\d matches a digit (equivalent to [0-9])
\d matches a digit (equivalent to [0-9])
\s matches any whitespace character (equivalent to [\r\n\t\f\v ])
\| matches the character | literally (case sensitive)
\s matches any whitespace character (equivalent to [\r\n\t\f\v ])
\d matches a digit (equivalent to [0-9])
\d matches a digit (equivalent to [0-9])
\d matches a digit (equivalent to [0-9])
\d matches a digit (equivalent to [0-9])
\s matches any whitespace character (equivalent to [\r\n\t\f\v ])
\| matches the character | literally (case sensitive)
\s matches any whitespace character (equivalent to [\r\n\t\f\v ])
\d matches a digit (equivalent to [0-9])
\d matches a digit (equivalent to [0-9])
\d matches a digit (equivalent to [0-9])
\d matches a digit (equivalent to [0-9])
\s matches any whitespace character (equivalent to [\r\n\t\f\v ])
\| matches the character | literally (case sensitive)
\s matches any whitespace character (equivalent to [\r\n\t\f\v ])
. matches any character (except for line terminators)
{2} matches the previous token exactly 2 times
\d matches a digit (equivalent to [0-9])
\d matches a digit (equivalent to [0-9])
\d matches a digit (equivalent to [0-9])
\d matches a digit (equivalent to [0-9])
\s matches any whitespace character (equivalent to [\r\n\t\f\v ])
\| matches the character | literally (case sensitive)
\s matches any whitespace character (equivalent to [\r\n\t\f\v ])
\d matches a digit (equivalent to [0-9])
+? matches the previous token between one and unlimited times, as few times as possible, expanding as needed (lazy)
\. matches the character . literally (case sensitive)
\d matches a digit (equivalent to [0-9])
+? matches the previous token between one and unlimited times, as few times as possible, expanding as needed (lazy)
, matches the character , literally (case sensitive)
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
0-443   ICS: Basic Maintenance | 30 | 5877 | 0000 | IT0000 | 12000.0,ICS: E-Rate discount (85%) | 30 | 5877 ...

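Looking at the data, if you're not worried about the pipes and you want 9 matches, you can use re.findall to match all the values instead of splitting, and shorten the pattern:

\w+:.*?\b\d+(?:\.\d+)(?=,|$)
  • \w+: matches 1+ word characters followed by :
  • .*? matches any character, as few times as possible
  • \b\d+(?:\.\d+) a word boundary, then 1+ digits followed by a decimal part (required as written; append ? to the group to make it optional)
  • (?=,|$) asserts a comma or the end of the string to the right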

Regex demo | Python demo

import re
from pprint import pprint
pattern = r"\w+:.*?\b\d+(?:\.\d+)(?=,|$)"
s = "ICS: Basic Maintenance | 30 | 5877 | 0000 | IT0000 | 12000.0,ICS: E-Rate discount (85%) | 30 | 5877 | 0000 | IT0000 | -10200.0,ICS: Basic Maintenance | 40 | 5877 | 0000 | IT0000 | 9000.0,ICMS: E-Rate discount (85%) | 40 | 5877 | 0000 | IT0000 | -7650.0,ICS: Basic Maintenance | 20 | 5877 | 0000 | IT0000 | 13500.0,ICS: E-Rate discount (85%) | 20 | 5877 | 0000 | IT0000 | -11475.0,ICCMS: Basic Maintenance | 70 | 5877 | 0000 | IT0000 | 12000.0,ICCMS: E-Rate discount (85%) | 70 | 5877 | 0000 | IT0000 | -10200.0,ITSM: Laptops, Desktops, Computers | 30 | 4400 | IT0000 | 720400.0"
pprint(re.findall(pattern, s))

Output

['ICS: Basic Maintenance | 30 | 5877 | 0000 | IT0000 | 12000.0',
'ICS: E-Rate discount (85%) | 30 | 5877 | 0000 | IT0000 | -10200.0',
'ICS: Basic Maintenance | 40 | 5877 | 0000 | IT0000 | 9000.0',
'ICMS: E-Rate discount (85%) | 40 | 5877 | 0000 | IT0000 | -7650.0',
'ICS: Basic Maintenance | 20 | 5877 | 0000 | IT0000 | 13500.0',
'ICS: E-Rate discount (85%) | 20 | 5877 | 0000 | IT0000 | -11475.0',
'ICCMS: Basic Maintenance | 70 | 5877 | 0000 | IT0000 | 12000.0',
'ICCMS: E-Rate discount (85%) | 70 | 5877 | 0000 | IT0000 | -10200.0',
'ITSM: Laptops, Desktops, Computers | 30 | 4400 | IT0000 | 720400.0']

If you must use re.split, you can use a capturing group to keep the split values and split on the comma.

A full pattern that also matches the pipes:

import re
from pprint import pprint
pattern = r"(\w+:[^|]+\|\s\d\d\s\|(?:\s\d{4}\s\|){2}\s.{2}\d{4}\s\|\s-?\d+(?:\.\d+)?),"
s = "ICS: Basic Maintenance | 30 | 5877 | 0000 | IT0000 | 12000.0,ICS: E-Rate discount (85%) | 30 | 5877 | 0000 | IT0000 | -10200.0,ICS: Basic Maintenance | 40 | 5877 | 0000 | IT0000 | 9000.0,ICMS: E-Rate discount (85%) | 40 | 5877 | 0000 | IT0000 | -7650.0,ICS: Basic Maintenance | 20 | 5877 | 0000 | IT0000 | 13500.0,ICS: E-Rate discount (85%) | 20 | 5877 | 0000 | IT0000 | -11475.0,ICCMS: Basic Maintenance | 70 | 5877 | 0000 | IT0000 | 12000.0,ICCMS: E-Rate discount (85%) | 70 | 5877 | 0000 | IT0000 | -10200.0,ITSM: Laptops, Desktops, Computers | 30 | 4400 | IT0000 | 720400.0"
pprint(list(filter(None, re.split(pattern, s))))

Output

['ICS: Basic Maintenance | 30 | 5877 | 0000 | IT0000 | 12000.0',
'ICS: E-Rate discount (85%) | 30 | 5877 | 0000 | IT0000 | -10200.0',
'ICS: Basic Maintenance | 40 | 5877 | 0000 | IT0000 | 9000.0',
'ICMS: E-Rate discount (85%) | 40 | 5877 | 0000 | IT0000 | -7650.0',
'ICS: Basic Maintenance | 20 | 5877 | 0000 | IT0000 | 13500.0',
'ICS: E-Rate discount (85%) | 20 | 5877 | 0000 | IT0000 | -11475.0',
'ICCMS: Basic Maintenance | 70 | 5877 | 0000 | IT0000 | 12000.0',
'ICCMS: E-Rate discount (85%) | 70 | 5877 | 0000 | IT0000 | -10200.0',
'ITSM: Laptops, Desktops, Computers | 30 | 4400 | IT0000 | 720400.0']

Python demo
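If the full field-by-field pattern feels heavy, a lighter variant of the same re.split idea is to split only on commas that are immediately followed by a record label. This is a sketch of a different technique (a lookahead), not the pattern above, and it rests on an assumption: every record starts with a label like `ICS:` right after the comma, and no description contains a comma directly followed by `word:`. The `records` list and variable names are made up for illustration.

```python
import re

# The nine records from the question, joined with commas to rebuild the sample string.
records = [
    "ICS: Basic Maintenance | 30 | 5877 | 0000 | IT0000 | 12000.0",
    "ICS: E-Rate discount (85%) | 30 | 5877 | 0000 | IT0000 | -10200.0",
    "ICS: Basic Maintenance | 40 | 5877 | 0000 | IT0000 | 9000.0",
    "ICMS: E-Rate discount (85%) | 40 | 5877 | 0000 | IT0000 | -7650.0",
    "ICS: Basic Maintenance | 20 | 5877 | 0000 | IT0000 | 13500.0",
    "ICS: E-Rate discount (85%) | 20 | 5877 | 0000 | IT0000 | -11475.0",
    "ICCMS: Basic Maintenance | 70 | 5877 | 0000 | IT0000 | 12000.0",
    "ICCMS: E-Rate discount (85%) | 70 | 5877 | 0000 | IT0000 | -10200.0",
    "ITSM: Laptops, Desktops, Computers | 30 | 4400 | IT0000 | 720400.0",
]
s = ",".join(records)

# Split on a comma only when a label (word characters plus a colon) follows it.
# The commas in "Laptops, Desktops, Computers" are followed by a space, so they are kept.
parts = re.split(r",(?=\w+:)", s)

print(len(parts))  # 9
print(parts[-1])   # ITSM: Laptops, Desktops, Computers | 30 | 4400 | IT0000 | 720400.0
```

Because the lookahead consumes nothing, no capturing group or filter(None, ...) is needed here.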

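Instead of matching everything with the greedy .+\s at the start, use the lazy .*?\s

I also cleaned up all the repeated \d with quantifiers, e.g. {4}. This version additionally allows a minus sign with -? and drops the trailing comma:

.*?\s\|\s\d{2}\s\|\s\d{4}\s\|\s\d{4}\s\|\s.{2}\d{4}\s\|\s-?\d+?\.\d+?

That gives eight matches: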

0-60    ICS: Basic Maintenance | 30 | 5877 | 0000 | IT0000 | 12000.0
60-126  ,ICS: E-Rate discount (85%) | 30 | 5877 | 0000 | IT0000 | -10200.0
126-186 ,ICS: Basic Maintenance | 40 | 5877 | 0000 | IT0000 | 9000.0
186-252 ,ICMS: E-Rate discount (85%) | 40 | 5877 | 0000 | IT0000 | -7650.0
252-313 ,ICS: Basic Maintenance | 20 | 5877 | 0000 | IT0000 | 13500.0
313-379 ,ICS: E-Rate discount (85%) | 20 | 5877 | 0000 | IT0000 | -11475.0
379-442 ,ICCMS: Basic Maintenance | 70 | 5877 | 0000 | IT0000 | 12000.0
442-510 ,ICCMS: E-Rate discount (85%) | 70 | 5877 | 0000 | IT0000 | -10200.0
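To check these counts in Python rather than in Regex101, here is a minimal sketch. It rebuilds the sample string from the nine records (the `records` list and the variable names are mine, not from the question) and compares the original greedy pattern with the lazy one using re.finditer.

```python
import re

# The nine records from the question, joined with commas to rebuild the sample string.
records = [
    "ICS: Basic Maintenance | 30 | 5877 | 0000 | IT0000 | 12000.0",
    "ICS: E-Rate discount (85%) | 30 | 5877 | 0000 | IT0000 | -10200.0",
    "ICS: Basic Maintenance | 40 | 5877 | 0000 | IT0000 | 9000.0",
    "ICMS: E-Rate discount (85%) | 40 | 5877 | 0000 | IT0000 | -7650.0",
    "ICS: Basic Maintenance | 20 | 5877 | 0000 | IT0000 | 13500.0",
    "ICS: E-Rate discount (85%) | 20 | 5877 | 0000 | IT0000 | -11475.0",
    "ICCMS: Basic Maintenance | 70 | 5877 | 0000 | IT0000 | 12000.0",
    "ICCMS: E-Rate discount (85%) | 70 | 5877 | 0000 | IT0000 | -10200.0",
    "ITSM: Laptops, Desktops, Computers | 30 | 4400 | IT0000 | 720400.0",
]
s = ",".join(records)

# Original pattern: greedy .+ swallows as much as it can before the tail matches,
# and there is no optional minus sign before the amount.
greedy = re.compile(
    r".+\s\|\s\d\d\s\|\s\d\d\d\d\s\|\s\d\d\d\d\s\|\s.{2}\d\d\d\d\s\|\s\d+?\.\d+?,"
)
# Revised pattern: lazy .*?, counted quantifiers, optional minus sign, no trailing comma.
lazy = re.compile(
    r".*?\s\|\s\d{2}\s\|\s\d{4}\s\|\s\d{4}\s\|\s.{2}\d{4}\s\|\s-?\d+?\.\d+?"
)

greedy_spans = [m.span() for m in greedy.finditer(s)]
lazy_spans = [m.span() for m in lazy.finditer(s)]

print(greedy_spans)                   # [(0, 443)]
print(len(lazy_spans))                # 8
print(lazy_spans[0], lazy_spans[-1])  # (0, 60) (442, 510)
```

The single greedy match ends at 443 because the seventh record is the last one with a positive amount followed by a comma; the eighth has a negative amount, which the original pattern cannot match.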

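The last record doesn't match because it has only one of the two \s\d{4}\s fields the pattern expects. The line below shows the record with the missing 0000 field added:

ITSM: Laptops, Desktops, Computers | 30 | 4400 | 0000 | IT0000 | 720400.0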
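If that field can legitimately be absent, one option is to make it optional in the pattern instead of fixing the data. That is an assumption about the spreadsheet, not something the question confirms. A rough sketch that accepts one or two four-digit fields with {1,2} (the `records` list and names are illustrative):

```python
import re

# The nine records from the question, joined with commas to rebuild the sample string.
records = [
    "ICS: Basic Maintenance | 30 | 5877 | 0000 | IT0000 | 12000.0",
    "ICS: E-Rate discount (85%) | 30 | 5877 | 0000 | IT0000 | -10200.0",
    "ICS: Basic Maintenance | 40 | 5877 | 0000 | IT0000 | 9000.0",
    "ICMS: E-Rate discount (85%) | 40 | 5877 | 0000 | IT0000 | -7650.0",
    "ICS: Basic Maintenance | 20 | 5877 | 0000 | IT0000 | 13500.0",
    "ICS: E-Rate discount (85%) | 20 | 5877 | 0000 | IT0000 | -11475.0",
    "ICCMS: Basic Maintenance | 70 | 5877 | 0000 | IT0000 | 12000.0",
    "ICCMS: E-Rate discount (85%) | 70 | 5877 | 0000 | IT0000 | -10200.0",
    "ITSM: Laptops, Desktops, Computers | 30 | 4400 | IT0000 | 720400.0",
]
s = ",".join(records)

# Same lazy approach, but the four-digit field may appear once or twice.
pattern = re.compile(
    r".*?\s\|\s\d{2}\s\|(?:\s\d{4}\s\|){1,2}\s.{2}\d{4}\s\|\s-?\d+\.\d+"
)

# Every match after the first begins with the separating comma, so strip it.
matches = [m.group().lstrip(",") for m in pattern.finditer(s)]

print(len(matches))  # 9
print(matches[-1])   # ITSM: Laptops, Desktops, Computers | 30 | 4400 | IT0000 | 720400.0
```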
