Regex OR not working - I don't know what is wrong with my pattern

I have the following strings:

2020-10-2125Chavez and Sons
2020-05-02Bean Inc
NaNRobinson, Mcmahon and Atkins
2020-04-25Hill-Fisher
2020-04-02Nothing and Sons
52457Carpenter and Sons
0Carpenter and Sons
Carpenter and Sons
NoneEconomy and Sons
2020-04-02

I want to split them with:

myRegex = '^([-\d]{0,}|[NnaAOoEe]{0,})(.*)' or '^([0-9]{4}-[0-9]{2}-[0-9]{2,}|[\d]{0,}|[NnaAOoEe]{0,})([\D]{0,})$'

I want all the digits, or an exact match of na, nan or none (case-insensitive), or "" in the first group, like this:

[2020-10-2125][Chavez and Sons]
[2020-05-02][Bean Inc]
[NaN][Robinson, Mcmahon and Atkins]
[2020-04-25][Hill-Fisher]
[2020-04-02][Nothing and Sons]
[52457][Carpenter and Sons]
[0][Carpenter and Sons]
[][Carpenter and Sons]
[None][Economy and Sons]
[2020-04-02][]

This is wrong:

[2020-04-02No][thing and Sons]

I want:

[2020-04-02][Nothing and Sons]

How do I write a regex that checks for an exact match such as "none", case-insensitively (it should also recognize "None", "nOne", etc.)?

https://regex101.com/r/HvnZ47/3

How about the following, with re.I:

(None|NaN?|[-\d]+)?(.*)

https://regex101.com/r/d4XPPb/3

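Explanation:

(None|NaN?|[-\d]+)?
- either None
- or NaN, where the final N is optional because of the ?, so it also matches NA
- or digits and dashes, one or more times
- because of the trailing ?, the whole group () is optional, meaning it may not be there at all
(.*) any characters up to the end of the line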
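
As a minimal sketch, the pattern can be tried from Python like this. The names PATTERN and split_line are made up for illustration; re.I is the case-insensitive flag mentioned above.

```python
import re

# Pattern from the answer, compiled case-insensitively (re.I).
PATTERN = re.compile(r"(None|NaN?|[-\d]+)?(.*)", re.I)


def split_line(line):
    """Return (group1, group2); group1 is '' when the optional group is absent."""
    m = PATTERN.match(line)
    # Every part of the pattern is optional, so match() succeeds on any
    # single-line string and m is never None here.
    return (m.group(1) or "", m.group(2))


for sample in ["2020-04-02Nothing and Sons",
               "NaNRobinson, Mcmahon and Atkins",
               "Carpenter and Sons",
               "2020-04-02"]:
    g1, g2 = split_line(sample)
    print(f"[{g1}][{g2}]")
```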

However, there may still be edge cases. Consider the following:

National Geographic
---Test

These would be parsed as:

[Na][tional Geographic]
[---][Test]
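
One way to narrow the first of these edge cases while staying with a regex is to require an uppercase letter right after the keyword, using a lookahead. This is only a sketch under that assumption, not something checked against the full data set. It uses a scoped inline flag, (?i:...), available from Python 3.6, so that only the keyword itself is matched case-insensitively. The names STRICT and split_strict are made up for illustration.

```python
import re

# Keyword (case-insensitive) must be followed by an uppercase letter;
# otherwise fall back to a run of digits and dashes, or to no prefix at all.
STRICT = re.compile(r"^((?i:none|nan?)(?=[A-Z])|[-\d]+)?(.*)$")


def split_strict(line):
    """Return (group1, group2); group1 is '' when no prefix is recognized."""
    m = STRICT.match(line)
    return (m.group(1) or "", m.group(2))


for sample in ["National Geographic", "NAEconomy and Sons", "---Test"]:
    g1, g2 = split_strict(sample)
    print(f"[{g1}][{g2}]")
```

A leading run of dashes such as ---Test is still captured as group 1 with this variant.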

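Alternative:

From here we could keep making the regex more complex, but I think implementing custom parsing without a regex would be much simpler. Loop over the characters of each line and:

- If the line starts with a digit, parse all digits and dashes into group 1 and the rest into group 2 (i.e. switch groups as soon as you hit any other character).
- Take the first 4 characters of the string and, if they are "None", split them off. Also make sure the 5th character is uppercase (case-insensitive check: line[:4].lower() == "none" and line[4].isupper()).
- Do the same as in the previous step, but for NA and NaN:
  - line[:3].lower() == "nan" and line[3].isupper()
  - line[:2].lower() == "na" and line[2].isupper()

The above should produce more accurate results, and it should also be easier to read.

Example code: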

with open("/tmp/data") as f:
    lines = f.readlines()

results = []
for line in lines:
    # Strip surrounding whitespace, including the trailing \n
    line = line.strip()
    if not line:
        # Skip blank lines so that line[0] below cannot raise IndexError
        continue
    if line[0].isdigit() or line[0] == "-":
        # Group 1 is the leading run of digits and dashes
        i = 0
        while i < len(line) and (line[i].isdigit() or line[i] == "-"):
            i += 1
        results.append((line[:i], line[i:]))
    # Slices such as line[4:5] return "" on short lines instead of raising
    elif line[:4].lower() == "none" and line[4:5].isupper():
        results.append((line[:4], line[4:]))
    elif line[:3].lower() == "nan" and line[3:4].isupper():
        results.append((line[:3], line[3:]))
    elif line[:2].lower() == "na" and line[2:3].isupper():
        results.append((line[:2], line[2:]))
    else:
        # Assume group1 is missing! Everything is group2
        results.append((None, line))

for g1, g2 in results:
    print(f"[{g1 or ''}][{g2}]")

Data:

$ cat /tmp/data 
2020-10-2125Chavez and Sons
2020-05-02Bean Inc
NaNRobinson, Mcmahon and Atkins
2020-04-25Hill-Fisher
2020-04-02Nothing and Sons
52457Carpenter and Sons
0Carpenter and Sons
Carpenter and Sons
NoneEconomy and Sons
NoNeEconomy and Sons
2020-04-02
NAEconomy and Sons
---Test
National Geographic

Output:

$ python ~/tmp/so.py 
[2020-10-2125][Chavez and Sons]
[2020-05-02][Bean Inc]
[NaN][Robinson, Mcmahon and Atkins]
[2020-04-25][Hill-Fisher]
[2020-04-02][Nothing and Sons]
[52457][Carpenter and Sons]
[0][Carpenter and Sons]
[][Carpenter and Sons]
[None][Economy and Sons]
[NoNe][Economy and Sons]
[2020-04-02][]
[NA][Economy and Sons]
[---][Test]
[][National Geographic]

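You can combine the expressions you want to match with a simple |, but remember that the engine always prefers the first alternative that can match; so you want to put the more specific patterns first and only then fall back to the more general cases.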
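
The effect of the ordering can be seen in a small sketch. The sample string is taken from the question; the variable names are made up for illustration.

```python
import re

text = "2020-04-02Bean Inc"

# General alternative first: \d+ succeeds right away on "2020",
# so the date alternative is never tried.
general_first = re.match(r"\d+|\d{4}-\d{2}-\d{2}", text)

# Specific alternative first: the full date is matched.
specific_first = re.match(r"\d{4}-\d{2}-\d{2}|\d+", text)

print(general_first.group())   # 2020
print(specific_first.group())  # 2020-04-02
```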

Try this:

my_re = re.compile(r'^([0-9]{4}-[0-9]{2}-[0-9]{2,}|\d+|N(?:aN|one)|)(\D.*)$', re.IGNORECASE)

The re.IGNORECASE flag says to ignore differences in case.
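
A minimal sketch of applying the compiled pattern to a few of the sample lines follows. As written, the second group requires at least one non-digit character, so a line that consists only of a date is split after the year. The second pattern, my_re_optional_rest, is a hypothetical variant that is not part of the answer; it assumes an empty second group is acceptable, as in the expected output [2020-04-02][].

```python
import re

# Pattern as given above.
my_re = re.compile(
    r'^([0-9]{4}-[0-9]{2}-[0-9]{2,}|\d+|N(?:aN|one)|)(\D.*)$',
    re.IGNORECASE)

# Hypothetical variant (an assumption, not from the answer):
# the second group may also be empty.
my_re_optional_rest = re.compile(
    r'^([0-9]{4}-[0-9]{2}-[0-9]{2,}|\d+|N(?:aN|one)|)(\D.*|)$',
    re.IGNORECASE)

samples = [
    "2020-10-2125Chavez and Sons",
    "NaNRobinson, Mcmahon and Atkins",
    "Carpenter and Sons",
    "2020-04-02",
]

for line in samples:
    # Both patterns match every sample line, so .groups() is safe here.
    print(my_re.match(line).groups(),
          my_re_optional_rest.match(line).groups())
```

The two patterns only differ on the last sample: the original yields ('2020', '-04-02'), the variant ('2020-04-02', '').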

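Also, note that the quantifier {0,} is better written as *; but here you want to require at least one match, or else fall back to the more general pattern, so what you actually need is + (which can also be written {1,}; but again, prefer the more concise standard notation). The square brackets around \D are not needed, because it already is a character class on its own (but if you want to combine two character classes, as in [-\d], you do need the square brackets).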
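
A short sketch of why the quantifier matters, using the character classes from the question. The trailing empty alternative in the second pattern is an addition made here so that lines without a prefix still match; the variable names are made up.

```python
import re

line = "NaNRobinson, Mcmahon and Atkins"

# With {0,} (the same as *), the first alternative happily matches the
# empty string, so the second alternative is never tried.
star = re.match(r"^([-\d]{0,}|[NnaAOoEe]{0,})(.*)", line)

# With +, the first alternative fails on "N" and the engine falls
# through to the second one.
plus = re.match(r"^([-\d]+|[NnaAOoEe]+|)(.*)", line)

print(star.groups())  # ('', 'NaNRobinson, Mcmahon and Atkins')
print(plus.groups())  # ('NaN', 'Robinson, Mcmahon and Atkins')
```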

Demo: https://ideone.com/Qwp5ao

Finally, note that the standard Python convention for naming local variables prefers snake_case over dromedaryCase. (See also Wikipedia.)
