Python regex: non-greedy way to match/select a string in quotes, when the string sometimes contains brackets, commas, backslashes and full stops



I want to match these lines:

Line 1: ,["0x3bad08fb87bc906f:0x74d6f6242d49ab18","SRI VIVEKANANDA MATRIC HIGHER SECONDARY SCHOOL Ambur (Spiritual, Modern Scientific Education)",null,[null,null,12.784799699999999,78.7137085]
Line 2: ,["0x3bad08e4f337028d:0x5635e172ff9d7570","Sudha Nursery \u0026 Primary School",null,[null,null,12.7849528,78.7159848]
Line 3: ,["0x3bad08e6a3dfe635:0x4ea2fcc42c9f7ce","As-Shukoor School",null,[null,null,12.7854174,78.7196367]

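My observation is that every line starts with a comma (,) and ends with a closing square bracket (]), contains "null" three times, and then has two numbers with 5 to 16 decimal places. I only want to extract the string in the second pair of quotes and the two decimal numbers at the end.

I have tried a bit, but I am confused about how to match the quoted string, which sometimes includes brackets, full stops, backslashes, spaces, commas and minus signs. This is my half-finished expression/pattern: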

(r'^,\["0x[0-9a-z]{16}:0x[0-9a-z]{16}","(.*?)",null,\[null,null,(\d\d\.\d{5,16}),(\d\d\.\d{5,16})\]')

But this does not work. Any help is much appreciated.

Use this regex together with the re.M flag:

^,\["0x[a-f0-9]{16}:0x[a-f0-9]{16}","([^"]*)",null,\[null,null,(\d+\.\d{5,16}),(\d+\.\d{5,16})\]$

See Regex Demo

Most of the regex above is straightforward. To match the quoted string, I assume that the string itself does not contain a " character. So I use:

"([^"]*)"

This matches 0 or more non-" characters inside double quotes and places them in capture group 1. It is a more efficient alternative to "(.*?)".
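
Besides efficiency, the two forms can also behave differently once more of the pattern follows the closing quote. The sketch below illustrates this with a made-up sample string `s` (not taken from the data above): the lazy dot is allowed to run across quotes until the rest of the pattern fits, while the negated class can never cross a double quote.

```python
import re

# Hypothetical sample: two quoted fields followed by null.
s = '"x","y",null'

# Lazy dot: expands past quotes until ",null can match.
lazy = re.search(r'"(.*?)",null', s)
# Negated class: the captured text can never contain a double quote.
strict = re.search(r'"([^"]*)",null', s)

print(lazy[1])    # x","y
print(strict[1])  # y
```

So when a field is known not to contain a double quote, the negated class also keeps the capture confined to that one field.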

import re

lines = """,["0x3bad08fb87bc906f:0x74d6f6242d49ab18","SRI VIVEKANANDA MATRIC HIGHER SECONDARY SCHOOL Ambur (Spiritual, Modern Scientific Education)",null,[null,null,12.784799699999999,78.7137085]
,["0x3bad08e4f337028d:0x5635e172ff9d7570","Sudha Nursery \u0026 Primary School",null,[null,null,12.7849528,78.7159848]
,["0x3bad08e6a3dfe635:0x4ea2fcc42c9f7ce","As-Shukoor School",null,[null,null,12.7854174,78.7196367]
"""
# Strict pattern: both hex values must have exactly 16 digits.
rex = re.compile(r'^,\["0x[a-f0-9]{16}:0x[a-f0-9]{16}","([^"]*)",null,\[null,null,(\d+\.\d{5,16}),(\d+\.\d{5,16})\]$', re.M)
for m in rex.finditer(lines):
    # m[1] is the quoted name, m[2] and m[3] are the two decimal numbers.
    print(m[1], m[2], m[3])

Prints:

SRI VIVEKANANDA MATRIC HIGHER SECONDARY SCHOOL Ambur (Spiritual, Modern Scientific Education) 12.784799699999999 78.7137085
Sudha Nursery & Primary School 12.7849528 78.7159848

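It does not match line 3, because the second hex value on that line, 0x4ea2fcc42c9f7ce, has only 15 hex digits ("nibbles", i.e. half-bytes) instead of the 16 that the pattern requires.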
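
The number of hex digits in each identifier can be checked directly. The following is only a quick sanity check; the variable names are illustrative.

```python
# The three identifiers copied from the sample lines.
ids = [
    "0x3bad08fb87bc906f:0x74d6f6242d49ab18",
    "0x3bad08e4f337028d:0x5635e172ff9d7570",
    "0x3bad08e6a3dfe635:0x4ea2fcc42c9f7ce",
]

# Count the hex digits in each half, excluding the "0x" prefix.
digit_counts = [[len(part) - 2 for part in ident.split(":")] for ident in ids]
for ident, counts in zip(ids, digit_counts):
    print(ident, counts)
```

Only the last identifier has a half with 15 digits, which is why a {16} quantifier rejects that line.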

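Update

If you assume that every line should match and you want a more lenient regex because there may be some variation between lines (for example, inserted whitespace), then you might want to use this instead (with the re.M flag):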

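^,[^[]*\["0x[a-f0-9]+:0x[a-f0-9]+"[^"]*"([^"]*)"\D*(\d+\.\d{5,16}),(\d+\.\d{5,16})
  1. ^ matches the start of a line
  2. , matches the leading ,
  3. [^[]* matches 0 or more non-[ characters
  4. \[ matches [
  5. "0x[a-f0-9]+:0x[a-f0-9]+" matches a quoted pair of hex values of any length, separated by :
  6. [^"]* matches 0 or more non-" characters
  7. "([^"]*)" matches a quoted string and captures its contents in group 1
  8. \D* matches 0 or more non-digits
  9. (\d+\.\d{5,16}) matches a decimal number, captured in group 2
  10. , matches ,
  11. (\d+\.\d{5,16}) matches a decimal number, captured in group 3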
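
The same breakdown can be kept next to the pattern itself by compiling it with re.X (re.VERBOSE), which ignores whitespace in the pattern and allows # comments. This is just one possible way to write it; the names LENIENT and sample are chosen here for illustration, and the [ inside the character class is escaped, which is equivalent to [^[]*.

```python
import re

LENIENT = re.compile(r'''
    ^,                          # start of line, then the leading comma
    [^\[]*                      # 0 or more characters that are not [
    \[                          # a literal [
    "0x[a-f0-9]+:0x[a-f0-9]+"   # quoted hex pair of any length, separated by :
    [^"]*                       # skip to the next double quote
    "([^"]*)"                   # group 1: contents of the quoted string
    \D*                         # skip 0 or more non-digits
    (\d+\.\d{5,16})             # group 2: first decimal number
    ,                           # a literal comma
    (\d+\.\d{5,16})             # group 3: second decimal number
''', re.M | re.X)

# Line 3 of the sample data, whose second hex value has only 15 digits.
sample = ',["0x3bad08e6a3dfe635:0x4ea2fcc42c9f7ce","As-Shukoor School",null,[null,null,12.7854174,78.7196367]'
m = LENIENT.search(sample)
print(m.groups())  # ('As-Shukoor School', '12.7854174', '78.7196367')
```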

See Regex Demo

import re

lines = """,["0x3bad08fb87bc906f:0x74d6f6242d49ab18","SRI VIVEKANANDA MATRIC HIGHER SECONDARY SCHOOL Ambur (Spiritual, Modern Scientific Education)",null,[null,null,12.784799699999999,78.7137085]
,["0x3bad08e4f337028d:0x5635e172ff9d7570","Sudha Nursery \u0026 Primary School",null,[null,null,12.7849528,78.7159848]
,["0x3bad08e6a3dfe635:0x4ea2fcc42c9f7ce","As-Shukoor School",null,[null,null,12.7854174,78.7196367]
"""
# Lenient pattern: hex values of any length, anything non-numeric before the numbers.
rex = re.compile(r'^,[^[]*\["0x[a-f0-9]+:0x[a-f0-9]+"[^"]*"([^"]*)"\D*(\d+\.\d{5,16}),(\d+\.\d{5,16})', re.M)
for m in rex.finditer(lines):
    print(m[1], m[2], m[3])

Prints:

SRI VIVEKANANDA MATRIC HIGHER SECONDARY SCHOOL Ambur (Spiritual, Modern Scientific Education) 12.784799699999999 78.7137085
Sudha Nursery & Primary School 12.7849528 78.7159848
As-Shukoor School 12.7854174 78.7196367

Update 2

If you want to be really lenient, again assuming that every line should match:

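^[^"]*"[^"]*"[^"]*"([^"]*)"\D*(\d+\.\d+)\D*(\d+\.\d+)
  1. ^ matches the start of a line
  2. [^"]*"[^"]*" skips to and matches the first quoted string
  3. [^"]*"([^"]*)" skips to and matches the second quoted string, capturing its contents in group 1
  4. \D*(\d+\.\d+) skips to the next digit and captures a decimal number in group 2
  5. \D*(\d+\.\d+) skips to the next digit and captures a decimal number in group 3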

import re

lines = """,["0x3bad08fb87bc906f:0x74d6f6242d49ab18","SRI VIVEKANANDA MATRIC HIGHER SECONDARY SCHOOL Ambur (Spiritual, Modern Scientific Education)",null,[null,null,12.784799699999999,78.7137085]
,["0x3bad08e4f337028d:0x5635e172ff9d7570","Sudha Nursery \u0026 Primary School",null,[null,null,12.7849528,78.7159848]
,["0x3bad08e6a3dfe635:0x4ea2fcc42c9f7ce","As-Shukoor School",null,[null,null,12.7854174,78.7196367]
"""
# Very lenient pattern; the trailing .*$ consumes the rest of each line.
rex = re.compile(r'^[^"]*"[^"]*"[^"]*"([^"]*)"\D*(\d+\.\d+)\D*(\d+\.\d+).*$', re.M)
for m in rex.finditer(lines):
    print(m[1], m[2], m[3])
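
If the two numbers are needed as numeric values rather than strings, the captured groups can be passed through float(). A minimal sketch, assuming the very lenient pattern above; extract_records is a hypothetical helper name, not part of the original answer, and the sample uses two of the lines shown earlier.

```python
import re

# The very lenient pattern, without the optional trailing .*$
REX = re.compile(r'^[^"]*"[^"]*"[^"]*"([^"]*)"\D*(\d+\.\d+)\D*(\d+\.\d+)', re.M)

def extract_records(text):
    """Return (name, first_number, second_number) tuples, numbers as floats."""
    return [(m[1], float(m[2]), float(m[3])) for m in REX.finditer(text)]

sample = (
    ',["0x3bad08e4f337028d:0x5635e172ff9d7570","Sudha Nursery & Primary School",null,[null,null,12.7849528,78.7159848]\n'
    ',["0x3bad08e6a3dfe635:0x4ea2fcc42c9f7ce","As-Shukoor School",null,[null,null,12.7854174,78.7196367]\n'
)
records = extract_records(sample)
for record in records:
    print(record)
```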
