Python: regex to match anything inside parentheses (including other parentheses)



I'm using Python and regular expressions, and I'm trying to convert a string that looks like this:

(1694439,805577453641105408,'"@Bessemerband not reverse gear  simply pointing out that I didn\'t say what you claim I said. I will absolutely riot if (Brexit) is blocked."',2887640,NULL,NULL,NULL),(1649240,805577446758158336,'"Ugh FFS the people you use to look up to fail to use critical thinking. Smh. He did the same thing with brexit :( "',2911510,NULL,NULL,NULL),

into a list that looks like this:

[
    [1694439, 805577453641105408, '"@Bessemerband not reverse gear  simply pointing out that I didn\'t say what you claim I said. I will absolutely riot if (Brexit) is blocked."', 2887640, NULL, NULL, NULL],
    [1649240, 805577446758158336, '"Ugh FFS the people you use to look up to fail to use critical thinking. Smh. He did the same thing with brexit :("', 2911510, NULL, NULL, NULL]
]

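The main problem is that, as you can see, the text itself also contains parentheses, and I don't want to split on those. I've tried something like ([^)]+), but obviously that stops at the first ) it finds.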
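To make the failure concrete, here is a minimal sketch. It assumes the attempted pattern was along the lines of \(([^)]+)\) and uses a shortened, made-up sample string rather than the real data:

```python
import re

# Shortened, hypothetical sample in the same shape as the real input.
sample = "(1,'riot if (Brexit) is blocked.',NULL),(2,'ok',NULL),"

# [^)]+ can never cross a ")", so the first record is cut off
# at the ")" that belongs to "(Brexit)" inside the quoted text.
matches = re.findall(r"\(([^)]+)\)", sample)
print(matches)
```

The first match ends in the middle of the quoted text, which is the behaviour described above.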

Any clues on how to solve this?

Is this the output you're looking for?

big = """(1694439,805577453641105408,'"@Bessemerband not reverse gear  simply pointing out that I didn't say what you claim I said. I will absolutely riot if (Brexit) is blocked."',2887640,NULL,NULL,NULL),(1649240,805577446758158336,'"Ugh FFS the people you use to look up to fail to use critical thinking. Smh. He did the same thing with brexit :( "',2911510,NULL,NULL,NULL),"""
small = big.split('),')
print(small)

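What I'm doing is splitting on "),", then looping over the pieces and splitting each one on commas as usual. I'll show a basic approach that can certainly be optimized: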

new_list = []
for x in small:
    new_list.append(x.split(','))
print(new_list)

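The downside is that the result ends with an empty entry (a list holding a single empty string), but you can remove that afterwards.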
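A possible cleanup of the split approach, as a rough sketch: it drops the empty trailing piece and strips the leading "(" from each record. The sample string is a shortened stand-in for the real data. This only works while the quoted text contains neither commas nor the sequence "),".

```python
# Shortened, hypothetical sample in the same shape as the real input.
big = "(1,2,'a (b) c',3,NULL),(4,5,'d :( ',6,NULL),"

rows = []
for chunk in big.split('),'):
    chunk = chunk.lstrip('(')   # remove the opening parenthesis of the record
    if not chunk:               # skip the empty piece after the final "),"
        continue
    rows.append(chunk.split(','))
print(rows)
```

All values stay strings here; converting the numbers and NULL is left out to keep the sketch short.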

Here is a simple regex solution that captures each comma-separated value in its own group:

\(([^,]*),([^,]*),'((?:\\.|[^'])*)',([^,]*),([^,]*),([^,]*),([^)]*)

Usage:

import re

input_string = r"""(1694439,805577453641105408,'"@Bessemerband not reverse gear  simply pointing out that I didn\'t say what you claim I said. I will absolutely riot if (Brexit) is blocked."',2887640,NULL,NULL,NULL),(1649240,805577446758158336,'"Ugh FFS the people you use to look up to fail to use critical thinking. Smh. He did the same thing with brexit :( "',2911510,NULL,NULL,NULL),"""
# findall returns one tuple of seven captured fields per record
result = re.findall(r"\(([^,]*),([^,]*),'((?:\\.|[^'])*)',([^,]*),([^,]*),([^,]*),([^)]*)", input_string)
print(result)
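If you want real Python values instead of tuples of strings, one way is a small post-processing step. This is only a sketch: the helper name convert, the mapping of NULL to None, and the shortened sample string are my own assumptions, not part of the question.

```python
import re

ROW = re.compile(r"\(([^,]*),([^,]*),'((?:\\.|[^'])*)',([^,]*),([^,]*),([^,]*),([^)]*)")

def convert(field):
    """Hypothetical helper: NULL -> None, digit strings -> int, else unchanged."""
    if field == "NULL":
        return None
    return int(field) if field.isdigit() else field

# Shortened, hypothetical sample in the same shape as the real input.
data = r"""(1,22,'"didn\'t say (Brexit) :( "',333,NULL,NULL,NULL),"""

rows = []
for match in ROW.findall(data):
    row = [convert(f) for f in match]
    row[2] = match[2]          # keep the text field exactly as captured
    rows.append(row)
print(rows)
```

The text field keeps its backslash escapes; unescaping it is a separate step if you need it.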

The nested parentheses aren't a problem here, because they are enclosed in quotes. All you have to do is match the quoted parts separately:

import re
# unquoted field | quoted field (with backslash escapes) | "(" in group 1 | ")" in group 2
pat = re.compile(r"[^()',]+|'[^'\\]*(?:\\.[^'\\]*)*'|(\()|(\))", re.DOTALL)
s = r'''(1694439,805577453641105408,'"@Bessemerband not reverse gear  simply pointing out that I didn\'t say what you claim I said. I will absolutely riot if (Brexit) is blocked."',2887640,NULL,NULL,NULL),(1649240,805577446758158336,'"Ugh FFS the people you use to look up to fail to use critical thinking. Smh. He did the same thing with brexit :( "',2911510,NULL,NULL,NULL),'''
result = []
for m in pat.finditer(s):
    if m.group(1):
        tmplst = []
    elif m.group(2):
        result.append(tmplst)        
    else:
        tmplst.append(m.group(0))
print(result)

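If your string can also contain parentheses that are not enclosed in quotes, you can use a recursive pattern with the regex module (using it together with the csv module is a good idea), or build a state machine to solve the problem.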
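For the state machine option, a minimal sketch could look like the following. The function name parse_records and the sample input are made up for illustration, and it assumes single-quoted strings with backslash escapes, as in the question's data:

```python
def parse_records(text):
    """Split "(a,'b',f(c,d)),(...)," into lists of raw field strings."""
    records, fields, buf = [], [], []
    depth = 0           # parenthesis nesting level outside quotes
    in_quote = False    # True while inside a single-quoted string
    i = 0
    while i < len(text):
        ch = text[i]
        if in_quote:
            buf.append(ch)
            if ch == "\\" and i + 1 < len(text):
                buf.append(text[i + 1])   # keep the escaped character as-is
                i += 1
            elif ch == "'":
                in_quote = False
        elif ch == "'":
            in_quote = True
            buf.append(ch)
        elif ch == "(":
            depth += 1
            if depth > 1:
                buf.append(ch)            # nested "(" belongs to the field
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:                # end of a record
                fields.append("".join(buf))
                records.append(fields)
                fields, buf = [], []
            else:
                buf.append(ch)            # nested ")" belongs to the field
        elif ch == "," and depth == 1:    # field separator
            fields.append("".join(buf))
            buf = []
        elif depth >= 1:
            buf.append(ch)
        i += 1
    return records

# Hypothetical input with unquoted nested parentheses.
sample = r"(1,'a\'b (x) :( ',f(2,3),NULL),(4,'c',g(),NULL),"
parsed = parse_records(sample)
print(parsed)
```

Characters between records (the separating commas) are ignored, and all fields are returned as strings, quotes included.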
