I'm using Python and regular expressions, and I'm trying to convert a string like this:
(1694439,805577453641105408,'"@Bessemerband not reverse gear simply pointing out that I didn\'t say what you claim I said. I will absolutely riot if (Brexit) is blocked."',2887640,NULL,NULL,NULL),(1649240,805577446758158336,'"Ugh FFS the people you use to look up to fail to use critical thinking. Smh. He did the same thing with brexit :( "',2911510,NULL,NULL,NULL),
into a list like this:
[
[1694439, 805577453641105408, '"@Bessemerband not reverse gear simply pointing out that I didn\'t say what you claim I said. I will absolutely riot if (Brexit) is blocked."', 2887640, NULL, NULL, NULL],
[1649240, 805577446758158336, '"Ugh FFS the people you use to look up to fail to use critical thinking. Smh. He did the same thing with brexit :("', 2911510, NULL, NULL, NULL]
]
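The main problem here is that, as you can see, the text itself also contains parentheses, and I don't want to split on those. I have already tried something like \([^)]+\), but obviously that stops at the first ) it finds.
Any clue how to solve this?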
Is this the output you are looking for?
big = """(1694439,805577453641105408,'"@Bessemerband not reverse gear simply pointing out that I didn't say what you claim I said. I will absolutely riot if (Brexit) is blocked."',2887640,NULL,NULL,NULL),(1649240,805577446758158336,'"Ugh FFS the people you use to look up to fail to use critical thinking. Smh. He did the same thing with brexit :( "',2911510,NULL,NULL,NULL),"""
small = big.split('),')
print(small)
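What I am doing is splitting on ), and then looping over the pieces and splitting each one on commas as usual. Here is a basic approach that can certainly be optimized: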
new_list = []
for x in small:
    # Split each record into its comma-separated fields.
    new_list.append(x.split(','))
print(new_list)
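The downside of doing it this way is that the result ends with an empty entry ([''], from the text after the last ),), but you can remove it afterwards. Each record also keeps its leading (, and any comma inside the quoted text would be split as well.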
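A slightly tidier variant of the same split-based idea could look like the sketch below. It assumes that neither `),(` nor a bare comma ever occurs inside the quoted text, which holds for the sample in the question but not in general. The helper name `split_rows` and the shortened sample string are made up for illustration.

```python
def split_rows(text):
    """Split '(a,b),(c,d),' into [['a', 'b'], ['c', 'd']] using plain str methods."""
    # Drop the trailing comma, then the outermost parentheses.
    trimmed = text.rstrip(',').strip('()')
    # Records are separated by '),(' and fields by ','.
    return [record.split(',') for record in trimmed.split('),(')]


# Hypothetical, shortened sample in the same shape as the question's data.
sample = """(1,10,'"a (b) c"',5,NULL),(2,20,'"d :( "',6,NULL),"""
rows = split_rows(sample)
print(rows)
```

Because the record separator is `),(` rather than `),`, there is no empty trailing entry and no leftover opening parenthesis to clean up.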
Here is a simple regex solution that captures each comma-separated value in its own group:
\(([^,]*),([^,]*),'((?:\\.|[^'])*)',([^,]*),([^,]*),([^,]*),([^)]*)\)
Usage:
import re
# Raw string, so the backslash-escaped apostrophe in didn\'t reaches the regex unchanged.
input_string = r"""(1694439,805577453641105408,'"@Bessemerband not reverse gear simply pointing out that I didn\'t say what you claim I said. I will absolutely riot if (Brexit) is blocked."',2887640,NULL,NULL,NULL),(1649240,805577446758158336,'"Ugh FFS the people you use to look up to fail to use critical thinking. Smh. He did the same thing with brexit :( "',2911510,NULL,NULL,NULL),"""
# Each match is one record; findall returns one tuple of 7 captured fields per record.
result = re.findall(r"\(([^,]*),([^,]*),'((?:\\.|[^'])*)',([^,]*),([^,]*),([^,]*),([^)]*)\)", input_string)
print(result)
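Since `re.findall` returns tuples of strings, a small post-processing step can turn them into lists with typed values. This is only a sketch: the regex is written with the escaped outer parentheses `\(` and `\)` and the escape-aware `\\.` alternative, the `convert` helper is a hypothetical name, and the shortened sample string is invented for illustration. It assumes the fields are either `NULL`, unsigned integers, or text.

```python
import re

# Hypothetical, shortened sample in the same shape as the question's data.
sample = r"""(1,10,'"I didn\'t (say) that"',5,NULL,NULL,NULL),(2,20,'"ok :( "',6,NULL,NULL,NULL),"""

# One record: seven fields, the third one single-quoted with backslash escapes.
ROW = re.compile(r"\(([^,]*),([^,]*),'((?:\\.|[^'])*)',([^,]*),([^,]*),([^,]*),([^)]*)\)")


def convert(field):
    """Map NULL to None and digit-only fields to int; leave other text unchanged."""
    if field == "NULL":
        return None
    if field.isdigit():
        return int(field)
    return field


rows = [[convert(field) for field in match] for match in ROW.findall(sample)]
print(rows)
```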
Nested parentheses aren't a problem here, because they are enclosed between quotes. All you have to do is match the quoted parts separately:
import re
# Alternatives: a bare field | a quoted string with backslash escapes | "(" (group 1) | ")" (group 2)
pat = re.compile(r"[^()',]+|'[^'\\]*(?:\\.[^'\\]*)*'|(\()|(\))", re.DOTALL)
s = r'''(1694439,805577453641105408,'"@Bessemerband not reverse gear simply pointing out that I didn\'t say what you claim I said. I will absolutely riot if (Brexit) is blocked."',2887640,NULL,NULL,NULL),(1649240,805577446758158336,'"Ugh FFS the people you use to look up to fail to use critical thinking. Smh. He did the same thing with brexit :( "',2911510,NULL,NULL,NULL),'''
result = []
for m in pat.finditer(s):
    if m.group(1):
        # Opening parenthesis: start a new record.
        tmplst = []
    elif m.group(2):
        # Closing parenthesis: the record is complete.
        result.append(tmplst)
    else:
        # Any other match is a field value.
        tmplst.append(m.group(0))
print(result)
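If your string can also contain parentheses that are not enclosed between quotes, you can use a recursive pattern with the regex module (combining it with the csv module is a good idea) or build a state machine to solve the problem.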
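A minimal sketch of the state-machine option, using only the standard library, might look like this. It assumes single-quoted strings with backslash escapes and balanced parentheses outside quotes. The function name `parse_rows` and the sample string are hypothetical and not taken from the question.

```python
def parse_rows(text):
    """Split "(a,'b',c),(d,...)," into lists of field strings, character by character."""
    rows, row, field = [], None, []
    in_quotes = False
    depth = 0  # nesting depth of unquoted parentheses inside the current row
    i = 0
    while i < len(text):
        ch = text[i]
        if in_quotes:
            field.append(ch)
            if ch == "\\" and i + 1 < len(text):
                i += 1
                field.append(text[i])  # keep the escaped character verbatim
            elif ch == "'":
                in_quotes = False
        elif row is None:
            if ch == "(":
                row, field = [], []  # start of a new row
        elif ch == "'":
            in_quotes = True
            field.append(ch)
        elif ch == "(":
            depth += 1
            field.append(ch)
        elif ch == ")" and depth > 0:
            depth -= 1
            field.append(ch)
        elif ch == ")":
            row.append("".join(field))
            rows.append(row)
            row = None  # row closed; skip everything up to the next "("
        elif ch == "," and depth == 0:
            row.append("".join(field))
            field = []
        else:
            field.append(ch)
        i += 1
    return rows


# Hypothetical sample with parentheses and commas both inside and outside quotes.
sample = r"""(1,'"didn\'t (x), ok"',f(a,b),NULL),(2,'":( "',g(),NULL),"""
print(parse_rows(sample))
```

Tracking `in_quotes` and `depth` separately is what lets commas and parentheses appear inside quoted text or nested unquoted groups without ending a field or a row.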