Python RegEx nested search and replace



I need to do a RegEx search and replace every comma inside a quoted block with \,.
For example

"thing1,blah","thing2,blah","thing3,blah",thing4  

needs to become

"thing1\,blah","thing2\,blah","thing3\,blah",thing4
My code:

import re

inFile  = open(inFileName,'r')
inFileRl = inFile.readlines()
inFile.close()
p = re.compile(r'["]([^"]*)["]')
for line in inFileRl:
    pg = p.search(line)
    # found comment block
    if pg:
        q  = re.compile(r'[^\\],')
        # found comma within comment block
        qg = q.search(pg.group(0))
        if qg:
            # Here I want to reconstitute the line and print it with the replaced text
            #print(re.sub(r'([^\\]),', r'\1\\,', pg.group(0)))
            pass

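I need to pick out the columns I want with a RegEx, filter those further,
then do the RegEx replacement, and finally reconstitute the line.

How can I do this in Python?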

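The csv module is a good fit for parsing data like this, because csv.reader in the default dialect ignores commas inside quotes. csv.writer puts the quotes back because the fields contain commas. I used StringIO to give a string a file-like interface.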

import csv
import io

s = '''"thing1,blah","thing2,blah","thing3,blah"
"thing4,blah","thing5,blah","thing6,blah"'''
source = io.StringIO(s)
dest = io.StringIO()
rdr = csv.reader(source)
wtr = csv.writer(dest)
for row in rdr:
    # un-escape any comma that is already escaped, then escape every comma once
    wtr.writerow([item.replace('\\,', ',').replace(',', '\\,') for item in row])
print(dest.getvalue())
Result:

"thing1\,blah","thing2\,blah","thing3\,blah"
"thing4\,blah","thing5\,blah","thing6\,blah"

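As a rough sketch of how the csv approach behaves on the exact line from the question (with the unquoted thing4 at the end), the helper below wraps the same reader/writer round trip in a function. The name escape_quoted_commas is made up for illustration, and lineterminator='\n' is an assumption so the output ends in a plain newline. Note that this escapes commas in every field, whether or not it was quoted in the input; an unquoted field simply cannot contain a comma in the first place.

```python
import csv
import io

def escape_quoted_commas(text):
    """Round-trip CSV text, writing every comma inside a field as \\, (hypothetical helper)."""
    source = io.StringIO(text, newline='')
    dest = io.StringIO(newline='')
    writer = csv.writer(dest, lineterminator='\n')
    for row in csv.reader(source):
        # un-escape first so commas that are already escaped are not doubled
        writer.writerow([field.replace('\\,', ',').replace(',', '\\,') for field in row])
    return dest.getvalue()

line = '"thing1,blah","thing2,blah","thing3,blah",thing4'
print(escape_quoted_commas(line))
# prints "thing1\,blah","thing2\,blah","thing3\,blah",thing4
```

thing4 comes back without quotes because csv.writer only quotes fields that need it.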
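General edit

There was

"thing1\,blah","thing2\,blah","thing3\,blah",thing4   
in the question, and now it is not there anymore.

Moreover, I had not noticed r'[^\\],'

So I completely rewrote my answer.

"thing1,blah","thing2,blah","thing3,blah",thing4

and

"thing1\,blah","thing2\,blah","thing3\,blah",thing4

are displays of the strings (I suppose)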

import re

ss = '"thing1,blah","thing2,blah","thing3\\,blah",thing4 '
# matches one quoted block at a time
regx = re.compile('"[^"]*"')
# ri matches a comma that is not already preceded by a backslash
def repl(mat, ri = re.compile('(?<!\\\\),') ):
    return ri.sub('\\\\,',mat.group())
print(ss)
print(repr(ss))
print()
print(     regx.sub(repl, ss))
print(repr(regx.sub(repl, ss)))
Result

"thing1,blah","thing2,blah","thing3\,blah",thing4 
'"thing1,blah","thing2,blah","thing3\\,blah",thing4 '

"thing1\,blah","thing2\,blah","thing3\,blah",thing4 
'"thing1\\,blah","thing2\\,blah","thing3\\,blah",thing4 '

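For what it's worth, here is a small sketch of the same two-level idea (an outer regex that finds the quoted blocks, an inner one that escapes commas only inside each block) packaged as a function. The names QUOTED_BLOCK, UNESCAPED_COMMA and escape_commas_in_quotes are mine, not from the question. Thanks to the negative lookbehind, running it twice should give the same result as running it once.

```python
import re

QUOTED_BLOCK = re.compile(r'"[^"]*"')        # one "..." block
UNESCAPED_COMMA = re.compile(r'(?<!\\),')    # a comma not preceded by a backslash

def escape_commas_in_quotes(line):
    """Escape commas inside quoted blocks only (hypothetical helper)."""
    def repl(match):
        # r'\\,' in a replacement template produces a backslash followed by a comma
        return UNESCAPED_COMMA.sub(r'\\,', match.group())
    return QUOTED_BLOCK.sub(repl, line)

line = '"thing1,blah","thing2,blah","thing3\\,blah",thing4,thing5'
once = escape_commas_in_quotes(line)
print(once)
# prints "thing1\,blah","thing2\,blah","thing3\,blah",thing4,thing5
print(escape_commas_in_quotes(once) == once)
# prints True
```

The commas between fields, and the one between thing4 and thing5, are left alone because they are outside any quoted block.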
You can try this regex.


>>> re.sub('(?<!"),(?!")', r"\\,",
...        '"thing1,blah","thing2,blah","thing3,blah",thing4')
#Gives "thing1\,blah","thing2\,blah","thing3\,blah",thing4

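The logic behind this is to replace , with \, whenever the comma is neither immediately preceded nor immediately followed by a ". It does not actually check whether the comma is inside a quoted block.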
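To make the behaviour of that lookaround pattern concrete, here is a short sketch with a few inputs I made up (they are not from the question). The function name escape_by_lookaround is hypothetical. It works on the question's sample, but it also escapes a comma between two unquoted fields and misses a quoted comma that sits right next to a quote.

```python
import re

# comma with no double quote directly before it and none directly after it
LOOKAROUND = re.compile(r'(?<!"),(?!")')

def escape_by_lookaround(line):
    return LOOKAROUND.sub(r'\\,', line)

print(escape_by_lookaround('"thing1,blah",thing4'))
# prints "thing1\,blah",thing4

print(escape_by_lookaround('"a,b",c,d'))
# prints "a\,b",c\,d   (the comma between the unquoted c and d is escaped too)

print(escape_by_lookaround('"a,",b'))
# prints "a,",b        (the quoted comma is followed by a quote, so it is skipped)
```

So this one-liner is fine when every field is quoted and no field starts or ends with a comma; otherwise matching the quoted blocks first is safer.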

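I came up with an iterative solution using several regex functions:
finditer(), findall(), group(), start() and end()
There should be a way to turn all of this into a recursive function that calls itself.
Any takers?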

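import re

# outfileName and outfileRl (the list of input lines, e.g. from readlines())
# are assumed to be defined earlier.
outfile  = open(outfileName,'w')
p = re.compile(r'["]([^"]*)["]')
q = re.compile(r'([^\\])(,)')
for line in outfileRl: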
    pg = p.finditer(line)
    pglen = len(p.findall(line))
    if pglen > 0:
        mpgstart = 0;
        mpgend   = 0;
        for i,mpg in enumerate(pg):
            if i == 0:
                outfile.write(line[:mpg.start()])
            qg    = q.finditer(mpg.group(0))
            qglen = len(q.findall(mpg.group(0)))
            if i > 0 and i < pglen:
                outfile.write(line[mpgend:mpg.start()])
            if qglen > 0:
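                mqgend = 0
                for j,mqg in enumerate(qg):
                    # text before this match: from the block start or the previous match end
                    outfile.write( mpg.group(0)[mqgend:mqg.start()] )
                    outfile.write( re.sub(r'([^\\])(,)',r'\1\\\2',mqg.group(0)) )
                    mqgend = mqg.end()
                    if j == (qglen-1):
                        outfile.write( mpg.group(0)[mqg.end():]      )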
            else:
                outfile.write(mpg.group(0))
            if i == (pglen-1):
                outfile.write(line[mpg.end():])
            mpgstart = mpg.start()
            mpgend   = mpg.end()
    else:
        outfile.write(line)
outfile.close()
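One possible take on the recursive version asked about above, as a minimal sketch rather than a drop-in replacement: handle the first quoted block, then recurse on the rest of the line. The names escape_recursive, QUOTED and BARE_COMMA are invented for this example, and it returns the new line instead of writing to a file. It uses a lookbehind for the "not already escaped" test, which is a slightly different pattern from the [^\\] group in the loop above.

```python
import re

QUOTED = re.compile(r'"[^"]*"')         # one "..." block
BARE_COMMA = re.compile(r'(?<!\\),')    # a comma not already escaped

def escape_recursive(line):
    """Escape commas in the first quoted block, then recurse on the remainder."""
    match = QUOTED.search(line)
    if match is None:
        # base case: no quoted block left, return the text unchanged
        return line
    block = BARE_COMMA.sub(r'\\,', match.group(0))
    return line[:match.start()] + block + escape_recursive(line[match.end():])

print(escape_recursive('"thing1,blah","thing2,blah","thing3,blah",thing4'))
# prints "thing1\,blah","thing2\,blah","thing3\,blah",thing4
```

The recursion depth equals the number of quoted blocks on a line, so for very wide lines an iterative re.sub with a callback is probably the more practical choice.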

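Have you looked at str.replace()?

str.replace(old, new[, count]) returns a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.

Here is some documentation

Hope this helps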
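A quick sketch of what str.replace() does on a line like the one in the question (the sample string is shortened and chosen by me). Since str.replace() knows nothing about quotes, it escapes the commas between fields as well, so on its own it only helps once the quoted blocks have been isolated some other way.

```python
line = '"thing1,blah","thing2,blah",thing4'

# every comma is replaced, including the ones that separate fields
print(line.replace(',', '\\,'))
# prints "thing1\,blah"\,"thing2\,blah"\,thing4

# with count=1 only the first comma is replaced
print(line.replace(',', '\\,', 1))
# prints "thing1\,blah","thing2,blah",thing4
```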
