Python Regex:选择两个括号之间的所有内容,忽略中间的括号



我想捕获整个子查询,无论两者之间是否有concat或子字符串函数(即忽略子查询内的另一个括号打开和关闭)。(a)我们不想捕获"加入";作为一个词(b)"别名";不会总是跟在"join"后面。它可以是任何东西(字边界、空格或"join")。词)。

Case 1: select

中没有concat或子字符串函数
  • :(select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2 join

  • :(select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2

案例2:select

中的Concat函数
  • :(select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2 join

  • :(select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2

我试过了:

方法一:re.findall('(select.*?)s[a-zA-Z0-9_]+', input statement)

方法2:由@TheFourthBird建议

import re
pat1 = '(select.*?)s[a-zA-Z0-9_]+'
pat2 = "(select [^()]*(?:(((?>[^()]+|(?1))*)))?[^()]*)[^()n]+"
string1 = "(select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2"
string2 = "(select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2"
print(re.findall(pat1, string1))
print(re.findall(pat1, string2))
import regex as re
print(re.findall(pat2, string1))
print(re.findall(pat2, string2))
pattern = re.compile(pat2, re.UNICODE)
print([match.group(0) for match in pattern.finditer(string2)])

输出:

["(select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2"]
['(select concat(t1.col1, t2.col1, t3.col1) as']
['']
['(t1.col1, t2.col1, t3.col1)']
["(select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2 join "]

上面的方法有什么问题:

  • 方法1:适用于情况1,但不适用于情况2。

  • 方法2:仍然不工作!然而,["(select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2 join "]是最接近预期的。但是,它不应该捕获alias2旁边的内容。

请帮帮我!

您可以首先匹配(select,并可选择使用PyPi regex模块匹配平衡括号。

在模式的末尾,匹配一个空白字符并使用您的字符类。

(select [^()]*(?:(((?>[^()]+|(?1))*)))?[^()]*)s[a-zA-Z0-9_]+

在部分中,模式匹配:

  • (select匹配(select
  • [^()]*可选择匹配除()以外的任何字符
  • (?:非捕获组
    • (Capture组1
      • ((?>[^()]+|(?1))*)匹配(并使用递归模式递归第一个子组(捕获组1),最后匹配)
    • )关闭组1
  • )?关闭非捕获组并使其为可选
  • [^()]*)可选地匹配除括号外的任何字符,然后匹配)
  • s[a-zA-Z0-9_]+匹配字符类
  • 中列出的一个空白字符和1+

查看正则表达式演示和Python演示

例如,使用re.finditer(因为re.findall返回捕获组值):

import regex as re
pattern = r"(select [^()]*(?:(((?>[^()]+|(?1))*)))?[^()]*)s[a-zA-Z0-9_]+"
s = ("In: (select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2nn"
"Out: (select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2nn"
"In: (select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2nn"
"Out: (select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2n")
matches = re.finditer(pattern, s)
for matchNum, match in enumerate(matches, start=1):

print(match.group())

输出
["(select t1.col1 as alias1 from db.tb where t1.col1='val1') alias2"]
["(select concat(t1.col1, t2.col1, t3.col1) as alias1 from db.tb where t1.col1='val1') alias2"]

注意匹配SQL是容易出错的