Regex to identify Python function bodies and locate all executable (i.e. non-comment) lines



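I have a file of Python code (which may not be syntactically correct).

It contains some functions in which everything except the signature is commented out.

My goal is to detect those empty functions with a regex and remove them.

If there were only # comments it would be easier: I could look for functions where every line between two lines starting with def begins with #. The problem is that many of the functions also contain multi-line comments (actually docstrings).

If you can suggest a way to convert the multi-line comments into single-line comments, that would help too.

In case you wonder what this is for: it is part of a Python tool with which we are trying to automate some steps of code refactoring.

Input: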

def this_function_has_stuff(f, g, K):
    """ Thisfunction has stuff in it """
    if f:
       s = 0
    else:
       u =0
    return None

def fuly_commented_fucntion(f, g, K):
    """
    remove this empty function.
    Examples
    ========
    >>> which function is
    >>> empty
    """

def empty_annotated_fn(name: str, result: List[100]) -> List[100]:
    """
    Make some bla.
    Examples
    ========
    >>> bla bla
    >>> bla bla
    x**2 + 1
    """

def note_this_has_one_valid_line(f, K):
    """
    Make some bla.
    Examples
    ========
    >>> bla bla
    >>> bla bla
    x**2 + 1
    """
    return [K.abs(coff) for coff in f]

def empty_with_both_types_of_comment(f, K):
    """
    my bla bla
    Examples
    ========
    3
    """
    # if not f:
    # else:
    #    return max(dup_abs(f, K))

SOME_VAR = 6

Expected output:

def this_function_has_stuff(f, g, K):
    """ Thisfunction has stuff in it """
    if f:
       s = 0
    else:
       u =0
    return None

def note_this_has_one_valid_line(f, K):
    """
    Make some bla.
    Examples
    ========
    >>> bla bla
    >>> bla bla
    x**2 + 1
    """
    return [K.abs(coff) for coff in f]

SOME_VAR = 6

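I would advise against trying to do this with regex.

Python's grammar is not a regular language. Even if you only care about a small part of the syntax, there are too many possible variations and corner cases, so it is not worth trying to do this with regex.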
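
As one illustration of such a corner case, here is a minimal sketch using a hypothetical snippet (not taken from the question). A naive pattern that treats every line starting with def as a function boundary is fooled by a docstring that happens to contain such a line, whereas the parser is not:

```python
import ast
import re

# Hypothetical source: one function whose docstring contains a line starting with "def".
source = '''def real_function(x):
    """
    Usage:
def looks_like_a_function(y):
    is just docstring text
    """
    return x
'''

# Naive boundary detection: every line that starts with "def " begins a function.
naive_names = re.findall(r"^def (\w+)", source, re.MULTILINE)

# The parser knows that the second "def" sits inside a string literal.
parsed_names = [
    node.name
    for node in ast.walk(ast.parse(source))
    if isinstance(node, ast.FunctionDef)
]

print(naive_names)   # ['real_function', 'looks_like_a_function']
print(parsed_names)  # ['real_function']
```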

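Instead, I suggest you explore the awesome ast module, which parses source code efficiently and lets you walk the code as a tree. You can then inspect every function definition and check whether it has any effective lines of code.

For instance, you could implement a custom NodeTransformer that removes function definitions that are effectively empty. You need to define "empty" properly, but based on your question I take it to mean any function that contains only a docstring, pass, or ... (Ellipsis).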

import ast

class Cleaner(ast.NodeTransformer):
    def __init__(self):
        self.removed = []

    def visit_FunctionDef(self, node):
        for stmt in node.body:
            # A bare `pass` does not count as code
            if isinstance(stmt, ast.Pass):
                continue
            # Neither does a docstring or a bare `...`
            if isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant):
                const = stmt.value.value
                if isinstance(const, str) or const is Ellipsis:
                    continue
            break
        else:
            # The loop never hit `break`: no effective statement was found
            self.removed.append(node.name)
            return None
        return node

    def visit_AsyncFunctionDef(self, node):
        return self.visit_FunctionDef(node)

with open("my/path/to/file.py", "r") as source:
    tree = ast.parse(source.read())

cleaner = Cleaner()
cleaner.visit(tree)
print(cleaner.removed)    # ['fuly_commented_fucntion', 'empty_annotated_fn', 'empty_with_both_types_of_comment']
print(ast.unparse(tree))  # will print your source code without those functions (ast.unparse needs Python 3.9+)

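This approach has some limitations you should be aware of:

  • ast does not work on source that contains syntax errors
  • ast.parse ignores and drops comments, so if you unparse the tree all comments will be gone
  • A function may have no implementation yet and still be referenced somewhere in the code, so removing functions based only on whether the body is empty is not safe
  • This implementation does not check nested functions. That can be done (just call self.generic_visit(node) in the visitor methods), but it raises a question: is a function whose body contains only empty nested functions itself empty?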
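
The nested-function point can be sketched as follows. This is only one possible policy, not the only reasonable one: it assumes that a function whose body holds nothing but empty nested functions is itself empty. The names `is_effectively_empty` and `NestedCleaner` are hypothetical helpers introduced for this illustration.

```python
import ast


def is_effectively_empty(body):
    """Return True if a body holds only docstrings, `pass`, or `...`."""
    for stmt in body:
        if isinstance(stmt, ast.Pass):
            continue
        if isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant):
            if isinstance(stmt.value.value, str) or stmt.value.value is Ellipsis:
                continue
        return False
    return True


class NestedCleaner(ast.NodeTransformer):
    def __init__(self):
        self.removed = []

    def visit_FunctionDef(self, node):
        # Visit children first, so empty nested functions are dropped
        # before the enclosing function is judged.
        self.generic_visit(node)
        # Assumed policy: a body left with nothing (or only docstring/pass/...)
        # makes the enclosing function empty as well.
        if is_effectively_empty(node.body):
            self.removed.append(node.name)
            return None
        return node

    visit_AsyncFunctionDef = visit_FunctionDef


source = '''
def outer_only_empty_inner():
    def inner():
        """nothing here"""

def outer_with_code():
    def inner_empty():
        pass
    return 1
'''

tree = ast.parse(source)
cleaner = NestedCleaner()
tree = cleaner.visit(tree)
print(cleaner.removed)  # ['inner', 'outer_only_empty_inner', 'inner_empty']
```

If the opposite policy is wanted, one option would be to record whether the body was empty before calling generic_visit and to decide based on that.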

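One thing you could do, instead of unparsing the tree, is to use it only to identify the names of the unimplemented functions, and then use a regex to find and remove their definitions (see, for example, @megaltron's answer below).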
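
A minimal sketch of that hybrid idea, under several assumptions that are not part of the answer: the source parses, only top-level functions are considered, every signature fits on one line, every body line is indented, and the file ends with a newline. The helper names `empty_function_names` and `remove_functions` are hypothetical.

```python
import ast
import re


def empty_function_names(source):
    """Collect names of top-level functions holding only docstring/pass/..."""
    names = []
    for node in ast.parse(source).body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for stmt in node.body:
            if isinstance(stmt, ast.Pass):
                continue
            if (isinstance(stmt, ast.Expr)
                    and isinstance(stmt.value, ast.Constant)
                    and (isinstance(stmt.value.value, str)
                         or stmt.value.value is Ellipsis)):
                continue
            break
        else:
            names.append(node.name)
    return names


def remove_functions(source, names):
    """Delete each named def line plus the indented or blank lines after it."""
    if not names:
        return source
    alternatives = "|".join(re.escape(name) for name in names)
    pattern = re.compile(
        rf"^(?:async[ \t]+)?def[ \t]+(?:{alternatives})[ \t]*\(.*\n"  # def line
        rf"(?:[ \t]+.*\n|[ \t]*\n)*",  # indented body lines and blank lines
        re.MULTILINE,
    )
    return pattern.sub("", source)


source = '''# module comment that unparsing would drop
def keep_me(x):
    # inline comment survives
    return x

def drop_me(x):
    """Only a docstring."""
    # commented_out_call(x)

SOME_VAR = 6
'''

names = empty_function_names(source)
cleaned = remove_functions(source, names)
print(names)  # ['drop_me']
print(cleaned)
```

Because the text is edited in place rather than regenerated from the tree, comments in the remaining code are kept. Blank lines directly after a removed function are removed with it.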

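OK. Here is my attempt at using regex on a Python file (e.g. data.py) to produce the expected output. It probably does not cover every conceivable Python file, but as a proof of concept it works well on the data provided. The code would need to be updated to handle import statements.

Here is my code: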

import re

# Import the python file to be processed (eg. data.py)
with open("data.py", "r") as f:
    python_file = f.read()

# A function to enumerate an iterator
def enum_iterable(iterator):
    i = 0
    for it in iterator:
        yield (i, it)
        i += 1

# Find all lines that are not within a definition
non_def_pattern = re.compile(r"(\n((?!def)(?!\s))[^\n]+)")
s = non_def_pattern.split(python_file)
str_list = list(filter(None, s))
non_definition_lines = "".join([item for item in str_list if item.startswith('\n')])
# Retain the lines that ARE within a definition
definition_lines = "\n".join([item for item in str_list if not item.startswith('\n')])
# Split the definition lines by definition
def_pattern = re.compile(r'(def[^\n]+\n)')
match = def_pattern.finditer(definition_lines)
def_dict = {}
for m, val in enum_iterable(match):
    def_dict.update({m: val})
split_def_lines = def_pattern.split(definition_lines)
# Remove blank element in first position if it exists
if split_def_lines[0] == '':
    split_def_lines.pop(0)
# Identify blocks that contain code
good_functions = ""
commBlock_pattern = re.compile(r'("{3})[^"]+("{3})')
for i, val in enumerate(split_def_lines):
    # Odd positions hold function bodies; the matching def line is at i-1
    if i%2 == 1:
        if '"""' in val:
            if len(commBlock_pattern.findall(val)) > 0:
                result = commBlock_pattern.sub("", val)
                # remove all spaces
                result = result.replace(" ", "")
                # remove lines starting with #
                result = re.sub(r'((\s+)?#[^\n]+\n)', "", result)
                # remove new lines
                result = result.replace("\n", "")
                # If there is any remaining text, then add the function to good_functions
                if len(result) > 0:
                    good_functions = good_functions + split_def_lines[i-1] + val
# Now add the non-def lines to the end of good functions
final_output = good_functions + non_definition_lines
print(final_output)

Output:

def this_function_has_stuff(f, g, K):
    """ Thisfunction has stuff in it """
    if f:
       s = 0
    else:
       u =0
    return None
def note_this_has_one_valid_line(f, K):
    """
    Make some bla.
    Examples
    ========
    >>> bla bla
    >>> bla bla
    x**2 + 1
    """
    return [K.abs(coff) for coff in f]

SOME_VAR = 6

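Use the following regex:

(def (?!fuly_commented_fucntion|empty_with_both_types_of_comment).*(?:\n.+)+)

(?!...) is a negative lookahead; here it rejects a def that is followed by one of the listed function names.

(?:\n.+)+ matches the line breaks: each newline followed by a non-empty line, so a match stops at the first blank line.

In the code below, match.group(groupNum) contains each remaining function as a string.

Full code: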

import re

#regex
regex = r"(def (?!fuly_commented_fucntion|empty_with_both_types_of_comment).*(?:\n.+)+)"
test_str = ("\n"
    "def this_function_has_stuff(f, g, K):\n"
    "    \"\"\" Thisfunction has stuff in it \"\"\"\n"
    "    if f:\n"
    "       s = 0\n"
    "    else:\n"
    "       u =0\n"
    "    return None\n\n"
    "def fuly_commented_fucntion(f, g, K):\n"
    "    \"\"\"\n"
    "    remove this empty function.\n"
    "    Examples\n"
    "    ========\n"
    "    >>> which function is\n"
    "    >>> empty\n"
    "    \"\"\"\n\n"
    "def note_this_has_one_valid_line(f, K):\n"
    "    \"\"\"\n"
    "    Make some bla.\n"
    "    Examples\n"
    "    ========\n"
    "    >>> bla bla\n"
    "    >>> bla bla\n"
    "    x**2 + 1\n"
    "    \"\"\"\n"
    "    return [K.abs(coff) for coff in f]\n\n"
    "def empty_with_both_types_of_comment(f, K):\n"
    "    \"\"\"\n"
    "    my bla bla\n"
    "    Examples\n"
    "    ========\n"
    "    3\n"
    "    \"\"\"\n"
    "    # if not f:\n"
    "    # else:\n"
    "    #    return max(dup_abs(f, K))\n\n"
    "SOME_VAR = 6")
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    # Groups are numbered from 1; group 1 is the whole function text
    for groupNum in range(0, len(match.groups())):
        print('==============your methods=====================')
        groupNum = groupNum + 1
        print(match.group(groupNum))
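
One caveat with the lookahead as written: it rejects any function whose name merely starts with one of the listed names. A possible tweak, assuming each name is immediately followed by an opening parenthesis, is to include \( inside the lookahead. The function names in this sketch are hypothetical:

```python
import re

text = (
    "def empty(a):\n"
    "    pass\n"
    "\n"
    "def empty_looking_but_real(a):\n"
    "    return a\n"
)

# Lookahead in the style used above: rejects every name starting with "empty".
loose = r"(def (?!empty).*(?:\n.+)+)"
# Variant that only rejects the exact name "empty".
strict = r"(def (?!empty\().*(?:\n.+)+)"

loose_matches = re.findall(loose, text)
strict_matches = re.findall(strict, text)

print(loose_matches)   # []
print(strict_matches)  # ['def empty_looking_but_real(a):\n    return a']
```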

