Generating lines with backslash continuations joined, excluding comment blocks


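I'm currently trying to create a generator function that yields one line of a file at a time, ignoring comment blocks and joining any line that ends with a backslash to the line that follows it. So, for this text: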

# this entire line is a comment - don't include it in the output
<line0>
# this entire line is a comment - don't include it in the output
<line1># comment
<line2>
# this entire line is a comment - don't include it in the output
<line3.1 \
line3.2 \
line3.3>
<line4.1 \
line4.2>
<line5># comment \
# more comment1 \
more comment2>
<line6>
# here's a comment line continued to the next line \
this line is part of the comment from the previous line

The desired output is:

<line0>
<line1>
<line2>
<line3.1 line3.2 line3.3>
<line4.1 line4.2>
<line5>
<line6>

Here is my code so far:

import sys
from typing import Iterator

try:
    file_name = open('path/to/file.txt', 'r')
except FileNotFoundError:
    print("File could not be found. Please check spelling of file name!")
    sys.exit()

# Read lines in file
Lines = file_name.read().splitlines()

class FileLineGen:
    def get_filelines(path: str) -> Iterator[str]:
        for line in Lines:
            # Exclude a line if it starts with #
            if line.startswith("#"):
                line.replace(line, "")
                continue
            if "#" in line:
                # Split at where the # is located
                line.split('#')
                # Yield everything before the comment block
                yield line.split('#')[0]
                continue
            if line.endswith('\\'):
                # Yield everything but the backslash
                line = line[:-1]
                yield line
                continue
            # Yield the line in all other cases
            else:
                yield line

gen = FileLineGen.get_filelines(file_name)
for line in gen:
    print(line)

This produces the following output:

<line0>
<line1>
<line2>
<line3.1 
line3.2 
line3.3>
<line4.1 
line4.2>
<line5>
more comment2>
<line6>
this line is part of the comment from the previous line

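So I've been able to remove the backslashes, but every way of concatenating the lines that I've tried has failed. The ideal logic would be to join a backslash-terminated line with the next line first, so that if the joined line starts with a #, the whole thing is excluded automatically (and the continued comment text never reaches the output).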

Edit: new output after opening the file with a with block in the FileLineGen class:

with open('/path/to/file.txt') as f:
    for line in my_generator(f):
        print(line)

<line0>
<line1>
<line2>
<line3.1 line3.2 line3.3>
<line4.1 line4.2>
<line5>
<line6>

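You have two operators, # and \. The latter takes precedence over the former, which means you should check for it and handle it first. Here is a simple way to do that, using a list as a buffer to build up lines: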

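def my_generator(f):
    buffer = []
    for line in f:
        line = line.rstrip('\n')
        # A trailing backslash means the line continues: stash it and move on
        if line.endswith('\\'):
            buffer.append(line[:-1])
            continue
        # Join any buffered pieces with the line that ends the continuation
        line = ''.join(buffer) + line
        buffer = []
        # Only now strip the comment, so a continued comment is dropped whole
        if '#' in line:
            line = line[:line.index('#')]
        if line:
            yield line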

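The benefit of wrapping an iterable of lines and relying on duck typing is that you can pass in anything that behaves like a container of strings, not just a text file: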

text = r"""# this entire line is a comment - don't include it in the output
<line0>
# this entire line is a comment - don't include it in the output
<line1># comment
<line2>
# this entire line is a comment - don't include it in the output
<line3.1 \
line3.2 \
line3.3>
<line4.1 \
line4.2>
<line5># comment \
# more comment1 \
more comment2>
<line6>
# here's a comment line continued to the next line \
this line is part of the comment from the previous line"""
for line in my_generator(text.splitlines()):
    print(line)

The result is as expected:

<line0>
<line1>
<line2>
<line3.1 line3.2 line3.3>
<line4.1 line4.2>
<line5>
<line6>

Another way to write the loop is

print('\n'.join(my_generator(text.splitlines())))
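
One edge case the buffer approach leaves open is input whose very last line ends with a backslash: the buffered text is never joined to anything, so it is silently dropped. The following is a minimal sketch of a variant that flushes the buffer at the end of the input. The name iter_logical_lines is hypothetical, and emitting the dangling text (rather than raising an error) is an assumption about the desired behavior, not something the question specifies.

```python
from typing import Iterable, Iterator


def iter_logical_lines(lines: Iterable[str]) -> Iterator[str]:
    """Join backslash continuations first, then strip '#' comments."""
    buffer = []
    for line in lines:
        line = line.rstrip('\n')
        if line.endswith('\\'):
            # Continuation: keep the text without the backslash and read on
            buffer.append(line[:-1])
            continue
        line = ''.join(buffer) + line
        buffer = []
        # Drop everything from the first '#' onward
        line = line.split('#', 1)[0]
        if line:
            yield line
    # Assumption: a dangling continuation at end of input is still emitted
    if buffer:
        line = ''.join(buffer).split('#', 1)[0]
        if line:
            yield line
```

Because it only iterates over its argument, it accepts an open file, a list of strings, or io.StringIO alike.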

I'd suggest using the re.sub approach.

import re

def line_gen(text: str):
    text = re.sub(r"\s+\\\n", '', text)    # Remove any \ break (and the whitespace before it)
    text = re.sub(r"#(.*)\n", '\n', text)  # Remove any comment
    # If the last line is a comment it won't have a final \n.
    # We have to remove it as well.
    text = re.sub(r"#.*", '', text)
    # rsplit() with no arguments splits on any whitespace, which gets rid of
    # blank lines but would also break up a line that contains spaces.
    for line in text.rsplit():
        yield line

with open("/tmp/data.txt") as f:
    text = f.read()
for line in line_gen(text):
    print(line)

Contents of data.txt:

# this entire line is a comment - don't include it in the output
<line0>
# this entire line is a comment - don't include it in the output
<line1># comment
<line2>
# this entire line is a comment - don't include it in the output
<line3.1 \
line3.2 \
line3.3>
<line4.1 \
line4.2>
<line5># comment \
# more comment1 \
more comment2>
<line6>
# here's a comment line continued to the next line \
this line is part of the comment from the previous line

Result:

<line0>
<line1>
<line2>
<line3.1line3.2line3.3>
<line4.1line4.2>
<line5>
<line6>

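
The re.sub result above joins the continued pieces without the separating space (<line3.1line3.2line3.3>), which differs from the output the question asks for, and splitting on whitespace would break up any line that itself contains spaces. Below is a minimal sketch of a regex variant that removes only the backslash and the newline and then splits on line boundaries. The name line_gen_keep_spaces is hypothetical, and the sketch assumes the backslash is the last character before the newline.

```python
import re
from typing import Iterator


def line_gen_keep_spaces(text: str) -> Iterator[str]:
    # Join continuations: remove each backslash directly followed by a newline
    text = re.sub(r"\\\n", "", text)
    # Strip comments: '.' does not match a newline, so this stops at end of line
    text = re.sub(r"#.*", "", text)
    # Split on line boundaries only, skipping lines that are now empty
    for line in text.splitlines():
        if line.strip():
            yield line
```

With the data.txt contents shown earlier, the continued lines should come out with their inner spaces preserved, as in the desired output of the question.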