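We have a 100 MB pipe-delimited file with 5 columns (4 delimiters) per line. However, in a few lines the second column contains an extra pipe, so those lines have 5 delimiters in total.
For example, of the 4 lines below, the 3rd is problematic because it has an extra pipe.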
1|B|3|D|5
A|1|2|34|5
D|This is a |text|3|5|7
B|4|5|5|6
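Is there a way to remove the extra pipe from the second column in lines where the delimiter count is 5? After the correction, the file should look like this: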
1|B|3|D|5
A|1|2|34|5
D|This is a text|3|5|7
B|4|5|5|6
Please note that the file is 100 MB. Any help is appreciated.
Source: my_file.txt
1|B|3|D|5
A|1|2|34|5
D|This is a |text|3|5|7
B|4|5|5|6
E|1 |9 |2 |8 |Not| a |text|!!!|3|7|4
Code:
# On Python 3.10+, the two open() calls can be written as a parenthesized context manager:
# https://docs.python.org/3.10/whatsnew/3.10.html#parenthesized-context-managers
with open('./my_file.txt') as file_src, open('./my_file_parsed.txt', 'w') as file_dst:
    # Iterate over the file object directly so the 100 MB file is read one line at a time.
    for line in file_src:
        # Split the line on the character '|'
        line_list = line.split('|')
        if len(line_list) <= 5:
            # The column count is not exceeded, so write the original line as is.
            file_dst.write(line)
        else:
            # The column count is exceeded; work out how many columns must be merged.
            to_merge_columns_count = (len(line_list) - 5) + 1
            # Merge the columns from index 1 up to index x, which covers every column to be merged.
            merged_column = "".join(line_list[1:1 + to_merge_columns_count])
            # Replace all the items from index 1 to index x with the single merged column.
            line_list[1:1 + to_merge_columns_count] = [merged_column]
            # Write the updated line (the last field still carries the trailing newline).
            file_dst.write("|".join(line_list))
Result: my_file_parsed.txt
1|B|3|D|5
A|1|2|34|5
D|This is a text|3|5|7
B|4|5|5|6
E|1 9 2 8 Not a text!!!|3|7|4
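As a possible variation on the same idea, the merge can be done per line with `str.split` and `str.rsplit`, which avoids the index arithmetic. This is only a sketch: the function name `merge_second_column` and its parameters are hypothetical, and it assumes, as in the question, that extra pipes only ever appear in the second column.

```python
def merge_second_column(line: str, sep: str = "|", columns: int = 5) -> str:
    """Remove extra separators from the second column of one line (hypothetical helper)."""
    # A well-formed line has exactly (columns - 1) separators; leave it untouched.
    if line.count(sep) <= columns - 1:
        return line
    # Peel off the first column from the left.
    first, rest = line.split(sep, 1)
    # Peel off the last (columns - 2) columns from the right;
    # whatever remains in front of them is the second column.
    second, *tail = rest.rsplit(sep, columns - 2)
    return sep.join([first, second.replace(sep, ""), *tail])
```

Each line read from the source file could be passed through this function before being written to the destination file; the trailing newline stays attached to the last field.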
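A simple regex pattern like this works on Python 3.7.3:
import re
# Matches a line with exactly 5 pipes and captures the extra one (the 2nd pipe).
# Fields may contain only word characters and spaces.
bad_pipe_re = re.compile(r"[ \w]+\|[ \w]+(\|)[ \w]+\|[ \w]+\|[ \w]+\|[ \w]+\n")
with open("input", "r") as fp_1, open("output", "w") as fp_2:
    # Read line by line; lines that do not match are copied unchanged.
    for line in fp_1:
        mo = bad_pipe_re.fullmatch(line)
        if mo is not None:
            # Cut the captured pipe out of the line.
            line = line[:mo.start(1)] + line[mo.end(1):]
        fp_2.write(line)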
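To try the regex approach without touching any files, the same pattern can be applied to in-memory lines. This is a sketch under stated assumptions: the helper name `drop_extra_pipe` is hypothetical, and the optional trailing newline (`\n?`) is an addition so that a final line without a newline is also handled.

```python
import re

# Exactly 5 pipes; the 2nd one is captured as the extra pipe.
# Assumption: "\n?" makes the trailing newline optional.
BAD_PIPE_RE = re.compile(r"[ \w]+\|[ \w]+(\|)[ \w]+\|[ \w]+\|[ \w]+\|[ \w]+\n?")

def drop_extra_pipe(line: str) -> str:
    """Remove the single extra pipe if the whole line matches the pattern."""
    mo = BAD_PIPE_RE.fullmatch(line)
    if mo is None:
        # Well-formed lines, and lines the pattern cannot describe, pass through unchanged.
        return line
    return line[:mo.start(1)] + line[mo.end(1):]
```

Note that this pattern only covers lines with exactly one extra pipe whose fields consist of word characters and spaces. A line such as `E|1 |9 |2 |8 |Not| a |text|!!!|3|7|4` does not match and is left as is.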