Append the next line to the previous line

I have a 5 GB file in this format:

dn: cn
changetype: add
objectclass: ine
hghsfgdsdsdsd
mail: surcom
surname: satya2
givenname: surya2
cn: surya2

dn: cn
changetype: add
objectclass: inetOrgPerson
surname: sa
sddsds
givenname: s
cn: sur

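As you can see, the objectclass and surname values spill over onto the next line. I want each of them on a single line. The code below does that, but it throws a memory error on large files. Could you change this code so that it works efficiently for large files?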

import re
pattern = re.compile(r"(\w+):(.*)")
with open("uservolvo2.ldif", "r") as f:
    new_lines = []
    for line in f:
        if line.endswith('\n'):
            line = line[:-1]
        if line == "":
            new_lines.append(line)
            continue
        l = pattern.search(line)
        if l:
            new_lines.append(line)
        else:
            new_lines[-1] += line
with open("user_modified.ldif", "a") as f:
    f.write("\n".join(new_lines))
    f.write("\n\n")

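Instead of building one huge string by joining new_lines, which may be what causes the memory error, you could iterate over the list and write it out line by line: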

with open("file_modified.txt", "a") as f:
    for line in new_lines:
        f.write(line + '\n')

I don't know how efficient a regex-based solution would be, and I haven't benchmarked it, but here is one possible approach, using re.sub over the whole file:

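import re

# "text" is used instead of "input" to avoid shadowing the built-in
text = """objectclass: ine
hghsfgdsdsdsd
mail: surcom
surname: satya2"""
# raw strings keep \s, \S and the \1, \2, \3 backreferences intact
output = re.sub(r'objectclass:(\s*\S+)(.*?)surname:(\s*\S+)',
                r"objectclass:\1\nsurname:\3\2", text, flags=re.DOTALL)
print(output)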

This prints:

objectclass: ine
surname: satya2
hghsfgdsdsdsd
mail: surcom

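The logic above matches an objectclass: line, followed by everything up to the surname: line. We then piece the text back together in the order you want, with surname placed immediately after objectclass.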
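Running re.sub over the whole file means holding the full 5 GB text in memory. One way to bound memory, sketched below, is to apply the same substitution one entry at a time. This assumes entries are separated by blank lines, as in the sample. The helper names `iter_records` and `reorder` are made up for illustration.

```python
import re

# Same pattern as above, compiled once.
PATTERN = re.compile(r"objectclass:(\s*\S+)(.*?)surname:(\s*\S+)", flags=re.DOTALL)


def iter_records(lines):
    """Yield each blank-line-separated entry as one string (hypothetical helper)."""
    record = []
    for line in lines:
        if line.strip() == "":
            # A blank line closes the current entry, if there is one.
            if record:
                yield "".join(record)
                record = []
        else:
            record.append(line)
    if record:
        yield "".join(record)


def reorder(record):
    """Apply the substitution to a single entry (hypothetical helper)."""
    return PATTERN.sub(r"objectclass:\1\nsurname:\3\2", record)
```

Passing an open file object to `iter_records` keeps only one entry in memory at a time. Each reordered entry can be written to the output file followed by a blank line.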

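I think the most efficient approach is to create another, empty text file (modified.txt), iterate over the original text file, and append each processed line to the new file.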

with open('file.txt', 'r') as file, open('modified.txt', 'a') as modified:
    # iterating over the file object reads one line at a time
    for line in file:
        # do processing
        modified.write(line)