How to avoid creating unnecessary lists



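I keep running into situations where I pull some information from a file (or wherever) and then have to massage the data into its final form through several steps. For example: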

def insight_pull(file):
    with open(file) as in_f:
        lines = in_f.readlines()
        dirty = [line.split('    ') for line in lines]
        clean = [i[1] for i in dirty]
        cleaner = [[clean[i],clean[i + 1]] for i in range(0, len(clean),2)]
        cleanest = [i[0].split() + i[1].split() for i in cleaner]

        with open("Output_File.txt", "w") as out_f:
            out_f.writelines(' '.join(i) + '\n' for i in cleanest)

Walking through the example above:

    # Pull raw data from file, splitting each line on '    ' (four spaces).
    dirty = [line.split('    ') for line in lines]
    # Select the 2nd element from each nested list.
    clean = [i[1] for i in dirty]
    # Couple every 2nd element with its predecessor into a new list.
    cleaner = [[clean[i],clean[i + 1]] for i in range(0, len(clean),2)]
    # Split each entry in cleaner into the final formatted list.
    cleanest = [i[0].split() + i[1].split() for i in cleaner]

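Since I can't fit all of the edits into a single line or loop (each edit depends on the one before it), is there a better way to structure code like this?

Apologies if the question is a bit vague. Any input is much appreciated.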

Generator expressions

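You are right not to want to create multiple lists. Each of your list comprehensions builds an entire new list, which wastes memory, and you then loop over every one of those lists!

@vpfb's idea of using generators is a good solution if you reuse the generators elsewhere in your code. If you don't need to reuse them, use generator expressions.

Generator expressions are lazy, just like generators, so when they are chained together the data is looped over only once, at the end, when writelines is called.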
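
To make the laziness visible, here is a minimal sketch. The `traced` helper and the sample lines are made up for illustration; they stand in for the file object and are not part of the original code. The log shows that building the chain reads nothing, and that the single pass only happens when the last stage is consumed.

```python
def traced(items, log):
    """Yield each item unchanged, recording the moment it is actually consumed."""
    for item in items:
        log.append(item)
        yield item


log = []
lines = ["a    1 2", "b    3 4"]  # hypothetical sample data

source = traced(lines, log)
fields = (line.split('    ') for line in source)
second = (parts[1] for parts in fields)

# Building the chain has not pulled anything from the source yet.
log_before = list(log)

# Consuming the last stage drives one pass through every stage.
result = list(second)
```

After `list(second)` runs, `log` holds both lines, each read exactly once.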

def insight_pull(file):
    with open(file) as in_f:
        dirty = (line.split('    ') for line in in_f)    # Combine with next
        clean = (i[1] for i in dirty)
        cleaner = (pair for pair in zip(clean,clean))    # Redundantly silly
        cleanest = (i[0].split() + i[1].split() for i in cleaner)
        # Don't build a single (possibly huge) string with join
        with open("Output_File.txt", "w") as out_f:
            out_f.writelines(' '.join(i) + '\n' for i in cleanest)

The version above maps directly onto your code. You can go further:

def insight_pull(file):
    with open(file) as in_f:
        clean = (line.split('    ')[1] for line in in_f)
        cleaner = zip(clean,clean)
        cleanest = (i[0].split() + i[1].split() for i in cleaner)
        with open("Output_File.txt", "w") as out_f:
            for line in cleanest:
                # each item is a list of words, so join it before writing
                out_f.write(' '.join(line) + '\n')
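
The `zip(clean, clean)` step works because both arguments are the same iterator, so each tuple takes two consecutive items from it. A small sketch with made-up values, assuming the input is a one-shot iterator (for a plain list you would need to wrap it in `iter()` first):

```python
fields = iter(["a b", "c d", "e f", "g h", "i j"])  # hypothetical sample data

# zip pulls from the same iterator twice per tuple, pairing neighbours.
pairs = list(zip(fields, fields))

# With an odd number of items the last one is silently dropped,
# whereas the index-based list version would raise an IndexError.
```

Here `pairs` ends up as `[('a b', 'c d'), ('e f', 'g h')]`.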

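I assume from your example that only the cleanest list is of any real value to you, and that the rest are just intermediate steps that can be discarded without a second thought.

If that is the case, why not reuse the same variable for every intermediate step, so that you aren't holding several lists in memory at once?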

def insight_pull(file):
    with open(file) as in_f:
        my_list = in_f.readlines()
        my_list = [line.split('    ') for line in my_list]
        my_list = [i[1] for i in my_list]
        my_list = [[my_list[i],my_list[i + 1]] for i in range(0, len(my_list),2)]
        my_list = [i[0].split() + i[1].split() for i in my_list]

    with open("Output_File.txt", "w") as out_f:
        out_f.writelines(' '.join(i) + '\n' for i in my_list)

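If performance is your concern, you are looking for generators. Generators are much like lists, but they are evaluated lazily, meaning each element is produced only when it is needed. In the following sequence, for example, I never actually create three complete lists, and each element is evaluated only once. This is just an illustration of generators (as I understand it, your code is only an example of the kind of problem you run into, not one specific problem):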

# All even values from 2-18
even = (i*2 for i in range(1, 10))
# Only those divisible by 3
multiples_of_3 = (val for val in even if val % 3 == 0)
# And finally, we want to evaluate the remaining values as hex
hexes = [hex(val) for val in multiples_of_3]
# output: ['0x6', '0xc', '0x12']

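The first two expressions are generators, and the last one is an ordinary list comprehension. Because no intermediate lists are created, this saves a lot of memory when there are many steps. Note that generators cannot be indexed and can be evaluated only once (they are just streams of values).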
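
A short sketch of those two limitations, using a made-up generator of squares:

```python
squares = (n * n for n in range(4))

first_pass = list(squares)   # consumes the generator
second_pass = list(squares)  # nothing is left the second time

indexable = True
try:
    squares[0]               # generators do not support indexing
except TypeError:
    indexable = False
```

If you need to index the values or loop over them more than once, materialise them with `list(...)` at that point and accept the memory cost.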

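To achieve your goal, I suggest pipeline processing. I found an article that explains the technique: Generator Pipelines.

Here is my attempt to convert your code directly into a pipeline. The code is untested (we have no data to test it with) and may contain errors.

The leading f in the function names stands for filter.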

def fromfile(name):
    # see comments
    with open(name) as in_f:
        for line in in_f:
            yield line
def fsplit(pp):
    for line in pp: 
        yield line.split('    ')
def fitem1(pp):
    for item in pp: 
        yield item[1]
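def fpairs(pp):
    # edited
    pp = iter(pp)  # make sure next() works even if pp is a list
    for x in pp:
        try:
            yield [x, next(pp)]
        except StopIteration:
            # odd number of items: drop the unpaired last one
            break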
def fcleanup(pp):
    for i in pp: 
        yield i[0].split() + i[1].split()
# NAME is the path of the input file; it is not defined in this snippet
pipeline = fcleanup(fpairs(fitem1(fsplit(fromfile(NAME)))))
output = list(pipeline)

For real-world usage, I would merge the first 3 filters into one, and likewise the next 2.
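
One possible shape for that merged version is sketched below. The names `second_fields` and `joined_pairs` are made up for illustration, and like the pipeline above this has not been run against the real data.

```python
def second_fields(name):
    """Merge of fromfile, fsplit and fitem1: yield the 2nd field of each line."""
    with open(name) as in_f:
        for line in in_f:
            yield line.split('    ')[1]


def joined_pairs(fields):
    """Merge of fpairs and fcleanup: pair neighbours, then split into words."""
    fields = iter(fields)  # one shared iterator, so zip pairs consecutive items
    for first, second in zip(fields, fields):
        yield first.split() + second.split()
```

It would be used the same way as before, for example `output = list(joined_pairs(second_fields(NAME)))`, with `NAME` again standing for the input file path.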
