Python: handling uneven columns in rows

I am working with data that has thousands of rows but an uneven number of columns, as shown below:

AB  12   43   54
DM  33   41   45   56   33   77  88
MO  88   55   66   32   34 
KL  10   90   87   47   23  48  56  12

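First, I want to read the data into a list or array and find the length of the longest row.
Then I will pad the shorter rows with zeros so that they match the longest row, so that I can iterate over them as a 2D array.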

I looked at a couple of other similar questions, but none of them solved this problem.

I believe there is a way to do this in Python. Can anyone help me?

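I don't see any easier way to figure out the maximum row length than to make one pass over the data and find it. Then we build the 2D array in a second pass.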

from __future__ import print_function
import numpy as np
from itertools import chain
data = '''AB 12 43 54
DM 33 41 45 56 33 77 88
MO 88 55 66 32 34
KL 10 90 87 47 23 48 56 12'''
max_row_len = max(len(line.split()) for line in data.splitlines())
def padded_lines():
    for uneven_line in data.splitlines():
        line = uneven_line.split()
        line += ['0']*(max_row_len - len(line))
        yield line
# I will get back to the line below shortly, it unnecessarily creates the array
# twice in memory:
array = np.array(list(chain.from_iterable(padded_lines())), np.dtype(object))
array.shape = (-1, max_row_len)
print(array)

This prints:

[['AB' '12' '43' '54' '0' '0' '0' '0' '0']
 ['DM' '33' '41' '45' '56' '33' '77' '88' '0']
 ['MO' '88' '55' '66' '32' '34' '0' '0' '0']
 ['KL' '10' '90' '87' '47' '23' '48' '56' '12']]

The code above is inefficient in the sense that it creates the array twice in memory. I will get back to that; I think I can fix it.

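However, numpy arrays are supposed to be homogeneous. You want to put strings (the first column) and integers (all the other columns) into the same 2D array. I still think you are on the wrong track here and should rethink the problem: choose another data structure or organize your data differently. I cannot help you with that, since I don't know how you want to use the data.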
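As an illustration of "organize the data differently", here is a minimal sketch that keeps the labels in a plain list and only the numbers in a homogeneous integer array. The function name `load_labeled_rows` and the choice of `int64` are assumptions made for this example, not something taken from the question; whether this layout fits depends on how the data is used afterwards.

```python
import numpy as np

def load_labeled_rows(lines, fill=0):
    """Split each row into a label and integer values (hypothetical helper).

    Returns (labels, values), where values is a zero-padded 2D int array.
    """
    labels = []
    rows = []
    for line in lines:
        tokens = line.split()
        if not tokens:          # skip blank lines
            continue
        labels.append(tokens[0])
        rows.append([int(t) for t in tokens[1:]])
    width = max(len(r) for r in rows) if rows else 0
    # Allocate the padded array once, then copy each row into place.
    values = np.full((len(rows), width), fill, dtype=np.int64)
    for i, r in enumerate(rows):
        values[i, :len(r)] = r
    return labels, values

data = """AB 12 43 54
DM 33 41 45 56 33 77 88
MO 88 55 66 32 34
KL 10 90 87 47 23 48 56 12"""
labels, values = load_labeled_rows(data.splitlines())
```

With this layout, `values` supports numeric operations directly, and `labels[i]` names row `i`.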

(I will get back to the issue of the array being created twice shortly.)

As promised, here is the solution to the efficiency issue. Note that my concern was memory consumption.

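    import numpy as np
    from itertools import chain

    def main():
        # First pass: number of columns in the longest row
        with open('/tmp/input.txt') as f:
            max_row_len = max(len(line.split()) for line in f)
        # Second pass: length of the longest token, needed for the fixed-width dtype
        with open('/tmp/input.txt') as f:
            str_len_max = len(max(chain.from_iterable(line.split() for line in f), key=len))
        def padded_lines():
            with open('/tmp/input.txt') as f:
                for uneven_line in f:
                    line = uneven_line.split()
                    line += ['0']*(max_row_len - len(line))
                    yield line
        fmt = '|S%d' % str_len_max
        # fromiter fills the array directly, without building an intermediate list
        array = np.fromiter(chain.from_iterable(padded_lines()), np.dtype(fmt))
        array.shape = (-1, max_row_len)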

This code could be made nicer, but I will leave that up to you.
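One possible tidy-up, offered only as a sketch: the two scanning passes can be merged so the file is read once to get both the longest row and the longest token. The helper name `scan_dimensions` is made up for this example, and the sample lines stand in for the real file.

```python
def scan_dimensions(lines):
    """Return (max tokens per row, max token length) in a single pass.

    Hypothetical helper; `lines` can be an open file or any iterable of strings.
    """
    max_row_len = 0
    str_len_max = 0
    for line in lines:
        tokens = line.split()
        max_row_len = max(max_row_len, len(tokens))
        for token in tokens:
            str_len_max = max(str_len_max, len(token))
    return max_row_len, str_len_max

sample_lines = [
    "AB 12 43 54",
    "DM 33 41 45 56 33 77 88",
    "MO 88 55 66 32 34",
    "KL 10 90 87 47 23 48 56 12",
]
dims = scan_dimensions(sample_lines)
```

Passing an open file object instead of `sample_lines` keeps memory usage flat, since only one line is held at a time.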

Memory consumption measured with memory_profiler on a randomly generated input file with 1,000,000 rows, with row lengths uniformly distributed between 1 and 20:

Line #    Mem usage    Increment   Line Contents
================================================
     5   23.727 MiB    0.000 MiB   @profile
     6                             def main():
     7                                 
     8   23.727 MiB    0.000 MiB       with open('/tmp/input.txt') as f:
     9   23.727 MiB    0.000 MiB           max_row_len = max(len(line.split()) for line in f)
    10                                     
    11   23.727 MiB    0.000 MiB       with open('/tmp/input.txt') as f:
    12   23.727 MiB    0.000 MiB           str_len_max = len(max(chain.from_iterable(line.split() for line in f), key=len))
    13                                 
    14   23.727 MiB    0.000 MiB       def padded_lines():
    15                                     with open('/tmp/input.txt') as f:
    16   62.000 MiB   38.273 MiB               for uneven_line in f:
    17                                             line = uneven_line.split()
    18                                             line += ['0']*(max_row_len - len(line))
    19                                             yield line
    20                                 
    21   23.727 MiB  -38.273 MiB       fmt = '|S%d' % str_len_max
    22                                 array = np.fromiter(chain.from_iterable(padded_lines()), np.dtype(fmt))
    23   62.004 MiB   38.277 MiB       array.shape = (-1, max_row_len)

Using the code from eumiro's answer, with the same input file:

Line #    Mem usage    Increment   Line Contents
================================================
     5   23.719 MiB    0.000 MiB   @profile
     6                             def main():
     7   23.719 MiB    0.000 MiB       with open('/tmp/input.txt') as f:
     8  638.207 MiB  614.488 MiB           arr = np.array(list(it.izip_longest(*[line.split() for line in f], fillvalue='0'))).T

Comparing the memory increments: my updated code consumes about 16 times less memory than eumiro's (614.488/38.273 is approximately 16).

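As for speed: my updated code runs in 3.321 seconds on this input and eumiro's code in 5.687 seconds, that is, my code is about 1.7 times faster on my machine. (Your mileage may vary.)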

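If efficiency is your primary concern (as suggested by your comment "Hi, I guess this is more efficient", after which you changed the accepted answer), then I am afraid you accepted the less efficient solution.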

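Don't get me wrong: eumiro's code is really concise, and I certainly learned a lot from it. If efficiency were not my primary concern, I would go with eumiro's solution too.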
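The benchmark input file itself is not included here. For anyone who wants to repeat the measurement, a comparable file could be produced along the following lines. This is a sketch under assumptions: the names `random_rows` and `write_input`, the seed, and the value range 0 to 99 are all invented for the example; only the row count and the 1 to 20 row-length range come from the description above.

```python
import random

def random_rows(n_rows, max_len=20, seed=0):
    """Yield n_rows lines whose token count is uniform in 1..max_len.

    Token values (integers 0..99) are an assumption for illustration.
    """
    rng = random.Random(seed)
    for _ in range(n_rows):
        row_len = rng.randint(1, max_len)   # inclusive on both ends
        yield ' '.join(str(rng.randint(0, 99)) for _ in range(row_len))

def write_input(path, n_rows):
    """Write the generated rows to `path`, one row per line."""
    with open(path, 'w') as f:
        for row in random_rows(n_rows):
            f.write(row + '\n')

# A small preview; write_input('/tmp/input.txt', 1000000) would create the full file.
preview = list(random_rows(100))
```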

You can use itertools.izip_longest to find the longest row for you:

import itertools as it
import numpy as np
with open('filename.txt') as f:
    arr = np.array(list(it.izip_longest(*[line.split() for line in f], fillvalue='0'))).T

With a three-row sample file, arr is now:

array([['a', '1', '2', '0'],
       ['b', '3', '4', '5'],
       ['c', '6', '0', '0']], 
      dtype='|S1')
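The snippet above is Python 2 code; in Python 3 the function is named `itertools.zip_longest`. A minimal Python 3 sketch of the same idea follows. The `io.StringIO` object is a stand-in for `filename.txt`, and its contents are an assumption chosen to match the output shown above.

```python
import io
import itertools as it
import numpy as np

# Stand-in for open('filename.txt'); the contents are assumed for illustration.
sample = io.StringIO("a 1 2\nb 3 4 5\nc 6\n")

rows = [line.split() for line in sample]
# zip_longest transposes the rows and pads the short ones with '0';
# .T transposes back so that each input line is one array row again.
arr = np.array(list(it.zip_longest(*rows, fillvalue='0'))).T
print(arr)
```

Note that in Python 3 the resulting array holds unicode strings rather than byte strings, so the reported dtype differs from the `|S1` shown above.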
