Optimizing a paste loop

I have 1000 files in /myfolder. Each file is ~8 MB, with 500K lines and 2 columns, like this:

file1.txt
Col1 Col2
a 0.1
b 0.3
c 0.2
...
file2.txt
Col1 Col2
a 0.8
b 0.9
c 0.4
...

I need to remove the first column (Col1) from every file and paste all the files side by side. The order of the files does not matter.

I have the following code running, and it has already been running for 4 hours... Is there any way to speed it up?

for i in /myfolder/*; do 
paste all.txt <(cut -f2 ${i}) > temp.txt; 
mv temp.txt all.txt; 
done

Expected output:

all.txt
Col2 Col2 ...
0.1 0.8 ... 
0.3 0.9 ...
0.2 0.4 ...
... ... ...

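I think this task becomes much easier if you iterate over the files in parallel. For each line position, you only need to cut off the first field of every file's line and then print the concatenation of the results.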

In Python, this would be something like:

import glob
# Open all *.txt files in parallel
files = [open(fn, 'r') for fn in glob.glob('*.txt')]
while True:
    # Try reading one line from each file, collecting into 'allLines'
    try:
        allLines = [next(f).strip() for f in files]
    except StopIteration:
        break
    # Chop off everything up to (including) the first space for each line
    secondColumns = (l[l.find(' ') + 1:] for l in allLines)
    # Print the columns, interspersing space characters
    print(' '.join(secondColumns))

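(Alas, turning allLines into a generator does not seem to work: the next calls then do not raise a StopIteration error that the except clause can catch. This is probably because a generator is evaluated lazily, so the next calls would only run later, outside the try block, where the StopIteration simply ends the generator; since Python 3.7 it is turned into a RuntimeError instead.)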
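The same idea can also be written with zip(), which avoids handling StopIteration by hand. This is only a sketch under a few assumptions: the function name paste_second_columns is made up for illustration, every line is assumed to hold at least two whitespace-separated fields, and the per-process open-file limit (see ulimit -n) is assumed to be high enough to keep all inputs open at once.

```python
from contextlib import ExitStack


def paste_second_columns(paths, out):
    """Write the second whitespace-separated field of every file, side by side.

    Hypothetical helper: 'paths' is a list of input file names and 'out' is
    any writable text stream.
    """
    with ExitStack() as stack:
        # Keep every input open at once; ExitStack closes them all on exit.
        files = [stack.enter_context(open(p)) for p in paths]
        # zip() pulls one line from each file per step and stops as soon as
        # the shortest file is exhausted.
        for lines in zip(*files):
            # Assumes each line has at least two fields.
            out.write(' '.join(line.split()[1] for line in lines) + '\n')
```

It could be called as paste_second_columns(sorted(glob.glob('/myfolder/*')), sys.stdout), after importing glob and sys. Each input is read exactly once, whereas the shell loop in the question rewrites the growing all.txt on every iteration.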

This is not a complete answer, but you may get somewhere if you try the following. For example, to merge 4 files on their first column:

join -1 1 -2 1 temp1 temp2 | join - temp3|join - temp4

So you could write a script that first builds this command from all the files and then executes it. Hope this helps.
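Building that command could look roughly like the sketch below. The function name build_join_command is hypothetical, and the sketch only assembles the command string without running it. Keep in mind that join expects its inputs to be sorted on the join field, and that its output still contains the key column, which would have to be cut off afterwards.

```python
import shlex


def build_join_command(paths):
    """Return a shell pipeline that joins all files on their first column.

    Hypothetical helper; it only builds the command string.
    """
    if len(paths) < 2:
        raise ValueError('need at least two files to join')
    # Quote file names so that spaces or shell metacharacters stay harmless.
    quoted = [shlex.quote(p) for p in paths]
    # The first two files are joined directly ...
    stages = ['join -1 1 -2 1 {} {}'.format(quoted[0], quoted[1])]
    # ... and every further file is joined against the previous stage ("-").
    stages.extend('join - {}'.format(q) for q in quoted[2:])
    return ' | '.join(stages)
```

For the four files above, build_join_command(['temp1', 'temp2', 'temp3', 'temp4']) produces the same pipeline as the hand-written example.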
