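I have a large CSV file (~15 GB) with 1.8 million columns and 5K rows. I need to transpose the file, or alternatively find an efficient way to read it column by column. I'm looking for a time- and memory-efficient solution in Python 2.7, bash, or Matlab.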
CSV structure:
Column names increment from f0, f1 up to f1800000.
Each row has 1.8 million entries, each with a value of either 0 or 1.
---------------------------------------
f0,f1,f2 ......... ,f1800000
---------------------------------------
0,0,1 ......... ,0
1,0,1 ......... ,1
.........
---------------------------------------
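Here is a memory-efficient approach using pandas that processes the columns in small batches. Each batch re-reads the file, so a larger batch_size means fewer passes at the cost of more memory: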
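import pandas as pd

NCOLS = 1800000  # The exact number of columns; must be an int for range()
batch_size = 50
from_file = 'my_large_file.csv'
to_file = 'my_large_file_transposed.csv'

# Ceiling division: number of column batches needed to cover NCOLS
for batch in range(NCOLS // batch_size + bool(NCOLS % batch_size)):
    lcol = batch * batch_size
    rcol = min(NCOLS, lcol + batch_size)
    # Load only columns lcol..rcol-1 (all rows)
    data = pd.read_csv(from_file, usecols=range(lcol, rcol))
    # Append the transposed batch; each output row starts with the column name
    with open(to_file, 'a') as _f:
        data.T.to_csv(_f, header=False)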
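
If pandas is not available, the same column-batching idea can be sketched with only the standard library csv module. This is a minimal illustration, not a tuned implementation: the function name transpose_csv_in_batches is made up for this example, it assumes Python 3 (the newline='' argument to open does not exist in Python 2.7), and it assumes every row has the same number of fields as the header. It counts the columns from the header instead of hard-coding them.

```python
import csv


def transpose_csv_in_batches(from_file, to_file, batch_size=50):
    """Transpose a CSV by re-reading it once per batch of columns."""
    # Read only the header line to learn the number of columns.
    with open(from_file, newline='') as src:
        ncols = len(next(csv.reader(src)))

    with open(to_file, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for lcol in range(0, ncols, batch_size):
            rcol = min(ncols, lcol + batch_size)
            # One output row per input column in this batch.
            columns = [[] for _ in range(rcol - lcol)]
            # Full pass over the input, keeping only fields lcol..rcol-1.
            with open(from_file, newline='') as src:
                for row in csv.reader(src):
                    for i, value in enumerate(row[lcol:rcol]):
                        columns[i].append(value)
            # The header is treated like any other row, so each output
            # row starts with the column name followed by its values.
            writer.writerows(columns)
```

Calling transpose_csv_in_batches('my_large_file.csv', 'my_large_file_transposed.csv', batch_size=50) would hold at most batch_size columns in memory at a time, plus the one input row currently being parsed. As with the pandas version, the input is scanned once per batch, so on a 15 GB file the batch size is the main knob for trading memory against run time.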