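Filter a large CSV file (10GB) by column value in Python

EDIT: Added complexity

I have a large CSV file, and I want to filter its rows based on column values. For example, consider the following CSV file format: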

Col1,Col2,Nation,State,Col4...
a1,b1,Germany,state1,d1...
a2,b2,Germany,state2,d2...
a3,b3,USA,AL,d3...
a3,b3,USA,AL,d4...
a3,b3,USA,AK,d5...
a3,b3,USA,AK,d6...

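I want to filter all rows where Nation == 'USA', and then split those rows by each of the 50 states. What is the most efficient way to do this? I am using Python. Thanks

Also, is R better than Python for this kind of task?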

Use boolean indexing or DataFrame.query:

df1 = df[df['Nation'] == "Japan"]

Or:

df1 = df.query('Nation == "Japan"')

The second should be faster on large frames; see the performance of query in the pandas docs.
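Neither line above covers the second half of the question (one subset per state). A minimal sketch of one way to do that with groupby, assuming the data fits in memory and using the question's sample rows with the trailing columns dropped; the names `sample`, `usa`, and `by_state` are only illustrative:

```python
import io
import pandas as pd

# Sample rows in the shape shown in the question (elided columns omitted).
sample = io.StringIO(
    "Col1,Col2,Nation,State,Col4\n"
    "a1,b1,Germany,state1,d1\n"
    "a2,b2,Germany,state2,d2\n"
    "a3,b3,USA,AL,d3\n"
    "a3,b3,USA,AL,d4\n"
    "a3,b3,USA,AK,d5\n"
    "a3,b3,USA,AK,d6\n"
)
df = pd.read_csv(sample)

# Keep only the USA rows, then build one DataFrame per state.
usa = df[df['Nation'] == 'USA']
by_state = {state: group for state, group in usa.groupby('State')}

print(sorted(by_state))       # state codes found among the USA rows
print(len(by_state['AL']))    # number of rows for one state
```

Each value in `by_state` is an ordinary DataFrame, so it can be written out with `to_csv` or processed further.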

If that is still not possible (not enough RAM), try Dask, as Jon Clements suggested in the comments (thanks).
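If adding Dask is not an option, pandas itself can read the file in pieces. This is a different technique from Dask, shown here only as a rough sketch: `filter_in_chunks` is a hypothetical helper, the chunk size is an arbitrary guess to tune against available RAM, and it assumes the filtered result (unlike the full file) fits in memory.

```python
import pandas as pd

def filter_in_chunks(path, column, value, chunksize=1_000_000):
    """Read `path` in pieces and keep only rows where `column` equals `value`."""
    kept = []
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of loading the whole file at once.
    for chunk in pd.read_csv(path, chunksize=chunksize):
        kept.append(chunk[chunk[column] == value])
    # Only the matching rows are held in memory and concatenated.
    return pd.concat(kept, ignore_index=True)
```

Usage would look like `df1 = filter_in_chunks('yourfile.csv', 'Nation', 'Japan')`.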

Given the size of the data, one approach is to filter the CSV first and then load it:

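import csv

# Stream row by row so the whole file never has to fit in memory.
with open('yourfile.csv', 'r', newline='') as f_in:
    with open('yourfile_edit.csv', 'w', newline='') as f_outfile:
        reader = csv.reader(f_in)
        f_out = csv.writer(f_outfile)
        # Use the header only to locate the column; it is not written out
        i = next(reader).index('Nation')
        for row in reader:
            # Compare the Nation field itself, not the whole line
            if row[i] == 'Japan':
                f_out.writerow(row)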

Now load the CSV:

import pandas as pd
df = pd.read_csv('yourfile_edit.csv', sep=',', header=None)

You get:

    0   1   2   3       4
0   2   a3  b3  Japan   d3

You can open the file, find the position of the Nation header, and then iterate over the reader().

import csv
temp = r'C:\path\to\file'
with open(temp, 'r', newline='') as f:
    cr = csv.reader(f, delimiter=',')
    # next(cr) gets the header row (row[0])
    i = next(cr).index('Nation')
    # list comprehension through remaining cr iterables
    filtered = [row for row in cr if row[i] == 'Japan']
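The list comprehension above keeps every matching row in memory, which may still be a lot for a 10GB file. A possible streaming variant, sketched here under some assumptions, writes each matching row straight to a per-state file instead. `split_by_state` and the one-file-per-state layout are hypothetical, and the sketch assumes the State values are safe to use as file names.

```python
import csv
import os

def split_by_state(src, out_dir, nation='USA'):
    """Stream `src` once and write one CSV per state for the given nation."""
    os.makedirs(out_dir, exist_ok=True)
    files, writers = {}, {}
    try:
        with open(src, 'r', newline='') as f:
            cr = csv.reader(f, delimiter=',')
            header = next(cr)
            n, s = header.index('Nation'), header.index('State')
            for row in cr:
                if row[n] != nation:
                    continue
                state = row[s]
                if state not in writers:
                    # Open each state's file lazily, the first time it appears
                    path = os.path.join(out_dir, f'{state}.csv')
                    files[state] = open(path, 'w', newline='')
                    writers[state] = csv.writer(files[state])
                    writers[state].writerow(header)
                writers[state].writerow(row)
    finally:
        # Close every output file even if reading fails part-way
        for fh in files.values():
            fh.close()
    return sorted(writers)
```

Only one input row is held at a time, and at most one output file per state stays open, so memory use does not grow with the file size.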
