Reading a large text file with pandas



I am reading a large 25 GB CSV file into a pandas DataFrame. My PC specs are:

  • Intel Core i7-8700 3.2GHz
  • 16 GB RAM
  • Windows 10
  • dataframe.shape = 144,000,000 rows by 13 cols
  • CSV file size on disk is about 24 GB

Sometimes reading this file takes a long time, around 20 minutes. Any suggestions or code for how I could do this better?

*Note: I need this DataFrame in its entirety, because I am going to join (merge) it with another one.

You can use dask.dataframe:

import dask.dataframe as dd  # Dask's parallel, larger-than-memory DataFrame
df = dd.read_csv('filename.csv')  # lazy read: the CSV is only actually scanned when you compute
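
Note that dd.read_csv is lazy, so the call itself returns almost instantly; the real work happens when you call .compute(). Since the question says the DataFrame is needed for a merge, here is a minimal sketch of doing that join in Dask. The second file other.csv and the key column 'id' are hypothetical placeholders, not anything from the original question:

import dask.dataframe as dd

df = dd.read_csv('filename.csv')    # lazy, parallel CSV reader
other = dd.read_csv('other.csv')    # hypothetical second table to merge with
merged = df.merge(other, on='id')   # lazy join on a hypothetical key column 'id'
result = merged.compute()           # triggers the actual read and merge in parallel

If the merged result does not fit into 16 GB of RAM, keep working on merged as a Dask DataFrame and only compute() aggregated or filtered results.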

Or you can read the file in chunks:

import pandas as pd

def chunk_processing(chunk):  # function applied to each chunk
    ## Do Something  # put your per-chunk processing here and return the result
    return chunk

chunk_list = []  # list to collect the processed chunks
chunksize = 10 ** 6  # number of rows per chunk
for chunk in pd.read_csv('filename.csv', chunksize=chunksize):  # stream the CSV chunk by chunk
    processed_chunk = chunk_processing(chunk)  # process each chunk
    chunk_list.append(processed_chunk)  # collect the processed chunk
df_concat = pd.concat(chunk_list)  # combine the processed chunks into one DataFrame
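
As a concrete (hypothetical) example of what chunk_processing might do, the sketch below keeps only the rows needed downstream; the column name 'value' and the threshold are placeholders, not anything from the original question:

def chunk_processing(chunk):
    # Keep only the rows you actually need for the later merge (column 'value' is hypothetical),
    # so the concatenated DataFrame ends up much smaller than the raw 24 GB CSV.
    return chunk[chunk['value'] > 0]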
