Optimizing iteration over large datasets



I have two large datasets, df1 and df2, each with a column recording the time of each observation. I want to compute the time difference between every entry of df1 and every entry of df2.

The code below works, but I run into a memory error when I try to run it on the whole dataset. How can I make it more memory efficient?

import pandas as pd

df1 = pd.read_csv("table0.csv")
df2 = pd.read_csv("table1.csv")
LINE_NUMBER_table0 = []  # Row indices of table0, one entry per (i, j) pair
LINE_NUMBER_table1 = []  # Row indices of table1, one entry per (i, j) pair
TIME_DIFFERENCE = []  # Time difference between row i of table0 and row j of table1
for i in range(1000):
    for j in range(1000):
        LINE_NUMBER_table0.append(i)  # Record the row index i of table0
        LINE_NUMBER_table1.append(j)  # Record the row index j of table1
        timedifference = df1["mjd"][i] - df2["MJD"][j]  # Time difference between row i and row j
        TIME_DIFFERENCE.append(timedifference)  # Store this time difference

You don't need a loop for this. Python loops are generally inefficient, especially when iterating over Pandas dataframes (see this post). You should use vectorized calls instead, such as Numpy functions or Pandas functions. In this case, you can use np.tile and np.repeat. Here is an (untested) example:

import numpy as np
import pandas as pd

df1 = pd.read_csv("table0.csv")
df2 = pd.read_csv("table1.csv")
tmp = np.arange(1000)
LINE_NUMBER_table0 = np.repeat(tmp, 1000)  # 0,0,...,0, 1,1,...,1, ..., 999 (each index repeated 1000 times)
LINE_NUMBER_table1 = np.tile(tmp, 1000)    # 0,1,...,999, 0,1,...,999, ... (the whole range tiled 1000 times)
df1_mjd = np.repeat(df1["mjd"].to_numpy(), 1000)  # Each mjd value repeated 1000 times
df2_MJD = np.tile(df2["MJD"].to_numpy(), 1000)    # The whole MJD column tiled 1000 times
TIME_DIFFERENCE = df1_mjd - df2_MJD  # All pairwise differences in one vectorized subtraction
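
If you want the row pairs and their differences in one table, the three arrays can be packed into a dataframe. A minimal sketch, assuming the arrays from the example above (the result column names here are made up):

result = pd.DataFrame({
    "row_table0": LINE_NUMBER_table0,    # row index in table0 (hypothetical name)
    "row_table1": LINE_NUMBER_table1,    # row index in table1 (hypothetical name)
    "time_difference": TIME_DIFFERENCE,  # pairwise time difference (hypothetical name)
})
print(result.head())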

Note that you can convert a Numpy array back to a list with your_array.tolist(), but it is better to stick with Numpy arrays for performance. (Pandas uses Numpy arrays internally, so converting between a Pandas dataset and a Numpy array is cheap compared to converting to a list.)
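
To illustrate the cost difference, a minimal sketch of the two conversions (the variable names are made up):

import pandas as pd

s = pd.Series([1.5, 2.5, 3.5])  # toy stand-in for df1["mjd"]
arr = s.to_numpy()      # cheap: usually no copy for a plain numeric column
as_list = arr.tolist()  # costly on large arrays: creates one Python object per element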
