Speeding up pandas processing

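I am comparing one column of a DataFrame (which I process vertically) against 3 other DataFrames, and I would like to know whether this process can use more cores or otherwise be made faster. I tried concurrent.futures.ProcessPoolExecutor(), but it was actually 1 second slower... Here is my code: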

import sys

# df_out is the main DataFrame; hikari_data_df, kokyaku_data_df and
# hikanshou_data_df are the DataFrames to compare against.
m1 = df_out[self.col_name_].isin(hikari_data_df['phone_num1'])
m2 = df_out[self.col_name_].isin(hikari_data_df['phone_num2'])
# Add new columns to df_out that hold the value where it matched, NaN elsewhere.
df_out['new1'] = df_out[self.col_name_].where(m1)
df_out['new2'] = df_out[self.col_name_].where(m2)
m1 = df_out[self.col_name_].isin(kokyaku_data_df['phone_number1'])
m2 = df_out[self.col_name_].isin(kokyaku_data_df['phone_number2'])
df_out['new3'] = df_out[self.col_name_].where(m1)
df_out['new4'] = df_out[self.col_name_].where(m2)
m1 = df_out[self.col_name_].isin(hikanshou_data_df['phone_number'])
df_out['new5'] = df_out[self.col_name_].where(m1)

# Write the result to the path given as the first command-line argument.
df_out.to_csv(sys.argv[1], index=False)

I would like this process to be faster!

First, if your data is not large, try rewriting the "isin"/"where" lookups as "join"/"merge" operations. This costs more memory, but it can be considerably faster.
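To illustrate the merge idea, here is a minimal sketch, not a drop-in replacement for the code in the question. The helper name add_match_column, the column name "phone" and the sample phone numbers are made up for the example; only hikari_data_df, phone_num1 and phone_num2 come from the question. It assumes the key column holds strings and that a fresh 0..n-1 index on the result is acceptable.

```python
import pandas as pd


def add_match_column(df_out, key_col, lookup, lookup_col, new_col):
    """Return a copy of df_out with new_col equal to key_col where the key
    appears in lookup[lookup_col], and NaN elsewhere (hypothetical helper)."""
    # Drop missing keys and duplicates so the left merge cannot add rows.
    keys = lookup[[lookup_col]].dropna().drop_duplicates()
    keys = keys.rename(columns={lookup_col: new_col})
    # A left merge keeps every row of df_out in its original order;
    # rows without a match get NaN in new_col.
    return df_out.merge(keys, how="left", left_on=key_col, right_on=new_col)


# Made-up sample data, for illustration only.
df_out = pd.DataFrame({"phone": ["111", "222", "333", "444"]})
hikari_data_df = pd.DataFrame({
    "phone_num1": ["111", "999", "111"],
    "phone_num2": ["333", "444", "000"],
})

result = add_match_column(df_out, "phone", hikari_data_df, "phone_num1", "new1")
result = add_match_column(result, "phone", hikari_data_df, "phone_num2", "new2")
print(result)
```

Whether this beats isin/where depends on the data, so it is worth timing both versions on a realistic sample before switching.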

Second, use dask. Be careful, though: if your data is not large enough, dask will be slower.
