How to parallelize a column-wise function



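I have a function that performs some operations on each DataFrame column and extracts a shorter series from it (in the original code, some time-consuming computation is going on). It then adds that series to a dictionary before moving on to the next column.

At the end, it creates a DataFrame from the dictionary and manipulates its index.

How can I parallelize the loop over the columns?

Here is a less complicated, reproducible code example.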

import pandas as pd

raw_df = pd.DataFrame({"A": [1.1] * 100000,
                       "B": [2.2] * 100000,
                       "C": [3.3] * 100000})


def preprocess_columns(raw_df):
    df = {}
    width = 137

    for name in raw_df.columns:
        # Note: the operations in this loop have no deeper meaning and only
        # illustrate the function preprocess_columns. In the original code
        # there are ~50 lines of list comprehensions etc.

        # 3. do some column operations (actually there's more than just this operation)
        seriesF = raw_df[[name]].dropna()
        afterDropping_indices = seriesF.index.copy(deep=True)
        list_ = list(raw_df[name])[width:]
        df[name] = pd.Series(list_.copy(), index=afterDropping_indices[width:])

    # create df from dict and reindex (reverse the row order)
    df = pd.concat(df, axis=1)
    df = df.reindex(df.index[::-1])
    return df


raw_df = preprocess_columns(raw_df)

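Maybe you could use this: https://github.com/xieqihui/pandas-multiprocess

Install it from the command line:

pip install pandas-multiprocess

Then, in Python (func is the function to apply and df is the DataFrame to process; neither is defined in this snippet):

from pandas_multiprocess import multi_process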

args = {'width': 137}
result = multi_process(func=func, data=df, num_process=8, **args)
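
If you would rather not add a dependency, the per-column loop from the question can also be spread over processes with the standard library. The following is only a minimal sketch, not a drop-in replacement: the names process_column and preprocess_columns_parallel are made up for illustration, and the body of process_column assumes the real per-column work depends on nothing but that one column and width. Unlike the question's loop, it takes both values and index from the column after dropna(), so the lengths always match.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

import pandas as pd


def process_column(column, width):
    """Stand-in for the expensive per-column work (hypothetical helper)."""
    # Drop missing values, then skip the first `width` remaining entries.
    cleaned = column.dropna()
    return cleaned.iloc[width:]


def preprocess_columns_parallel(raw_df, width=137, max_workers=None):
    """Run process_column on every column in separate worker processes."""
    names = list(raw_df.columns)
    columns = [raw_df[name] for name in names]
    # partial() binds `width`; the worker must be a module-level function
    # so that it can be pickled and sent to the worker processes.
    worker = partial(process_column, width=width)
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, so they line up with `names`.
        results = list(pool.map(worker, columns))
    # Same final steps as in the question: build the frame, reverse the rows.
    df = pd.concat(dict(zip(names, results)), axis=1)
    return df.reindex(df.index[::-1])


if __name__ == "__main__":
    raw_df = pd.DataFrame({"A": [1.1] * 100000,
                           "B": [2.2] * 100000,
                           "C": [3.3] * 100000})
    result = preprocess_columns_parallel(raw_df, width=137, max_workers=2)
    print(result.shape)
```

Each column is pickled, sent to a worker, and the result is pickled back, so this probably only pays off when the per-column computation is as heavy as the question describes. For cheap operations like the ones in the toy example, the plain loop is likely to be faster. The `if __name__ == "__main__":` guard is needed on platforms that start workers by re-importing the main module.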
