Am I using a slow method when applying an if function to a DataFrame? (Python, Pandas)



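First question on here in a long time; I recently got back into using Python. I've been cleaning and preparing some data with pandas, and I've found that applying a function to a smaller sample (500,000 rows) of the total data (~30,000,000 rows) takes a very long time (~8 minutes) for my particular block of code. My thinking is that I've written something that works but is far from optimal for what I'm trying to do, and that it will become a very lengthy process when applied to the whole dataset. I'm not entirely sure, but I think a program like Alteryx would run this kind of thing faster, so I figure I must be doing something wrong. Any help or ideas to make it faster would be much appreciated!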

Example DataFrame:

import pandas as pd

po_data = pd.DataFrame({
    'Order Quantity Received Type': ['Order Cancelled - None Received', 'Order Partially Fulfilled'],
    'Order Quantity Change Type': ['Order Cancelled', 'Increased'],
    'Received Quantity': [0, 3],
    'Current Order Quantity': [0, 5],
})

Function:

def order_quantity_received(df, output_col, cancelled, received_quant, ordered_quant):
    # df is a single row here, because the function is called through apply(axis=1)
    if (df[cancelled] == "Order Cancelled") & (df[received_quant] == 0):
        df[output_col] = "Order Cancelled - None Received"
    elif (df[cancelled] == "Order Cancelled") & (df[received_quant] > 0):
        df[output_col] = "Order Cancelled - Items Received"
    elif df[received_quant] > df[ordered_quant]:
        df[output_col] = "Order Over Fulfilled"
    elif (df[received_quant] < df[ordered_quant]) & (df[received_quant] > 0):
        df[output_col] = "Order Partially Fulfilled"
    elif df[received_quant] == df[ordered_quant]:
        df[output_col] = "Order Fully Fulfilled"
    elif (df[received_quant] == 0) & (df[ordered_quant] > 0):
        df[output_col] = "Order Not Fulfilled"
    else:
        df[output_col] = "Error"
    return df

Function call:

po_data = po_data.apply(lambda po_data: order_quantity_received(po_data,'Order Quantity Received Type','Order Quantity Change Type','Received Quantity','Current Order Quantity'),axis=1)

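The fastest approach with Pandas and NumPy is to vectorize your function. Running a function element by element over an array or Series with a for loop, a list comprehension, or apply() is bad practice.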

I'll give just one example, "order cancelled":

def order_cancelled(a, b):
    ## define your function logic however you want
    return a - b

Then vectorize your function:

df['output_col'] = np.vectorize(order_cancelled)(df['cancelled'], df['received_quant'])
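One caveat: NumPy's documentation describes np.vectorize as provided primarily for convenience rather than performance, with an implementation that is essentially a for loop. For the full set of rules in the question, a column-wise version could look roughly like the sketch below. It assumes np.select is an acceptable substitute for the if/elif chain (it takes the first matching condition, which mirrors the elif order). The helper name classify_received and the sample rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows, one per branch of the original if/elif chain.
po_data = pd.DataFrame({
    'Order Quantity Change Type': ['Order Cancelled', 'Order Cancelled',
                                   'Increased', 'Increased', 'Increased', 'Increased'],
    'Received Quantity': [0, 2, 7, 3, 5, 0],
    'Current Order Quantity': [0, 0, 5, 5, 5, 5],
})

def classify_received(df, cancelled, received_quant, ordered_quant):
    """Label every row at once using whole-column boolean masks."""
    is_cancelled = df[cancelled] == "Order Cancelled"
    received = df[received_quant]
    ordered = df[ordered_quant]

    # Order matters: np.select uses the first condition that is True per row.
    conditions = [
        is_cancelled & (received == 0),
        is_cancelled & (received > 0),
        received > ordered,
        (received < ordered) & (received > 0),
        received == ordered,
        (received == 0) & (ordered > 0),
    ]
    labels = [
        "Order Cancelled - None Received",
        "Order Cancelled - Items Received",
        "Order Over Fulfilled",
        "Order Partially Fulfilled",
        "Order Fully Fulfilled",
        "Order Not Fulfilled",
    ]
    # Rows matching no condition fall through to "Error", like the else branch.
    return np.select(conditions, labels, default="Error")

po_data['Order Quantity Received Type'] = classify_received(
    po_data,
    'Order Quantity Change Type',
    'Received Quantity',
    'Current Order Quantity',
)
```

Each condition is evaluated once over the entire column instead of once per row, which is where the speedup over apply(axis=1) would be expected to come from.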
