How to use vectorization instead of a for loop in pandas



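I am trying to build a machine learning algorithm for my job. The data I use for training and testing has 17k rows and 20 columns. I tried to add a new column based on two other columns, but the for loop I wrote is too slow (it takes 3 seconds to execute).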

for i in range(0, len(model_olculeri)):
    if (model_olculeri["Bel"][i] != 0) and (model_olculeri["Basen"][i] != 0):
        sum_column = (model_olculeri["Bel"][i]) / (model_olculeri["Basen"][i])
        model_olculeri["Waist to Hip Ratio"][i] = sum_column

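I have read that vectorization with pandas and numpy is faster and more efficient than a for loop over a pandas DataFrame. How can I implement that kind of vectorization for my for loop? Thank you very much.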

Create a boolean mask and use it to filter:

m = (model_olculeri["Bel"] != 0) & (model_olculeri["Basen"] != 0)
model_olculeri.loc[m,"Waist to Hip Ratio"] = model_olculeri.loc[m, "Bel"] / model_olculeri.loc[m,"Basen"]

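Alternatively, filter only on the left-hand side; pandas aligns the right-hand side on the index, so only the masked rows are assigned: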

model_olculeri.loc[m,"Waist to Hip Ratio"] = model_olculeri["Bel"] / model_olculeri["Basen"]
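
The mask approach above can be tried on a small frame. This is only a minimal sketch: the sample values are made up for illustration, and only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
model_olculeri = pd.DataFrame({
    "Bel": [70.0, 0.0, 80.0, 90.0],
    "Basen": [100.0, 95.0, 0.0, 120.0],
})

# True only where both measurements are non-zero.
m = (model_olculeri["Bel"] != 0) & (model_olculeri["Basen"] != 0)

# Assign the ratio for the masked rows; rows outside the mask stay NaN.
model_olculeri.loc[m, "Waist to Hip Ratio"] = (
    model_olculeri.loc[m, "Bel"] / model_olculeri.loc[m, "Basen"]
)

print(model_olculeri)
```

Rows where either value is zero are left as NaN, which is usually easier to handle downstream than a division result of inf.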

Or set the new values with numpy.where:

import numpy as np

model_olculeri["Waist to Hip Ratio"] = np.where(m, model_olculeri["Bel"] / model_olculeri["Basen"], np.nan)
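
To check that a vectorized version gives the same result as a row-by-row loop, something like the sketch below could be used. The sample data is hypothetical, and the loop is rewritten with `.at` and a pre-created column (an assumption, not the asker's exact code) so that it runs without chained assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "Bel": [70.0, 0.0, 80.0, 90.0],
    "Basen": [100.0, 95.0, 0.0, 120.0],
})

# Row-by-row version: create the column first, then fill it with .at.
looped = df.copy()
looped["Waist to Hip Ratio"] = np.nan
for i in range(len(looped)):
    if looped.at[i, "Bel"] != 0 and looped.at[i, "Basen"] != 0:
        looped.at[i, "Waist to Hip Ratio"] = looped.at[i, "Bel"] / looped.at[i, "Basen"]

# Vectorized version: one mask, one division over whole columns.
vectorized = df.copy()
m = (vectorized["Bel"] != 0) & (vectorized["Basen"] != 0)
vectorized["Waist to Hip Ratio"] = np.where(
    m, vectorized["Bel"] / vectorized["Basen"], np.nan
)

# Series.equals treats NaN values in the same positions as equal.
same = looped["Waist to Hip Ratio"].equals(vectorized["Waist to Hip Ratio"])
print(same)
```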

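A chained solution with query and pipe (this returns a new DataFrame that contains only the rows where both values are non-zero):

# The column name contains spaces, so it is passed to assign via dict unpacking.
model_olculeri.query("Bel != 0 & Basen != 0").pipe(lambda x: x.assign(**{"Waist to Hip Ratio": x.Bel / x.Basen}))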
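
One difference worth noting is that the query/pipe chain drops the rows that fail the condition, while numpy.where keeps every row and fills the rest with NaN. A short sketch on made-up data (the values and the names `kept_all` and `filtered` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "Bel": [70.0, 0.0, 80.0, 90.0],
    "Basen": [100.0, 95.0, 0.0, 120.0],
})

# numpy.where keeps all rows; rows failing the condition get NaN.
m = (df["Bel"] != 0) & (df["Basen"] != 0)
kept_all = df.assign(
    **{"Waist to Hip Ratio": np.where(m, df["Bel"] / df["Basen"], np.nan)}
)

# query filters first, so the result only has the matching rows.
filtered = df.query("Bel != 0 & Basen != 0").pipe(
    lambda x: x.assign(**{"Waist to Hip Ratio": x.Bel / x.Basen})
)

print(len(kept_all), len(filtered))
```

If the new column has to line up with the original 17k rows for training, the mask or numpy.where variants are probably the safer choice.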
