Pandas DataFrame "apply" using a recursive lambda function: is it possible?



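I have a DataFrame that represents a recursive parent-child relationship. The data in this case are called "factor families".

Each factor family contains a number of factors, which are weighted so that the weights within each family add up to 100%.

A factor may itself be a factor family.

There is no limit to the depth of recursion.

For example: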

```
a               b               c
10%             40%             50%
|               |
---------       ---------
|   |   |       |   |   |
d   e   f       g   h   i
20% 30% 50%     10% 20% 70%
                    |
                ---------
                |       |
                j       k
                60%     40%
```

I have represented this in pandas with the following DataFrame:

```python
import pandas as pd

# One row per factor; root factors have an empty parent_code
df = pd.DataFrame({
    "code": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"],
    "weight": [0.1, 0.4, 0.5, 0.2, 0.3, 0.5, 0.1, 0.2, 0.7, 0.6, 0.4],
    "parent_code": ["", "", "", "a", "a", "a", "b", "b", "b", "h", "h"]
})
df.set_index("code", inplace=True)
df
```

Output:

| code | weight | parent_code |
|------|--------|-------------|
| a    | 0.1    |             |
| b    | 0.4    |             |
| c    | 0.5    |             |
| d    | 0.2    | a           |
| e    | 0.3    | a           |
| f    | 0.5    | a           |
| g    | 0.1    | b           |
| h    | 0.2    | b           |
| i    | 0.7    | b           |
| j    | 0.6    | h           |
| k    | 0.4    | h           |
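The rule that every family adds up to 100% can be checked directly on this DataFrame. The following is only a sketch: the names `family_totals` and `bad_families` and the `1e-9` tolerance are illustrative choices, not part of the original post.

```python
import pandas as pd

df = pd.DataFrame({
    "code": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"],
    "weight": [0.1, 0.4, 0.5, 0.2, 0.3, 0.5, 0.1, 0.2, 0.7, 0.6, 0.4],
    "parent_code": ["", "", "", "a", "a", "a", "b", "b", "b", "h", "h"],
}).set_index("code")

# Sum the weights of the children of each parent; the "" group holds the root factors.
family_totals = df.groupby("parent_code")["weight"].sum()

# Compare with a tolerance (an assumed 1e-9) because the weights are floats.
bad_families = family_totals[(family_totals - 1.0).abs() > 1e-9]

print(family_totals)
print("families not summing to 100%:", list(bad_families.index))
```

An empty list on the last line means every family in the table is properly weighted.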

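I then added a calculated column, which is a factor's weight multiplied by the weights of all its ancestors (its parent, its parent's parent, and so on). I call it terminal_weight.

As a result, the terminal weights of the terminal nodes (c, d, e, f, g, i, j and k in the DataFrame above) sum to 100%.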

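```python
def parent_weight(code, family_factors):
    # Multiply this factor's weight by the result for its parent.
    # A code that is not in the index (the "" parent of a root) ends the recursion.
    if code in family_factors.index:
        return family_factors["weight"][code] * parent_weight(family_factors["parent_code"][code], family_factors)
    else:
        return 1

# x.name is the row's index label, i.e. the factor code
df["terminal_weight"] = df.apply(lambda x: parent_weight(x.name, df), axis=1)
df
```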

Output:

| code | weight | parent_code | terminal_weight |
|------|--------|-------------|-----------------|
| a    | 0.1    |             | 0.100           |
| b    | 0.4    |             | 0.400           |
| c    | 0.5    |             | 0.500           |
| d    | 0.2    | a           | 0.020           |
| e    | 0.3    | a           | 0.030           |
| f    | 0.5    | a           | 0.050           |
| g    | 0.1    | b           | 0.040           |
| h    | 0.2    | b           | 0.080           |
| i    | 0.7    | b           | 0.280           |
| j    | 0.6    | h           | 0.048           |
| k    | 0.4    | h           | 0.032           |
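The claim that the terminal nodes add up to 100% can be verified on this output. A minimal sketch, assuming a terminal node is any code that never appears in parent_code; `is_leaf` and `leaf_total` are names made up for this check, and the recursion is the question's own function, restated so the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "code": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"],
    "weight": [0.1, 0.4, 0.5, 0.2, 0.3, 0.5, 0.1, 0.2, 0.7, 0.6, 0.4],
    "parent_code": ["", "", "", "a", "a", "a", "b", "b", "b", "h", "h"],
}).set_index("code")

def parent_weight(code, family_factors):
    # Same recursion as in the question: stop at a code that is not in the index.
    if code in family_factors.index:
        parent = family_factors.loc[code, "parent_code"]
        return family_factors.loc[code, "weight"] * parent_weight(parent, family_factors)
    return 1

df["terminal_weight"] = [parent_weight(code, df) for code in df.index]

# Assumption: a terminal node (leaf) is a code that is nobody's parent.
is_leaf = ~df.index.isin(df["parent_code"])
leaf_total = df.loc[is_leaf, "terminal_weight"].sum()

print(sorted(df.index[is_leaf]))
print(round(leaf_total, 10))
```

Intermediate nodes such as a, b and h are excluded from the sum because their weight is already distributed across their children.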

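So my question is: is there a smarter way to do this, so that I don't have to define the parent_weight function? Can I put it inside the lambda function passed to DataFrame.apply()?

Thanks in advance.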
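On the literal question: a lambda has no name of its own, so it can only recurse through a name it has been bound to. A minimal sketch of that, assuming the question's DataFrame indexed by code; the name `tw` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "code": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"],
    "weight": [0.1, 0.4, 0.5, 0.2, 0.3, 0.5, 0.1, 0.2, 0.7, 0.6, 0.4],
    "parent_code": ["", "", "", "a", "a", "a", "b", "b", "b", "h", "h"],
}).set_index("code")

# The lambda refers to itself through the name `tw`. That name is looked up
# when the lambda is called, by which time the assignment has completed.
# A code missing from the index (the "" parent of a root) ends the recursion.
tw = lambda code: df.at[code, "weight"] * tw(df.at[code, "parent_code"]) if code in df.index else 1

# Apply the lambda to every index label (factor code).
df["terminal_weight"] = df.index.map(tw)
print(df)
```

This removes the `def` but not the helper: the lambda still has to be bound to a name before it can call itself, and PEP 8 recommends a `def` over assigning a lambda to a name, so the original parent_weight function is arguably the cleaner form.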

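I would do it like this: loop over subsets of the DataFrame, using temporary columns to store the chained weight and the parent currently being tested. Note that I replaced the blank strings in df with np.nan values.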

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "code": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"],
    "weight": [0.1, 0.4, 0.5, 0.2, 0.3, 0.5, 0.1, 0.2, 0.7, 0.6, 0.4],
    "parent_code": [np.nan, np.nan, np.nan, "a", "a", "a", "b", "b", "b", "h", "h"]
})

# temp holds the ancestor currently being applied to each row
df['temp'] = df['parent_code']
df['terminal_weight'] = df['weight']

while True:
    # Distinct ancestors still to be applied; stop once every row has reached a root
    parents = df[df.temp.notnull()][['temp']].drop_duplicates(keep='first').copy()
    if len(parents) == 0:
        break

    # Look up each ancestor's own weight and its parent (the next ancestor up).
    # The raw weight is used rather than terminal_weight, which already contains
    # factors from earlier passes and would double-count them in deeper trees.
    parents = df[['code', 'weight', 'parent_code']].merge(
        parents.rename({"temp": "code"}, axis=1),
        on="code",
        how="inner"
    )
    parents.rename(
        {'weight': 'weight_parent', 'code': 'parent_code_temp', 'parent_code': 'temp'},
        axis=1,
        inplace=True
    )
    # Attach the ancestor's weight to each row and move temp one level up
    df = df.rename({'temp': 'parent_code_temp'}, axis=1).merge(
        parents,
        on='parent_code_temp',
        how='left'
    )
    df.drop('parent_code_temp', axis=1, inplace=True)
    # Rows that have already reached a root get a neutral factor of 1
    df["weight_parent"] = df["weight_parent"].fillna(1)
    df['terminal_weight'] = df['terminal_weight'] * df["weight_parent"]
    df.drop(['weight_parent'], axis=1, inplace=True)

df.drop('temp', axis=1, inplace=True)
print(df)
```
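The same level-by-level idea can be written more compactly with `Series.map` in place of the merges. This is a sketch under stated assumptions rather than the answer's method: codes are unique, the parent chain has no cycles (otherwise the loop would never end), and root factors are marked with `None`. The names `terminal` and `ancestor` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "code": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"],
    "weight": [0.1, 0.4, 0.5, 0.2, 0.3, 0.5, 0.1, 0.2, 0.7, 0.6, 0.4],
    "parent_code": [None, None, None, "a", "a", "a", "b", "b", "b", "h", "h"],
}).set_index("code")

weight = df["weight"]        # code -> own weight
parent = df["parent_code"]   # code -> parent code (missing for roots)

terminal = weight.copy()
ancestor = parent.copy()     # ancestor currently being applied to each row

while ancestor.notna().any():
    # Multiply by the ancestor's own weight; rows already past a root get a factor of 1.
    terminal = terminal * ancestor.map(weight).fillna(1.0)
    # Move every row one level up the tree.
    ancestor = ancestor.map(parent)

df["terminal_weight"] = terminal
print(df)
```

The loop runs once per level of the tree rather than once per row, so the cost grows with the depth of the hierarchy instead of the number of recursive calls.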
