I have some tree-structured data in a DataFrame:
   level   id  parent_id  type text
0      1    1       <NA>  node    a
1      2   11          1  node    b
2      2   12          1  node    c
3      2   13          1  leaf    d
4      3  111         11  leaf    e
5      3  121         12  leaf    f
6      3  122         12  leaf    g
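For reproducibility, the table above could be built roughly as follows. This is only a sketch: the values are copied from the table, while the dtypes are an assumption (the `<NA>` in `parent_id` suggests a nullable integer column such as `Int64`).

```python
import pandas as pd

# Values copied from the table above; the dtypes are an assumption.
df = pd.DataFrame(
    {
        "level": [1, 2, 2, 2, 3, 3, 3],
        "id": [1, 11, 12, 13, 111, 121, 122],
        # Nullable integer dtype so that the root can have a missing parent.
        "parent_id": pd.array([pd.NA, 1, 1, 1, 11, 12, 12], dtype="Int64"),
        "type": ["node", "node", "node", "leaf", "leaf", "leaf", "leaf"],
        "text": list("abcdefg"),
    }
)
print(df)
```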
I would like to get a DataFrame that looks like this:
level         1               2                 3               leaf
attributes   id  type text   id  type text     id  type text     id  type text
0             1  node    a   11  node    b    111  leaf    e    111  leaf    e
1             1  node    a   12  node    c    121  leaf    f    121  leaf    f
2             1  node    a   12  node    c    122  leaf    g    122  leaf    g
3             1  node    a   13  leaf    d   <NA>   NaN  NaN     13  leaf    d
My current solution looks like this:
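from functools import reduce

import pandas as pd


def join_fn(x, y):
    # Join level i (parents) with level j (children) on id -> parent_id.
    i, df1 = x
    j, df2 = y
    return (
        j,
        pd.merge(df1, df2, left_on=f"id_{i}", right_on=f"parent_id_{j}", how="outer"),
    )


# One frame per level, with every column suffixed by its level number.
dfs = list(df.groupby("level"))
dfs = [
    (i, df.rename(columns={col: col + f"_{i}" for col in df.columns})) for i, df in dfs
]
_, dfr = reduce(join_fn, dfs)
# Keep only id/type/text, then turn "id_1" into the two-level header ("1", "id").
dfr = dfr.filter([col for col in dfr.columns if col.startswith(("id", "text", "type"))])
idx = dfr.columns.str.split("_", expand=True)
dfr.columns = idx.swaplevel()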
which produces the following:
    1             2                 3
   id  type text  id  type text     id  type text
0   1  node    a  11  node    b  111.0  leaf    e
1   1  node    a  12  node    c  121.0  leaf    f
2   1  node    a  12  node    c  122.0  leaf    g
3   1  node    a  13  leaf    d    NaN   NaN  NaN
How can I get the last three columns, i.e. the ones that collect the leaves?
I am also open to improvements to my current code.
Here is one possible solution:
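import pandas as pd


def merge(ldf, rdf, lsuffix, rsuffix=None):
    # Attach each parent row (rdf) to its children (ldf); how='right' keeps
    # parents that have no children, such as leaf "d" on level 2.
    return ldf.merge(
        rdf,
        how='right',
        left_on='parent_id',
        right_on='id',
        suffixes=(lsuffix, rsuffix),
    ).drop(
        columns=[f'parent_id{lsuffix}', f'level{lsuffix}'],
    )


# Reverse the columns now so that reversing again at the end yields the
# levels in ascending order, each as id, type, text.
df = df[df.columns[::-1]]
# Start at the deepest level and merge upwards, one level at a time.
res = df[df['level'] == df['level'].max()]
for lev in range(df['level'].max() - 1, 1, -1):
    res = merge(res, df[df['level'] == lev], f'_{lev + 1}')
res = merge(res, df[df['level'] == 1], '_2', '_1')
res = res.drop(columns=['parent_id_1', 'level_1'])
res = res[res.columns[::-1]]

# For each attribute, the leaf value is the last non-null value across levels.
for prefix in ('id', 'type', 'text'):
    # .copy() avoids a SettingWithCopyWarning when adding the helper column.
    sub_res = res[[c for c in res.columns if c.startswith(prefix)]].copy()
    sub_res[f'{prefix}_leaf'] = [pd.NA] * len(sub_res)
    res[f'{prefix}_leaf'] = sub_res.ffill(axis=1)[f'{prefix}_leaf']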
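The result above uses flat column names such as id_1 and id_leaf. If the two-row header from the question is wanted, one possible sketch is to split the names on the underscore and swap the levels, which is the same trick the question's own code uses. The small `res` below is a hypothetical stand-in containing two rows from the expected output, and the level names "level" and "attributes" are taken from the desired table. It assumes every column name contains exactly one underscore.

```python
import pandas as pd

# Hypothetical stand-in for `res`, using the flat column names from above.
res = pd.DataFrame(
    {
        "id_1": [1, 1], "type_1": ["node", "node"], "text_1": ["a", "a"],
        "id_2": [11, 13], "type_2": ["node", "leaf"], "text_2": ["b", "d"],
        "id_3": [111, pd.NA], "type_3": ["leaf", pd.NA], "text_3": ["e", pd.NA],
        "id_leaf": [111, 13], "type_leaf": ["leaf", "leaf"], "text_leaf": ["e", "d"],
    }
)

wide = res.copy()
# "id_1" becomes ("id", "1"); swapping the levels puts the tree level on top.
wide.columns = (
    wide.columns.str.split("_", expand=True)
    .swaplevel()
    .set_names(["level", "attributes"])
)
print(wide)
```

Note that the top-level labels are strings ("1", "2", "3", "leaf") because they come from splitting the column names.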