Transforming a DataFrame with a tree structure



I have some tree-structured data in a DataFrame:

   level   id  parent_id  type text
0      1    1       <NA>  node    a
1      2   11          1  node    b
2      2   12          1  node    c
3      2   13          1  leaf    d
4      3  111         11  leaf    e
5      3  121         12  leaf    f
6      3  122         12  leaf    g
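
For anyone who wants to reproduce this, the frame above can be rebuilt roughly as follows. This is a minimal sketch: the nullable `Int64` dtype for `parent_id` is an assumption inferred from the `<NA>` in the printout, not something stated explicitly.

```python
import pandas as pd

# Sample data copied from the table above.
# Assumption: parent_id uses the nullable Int64 dtype (the root prints as <NA>).
df = pd.DataFrame(
    {
        "level": [1, 2, 2, 2, 3, 3, 3],
        "id": [1, 11, 12, 13, 111, 121, 122],
        "parent_id": pd.array([None, 1, 1, 1, 11, 12, 12], dtype="Int64"),
        "type": ["node", "node", "node", "leaf", "leaf", "leaf", "leaf"],
        "text": list("abcdefg"),
    }
)
```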

I would like to get a DataFrame that looks like this:


level       1              2                3            leaf           
attributes id  type text  id  type text    id  type text   id  type text
0           1  node    a  11  node    b   111  leaf    e  111  leaf    e
1           1  node    a  12  node    c   121  leaf    f  121  leaf    f
2           1  node    a  12  node    c   122  leaf    g  122  leaf    g
3           1  node    a  13  leaf    d  <NA>   NaN  NaN   13  leaf    d

My current solution is this:

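from functools import reduce

import pandas as pd


def join_fn(x, y):
    # x and y are (level, frame) pairs; join parents (left) to children (right)
    i, df1 = x
    j, df2 = y
    return (
        j,
        pd.merge(df1, df2, left_on=f"id_{i}", right_on=f"parent_id_{j}", how="outer"),
    )


# One frame per level, with every column suffixed by its level number
dfs = list(df.groupby("level"))
dfs = [
    (i, df.rename(columns={col: col + f"_{i}" for col in df.columns})) for i, df in dfs
]
_, dfr = reduce(join_fn, dfs)
# Keep only id/type/text, then turn "id_1" into the column pair ("1", "id")
dfr = dfr.filter([col for col in dfr.columns if col.startswith(("id", "text", "type"))])
idx = dfr.columns.str.split("_", expand=True)
dfr.columns = idx.swaplevel()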

This produces the following:


   1              2                 3
  id  type text  id  type text     id  type text
0  1  node    a  11  node    b  111.0  leaf    e
1  1  node    a  12  node    c  121.0  leaf    f
2  1  node    a  12  node    c  122.0  leaf    g
3  1  node    a  13  leaf    d    NaN   NaN  NaN

How can I get the last three columns, i.e. the ones that collect the leaves?

I am also open to improvements to my current code.

Here is a possible solution:

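import pandas as pd


def merge(ldf, rdf, lsuffix, rsuffix=None):
    # Attach each parent row (rdf) to its children (ldf); how='right' keeps
    # parents that have no children, i.e. leaves above the deepest level.
    return ldf.merge(
        rdf,
        how='right',
        left_on='parent_id',
        right_on='id',
        suffixes=(lsuffix, rsuffix),
    ).drop(
        columns=[f'parent_id{lsuffix}', f'level{lsuffix}'],
    )


# Reverse the column order so it comes out right after the final reversal
df = df[df.columns[::-1]]
# Start from the deepest level and merge upwards, one level at a time
res = df[df['level'] == df['level'].max()]
for lev in range(df['level'].max() - 1, 1, -1):
    res = merge(res, df[df['level'] == lev], f'_{lev + 1}')
res = merge(res, df[df['level'] == 1], '_2', '_1')
res = res.drop(columns=['parent_id_1', 'level_1'])
res = res[res.columns[::-1]]

# For each attribute, the leaf value is the last non-null entry across levels
for prefix in ('id', 'type', 'text'):
    # .copy() avoids a SettingWithCopyWarning when the helper column is added
    sub_res = res[[c for c in res.columns if c.startswith(prefix)]].copy()
    sub_res[f'{prefix}_leaf'] = [pd.NA] * len(sub_res)
    res[f'{prefix}_leaf'] = sub_res.ffill(axis=1)[f'{prefix}_leaf']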
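
The same "last non-null value per row" idea can also be applied directly to a frame with MultiIndex columns like the one the question already produces. The sketch below is only an illustration, not the method used above: `wide` is a hypothetical hand-built stand-in for the question's `dfr`, and it swaps in `Series.combine_first` for the `ffill(axis=1)` step. It assumes every root-to-leaf path ends in a leaf, so the deepest populated level in a row is that row's leaf.

```python
from functools import reduce

import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's `dfr` (tuple keys give MultiIndex columns).
wide = pd.DataFrame(
    {
        ("1", "id"): [1, 1, 1, 1],
        ("1", "type"): ["node"] * 4,
        ("1", "text"): ["a"] * 4,
        ("2", "id"): [11, 12, 12, 13],
        ("2", "type"): ["node", "node", "node", "leaf"],
        ("2", "text"): ["b", "c", "c", "d"],
        ("3", "id"): [111.0, 121.0, 122.0, np.nan],
        ("3", "type"): ["leaf", "leaf", "leaf", np.nan],
        ("3", "text"): ["e", "f", "g", np.nan],
    }
)

# Level labels are strings after the split, so sort them numerically.
levels = sorted(wide.columns.get_level_values(0).unique(), key=int)

for attr in ("id", "type", "text"):
    # Deepest level first; combine_first fills its gaps from shallower levels.
    candidates = [wide[(lev, attr)] for lev in reversed(levels)]
    wide[("leaf", attr)] = reduce(
        lambda deep, shallow: deep.combine_first(shallow), candidates
    )
```

Note that the `id` values under `leaf` stay floating point here because level 3 contains a NaN; casting the id columns to the nullable `Int64` dtype beforehand is one way to keep them integral.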
