I have a DataFrame of parent-child relationships, as shown below:
child  Parent  relationship
A1x2   bc11    direct_parent
bc11   Aw00    direct_parent
bc11   Aw00    ultimate_parent
Aee1   Aee0    direct_parent
Aee1   Aee0    ultimate_parent
I want to get all ancestors of every child node in a new DataFrame. The result would look like this:
node  ancestry_tree
A1x2  [A1x2, bc11, Aw00]
Aee1  [Aee1, Aee0]
Note: in the real dataset there may be many direct parents (intermediate nodes) between a child and its ultimate parent.
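For reproducibility, the sample table can be rebuilt as a DataFrame. This is only a sketch: the column names `child`, `Parent`, and `relationship` are taken from the table in the question, and the variable name `df` is assumed because the answers below use it.

```python
import pandas as pd

# Sample rows copied from the question's table.
# The variable name `df` is an assumption; the answers refer to it.
df = pd.DataFrame(
    {
        "child": ["A1x2", "bc11", "bc11", "Aee1", "Aee1"],
        "Parent": ["bc11", "Aw00", "Aw00", "Aee0", "Aee0"],
        "relationship": [
            "direct_parent",
            "direct_parent",
            "ultimate_parent",
            "direct_parent",
            "ultimate_parent",
        ],
    }
)
print(df)
```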
Another approach, using from_pandas_edgelist and ancestors from the networkx package:
import networkx as nx
import pandas as pd

# Create the directed graph, with edges pointing from parent to child
G = nx.from_pandas_edgelist(df,
                            source='Parent',
                            target='child',
                            create_using=nx.DiGraph())

# Create dict of nodes and ancestors (each node is included in its own set)
ancestors = {n: {n} | nx.ancestors(G, n) for n in df['child'].unique()}

# Convert dict back to DataFrame if necessary
# (sets are unordered, so the order within each list is not guaranteed)
df_ancestors = pd.DataFrame([(k, list(v)) for k, v in ancestors.items()],
                            columns=['node', 'ancestry_tree'])
print(df_ancestors)
[out]
   node       ancestry_tree
0  A1x2  [A1x2, Aw00, bc11]
1  bc11        [bc11, Aw00]
2  Aee1        [Aee1, Aee0]
To filter the "intermediate children" out of the output table, keep only the last children by using the out_degree method; a last child should have out_degree == 0:
last_children = [n for n, d in G.out_degree() if d == 0]
ancestors = {n: {n} | nx.ancestors(G, n) for n in last_children}
df_ancestors = pd.DataFrame([(k, list(v)) for k, v in ancestors.items()],
                            columns=['node', 'ancestry_tree'])
[out]
   node       ancestry_tree
0  A1x2  [A1x2, Aw00, bc11]
1  Aee1        [Aee1, Aee0]
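The ancestry_tree lists above are built from sets, so their order is arbitrary, whereas the question lists ancestors from nearest to farthest. One possible way to recover that order is to sort the ancestors by their distance from the node in the reversed graph. The following is a minimal sketch, not part of the original answer: `ordered_ancestry` is a hypothetical helper name, and the sample data is re-created inline so the block runs on its own. If a node has several parents at the same distance, their relative order is whatever the breadth-first search yields.

```python
import networkx as nx
import pandas as pd

# Sample edges from the question, re-created here so the block is self-contained.
df = pd.DataFrame({
    "child": ["A1x2", "bc11", "bc11", "Aee1", "Aee1"],
    "Parent": ["bc11", "Aw00", "Aw00", "Aee0", "Aee0"],
})

G = nx.from_pandas_edgelist(df, source="Parent", target="child",
                            create_using=nx.DiGraph())

# Reverse the edges so that moving "forward" from a node walks toward its ancestors.
R = G.reverse()

def ordered_ancestry(node):
    """Hypothetical helper: the node followed by its ancestors, nearest first."""
    # Maps every reachable node (including `node` itself, at distance 0)
    # to its breadth-first distance from `node`.
    dist = nx.single_source_shortest_path_length(R, node)
    return sorted(dist, key=dist.get)

last_children = [n for n, d in G.out_degree() if d == 0]
ordered = {n: ordered_ancestry(n) for n in last_children}

df_ordered = pd.DataFrame(list(ordered.items()),
                          columns=["node", "ancestry_tree"])
print(df_ordered)
```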
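- Create a dictionary of relationships
- Step through each child that is not a parent
- Track the ancestry path as well as a set of the nodes already visited
- The set is important because if we hit a node we have already seen, we want to terminate the while loop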
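# Map each child to its parent (duplicate rows collapse to one entry)
relate = dict(zip(df.child, df.Parent))
paths = {}
nodes = {}

# Start only from children that never appear as a parent
for child in relate.keys() - {*relate.values()}:
    paths[child] = [child]
    nodes[child] = {child}
    parent = relate[child]
    # Walk upward until the ultimate parent or an already-seen node is reached
    while parent in relate and parent not in nodes[child]:
        paths[child].append(parent)
        nodes[child].add(parent)
        parent = relate[parent]
    # Append the node that ended the walk (normally the ultimate parent)
    paths[child].append(parent)

pd.Series(paths).rename_axis('node').reset_index(name='ancestry_tree')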
   node       ancestry_tree
0  Aee1        [Aee1, Aee0]
1  A1x2  [A1x2, bc11, Aw00]
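To see why the set of visited nodes matters, the same walk can be wrapped in a plain-Python function and run on data that contains a cycle. This is a sketch under assumptions rather than the answer's exact code: `ancestry_paths` is a hypothetical name, the cyclic mapping is made up for illustration, and this variant deliberately skips re-appending a node that was already visited, whereas the loop above always appends the node that ended the walk.

```python
def ancestry_paths(relate):
    """Hypothetical helper: map each leaf child to its chain of ancestors.

    `relate` maps child -> parent. Leaves are keys that never appear as a value.
    """
    paths = {}
    for child in relate.keys() - set(relate.values()):
        path, seen = [child], {child}
        parent = relate[child]
        # Stop at the ultimate parent (not a key) or at a node already visited.
        while parent in relate and parent not in seen:
            path.append(parent)
            seen.add(parent)
            parent = relate[parent]
        # Variant of the original: only append the final node if it is new,
        # so a cycle does not produce a duplicated entry.
        if parent not in seen:
            path.append(parent)
        paths[child] = path
    return paths


# Data from the question: a simple chain and a single edge.
sample = {"A1x2": "bc11", "bc11": "Aw00", "Aee1": "Aee0"}
print(ancestry_paths(sample))

# Made-up cyclic data: x -> a -> b -> a. Without the `seen` check the
# while loop would never terminate.
cyclic = {"x": "a", "a": "b", "b": "a"}
print(ancestry_paths(cyclic))
```

Note that a graph consisting only of a cycle has no leaf child, so this function would return an empty dictionary for it.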