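I have a DataFrame with four columns: parent_serialno, child_serialno, parent_function, and child_function. I want to build a DataFrame in which each row is a root parent, each column is a function, and each value is the serial number for that function.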
For example, the DataFrame looks like this:
import pandas as pd

df = pd.DataFrame(
    [['001', '010', 'A', 'B'], ['001', '020', 'A', 'C'], ['010', '100', 'B', 'D'], ['100', '110', 'D', 'E'],
     ['002', '030', 'A', 'B'], ['002', '040', 'A', 'C']],
    columns=['parent_serialno', 'child_serialno', 'parent_function', 'child_function'])
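Note that not every root has a descendant for every function, but for a given root each function has only one serial number. The root serial numbers are known in advance.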
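The structure described above can be sanity-checked before any reshaping. The following is a minimal sketch, not part of the original question: it assumes that a root is any serial number that appears as a parent but never as a child, and that each serial number should carry exactly one function. The names `roots`, `pairs`, and `consistent` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [['001', '010', 'A', 'B'], ['001', '020', 'A', 'C'],
     ['010', '100', 'B', 'D'], ['100', '110', 'D', 'E'],
     ['002', '030', 'A', 'B'], ['002', '040', 'A', 'C']],
    columns=['parent_serialno', 'child_serialno',
             'parent_function', 'child_function'])

# Assumption: roots are serial numbers that are parents but never children.
roots = sorted(set(df['parent_serialno']) - set(df['child_serialno']))

# Stack the parent and child columns into one (serialno, function) table.
pairs = pd.DataFrame({
    'serialno': pd.concat([df['parent_serialno'], df['child_serialno']],
                          ignore_index=True),
    'function': pd.concat([df['parent_function'], df['child_function']],
                          ignore_index=True),
}).drop_duplicates()

# If a serial number appeared with two different functions, it would
# survive drop_duplicates twice and this would be False.
consistent = pairs['serialno'].is_unique
```

If the derived `roots` differ from the root serial numbers you already know, the edge list is probably incomplete or contains an unexpected link.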
The output I want looks like this DataFrame:
import numpy as np

pd.DataFrame([['001', '010', '020', '100', '110'], ['002', '030', '040', np.nan, np.nan]], columns=['A', 'B', 'C', 'D', 'E'])
Out[1]:
     A    B    C    D    E
0  001  010  020  100  110
1  002  030  040  NaN  NaN
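A related post shows how to build a dictionary hierarchy, but I care less about identifying where each leaf sits in the tree (i.e. grandchild or great-grandchild) and more about identifying the root and function of each leaf.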
You can use networkx to solve this problem:
# Python env: pip install networkx
# Anaconda env: conda install networkx
import networkx as nx
import pandas as pd

# Create (function, serialno) tuples to use as graph nodes
df['parent'] = df[['parent_function', 'parent_serialno']].apply(tuple, axis=1)
df['child'] = df[['child_function', 'child_serialno']].apply(tuple, axis=1)

# Create a directed graph from the dataframe
G = nx.from_pandas_edgelist(df, source='parent', target='child',
                            create_using=nx.DiGraph)

# Find roots (no incoming edge) and leaves (no outgoing edge)
roots = [node for node, degree in G.in_degree() if degree == 0]
leaves = [node for node, degree in G.out_degree() if degree == 0]

# Collect every node on every path from each root to each leaf
paths = {}
for root in roots:
    children = paths.setdefault(root, [])
    for leaf in leaves:
        for path in nx.all_simple_paths(G, root, leaf):
            children.extend(path[1:])
    # Sort by serial number; repeated nodes collapse when the dict is built
    children.sort(key=lambda x: x[1])

# Create the final output: one dict of {function: serialno} per root
out = pd.DataFrame([dict([parent] + children) for parent, children in paths.items()])
Output:
>>> out
     A    B    C    D    E
0  001  010  020  100  110
1  002  030  040  NaN  NaN
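Because only the root and function of each node matter, not its depth, the path enumeration can probably be replaced by a reachability query. The sketch below is a variation on the answer above, not the answer's own method: it assumes the same (function, serialno) node tuples and uses `nx.descendants` to collect everything reachable from each root. Sorting the columns by function name is an added choice so that the layout does not depend on set ordering.

```python
import networkx as nx
import pandas as pd

df = pd.DataFrame(
    [['001', '010', 'A', 'B'], ['001', '020', 'A', 'C'],
     ['010', '100', 'B', 'D'], ['100', '110', 'D', 'E'],
     ['002', '030', 'A', 'B'], ['002', '040', 'A', 'C']],
    columns=['parent_serialno', 'child_serialno',
             'parent_function', 'child_function'])

# Nodes are (function, serialno) tuples; each row of df is one edge.
G = nx.DiGraph()
G.add_edges_from(zip(
    zip(df['parent_function'], df['parent_serialno']),
    zip(df['child_function'], df['child_serialno'])))

# Roots are the nodes with no incoming edge.
roots = sorted(node for node, degree in G.in_degree() if degree == 0)

# One row per root: the root itself plus every node reachable from it.
rows = [dict([root, *nx.descendants(G, root)]) for root in roots]

# Functions missing under a root become NaN; sort columns by function name.
out = pd.DataFrame(rows).sort_index(axis=1)
print(out)
```

This relies on the stated guarantee that each function appears at most once under a given root. If that did not hold, `dict` would silently keep only one of the competing serial numbers.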