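Get all descendants of a parent from a pandas DataFrame parent-child table
I want to do something similar to the above, but I want the output to be more hierarchical rather than grouped by parent. So each child_id in turn becomes the parent_id, unless it has no children; in that case, go back and check the last parent_id.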
Current Output:
parent_id child_id
0 1000 2010
1 1000 2100
2 1000 2110
3 1000 3000
4 1000 3011
5 1000 3033
6 1000 3102
7 1000 3111
Preferred Output:
parent_id child_id
0 1000 2010
1 2010 3011
2 3011 3050
3 2010 3102
4 2010 4001
5 1000 3000
6 3000 3011
7 3011 3050
8 3000 3033
9 1000 3102
10 1000 3111
etc. etc.
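I came up with something. I don't know whether it is the best, fastest, or most efficient way, but it works.
The first step is to use the script above to create the parent-child relationships (if they don't already exist) and to add a column named level that describes the level of the part in the tree, with 0 being the top level. The remaining steps work from that table.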
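For reference, here is a minimal sketch of how such a level column could be computed. It is not the script referred to above; it assumes that the table holds one row per direct parent-child relation, that the relations contain no cycles, and that a part reachable at several depths should get its shallowest depth. The function name add_level is hypothetical.

```python
import pandas as pd

def add_level(edges: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `edges` with a 'level' column.

    'level' is the depth of parent_id below a root part (roots are level 0).
    Assumption: `edges` has 'parent_id' and 'child_id' columns, one row per
    direct relation, and no cycles.
    """
    children = set(edges['child_id'])
    # Roots are parents that never appear as a child.
    depth = {p: 0 for p in edges['parent_id'] if p not in children}
    frontier = list(depth)
    # Breadth-first walk: each pass assigns depths one level further down.
    while frontier:
        next_frontier = []
        for parent in frontier:
            for child in edges.loc[edges['parent_id'] == parent, 'child_id']:
                if child not in depth:
                    depth[child] = depth[parent] + 1
                    next_frontier.append(child)
        frontier = next_frontier
    out = edges.copy()
    out['level'] = out['parent_id'].map(depth)
    return out
```

Calling add_level(df) on a table with parent_id and child_id columns leaves the input untouched and returns the copy with the extra column.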
import pandas as pd

# this part of the script creates one row for each branch of the tree
dfsort = df[df['level'] == 0][['parent_id', 'child_id']].rename(columns={'parent_id': 'level 0', 'child_id': 'level 1'})
for i in sorted(df['level'].unique())[1:]:
    df1 = df[df['level'] == i][['parent_id', 'child_id']].rename(columns={'parent_id': f'level {i}', 'child_id': f'level {i+1}'})
    # attach the children found at this level to the branches built so far
    dfsort = pd.merge(dfsort, df1, how='left', on=[f'level {i}'])
# keep the level columns in numeric order ('level 10' must come after 'level 2')
dfsort = dfsort[sorted(dfsort.columns, key=lambda c: int(c.split()[-1]))]
# create a node column to drop duplicates on (in case the same parent-child relation is used under multiple higher-level parts)
# the left merges turn id columns with gaps into floats, so cast back to int before building the strings (assumes integer ids)
dfsort['Node'] = dfsort.apply(lambda row: [str(int(v)) for v in row if pd.notna(v)], axis=1)
# now that you have this relationship you can break it out in the correct order using another for loop
level_cols = [c for c in dfsort.columns if c != 'Node']
pairs = []
# collect the parent-child pairs from the table above one row (branch) at a time
for i in range(len(dfsort)):
    for l in range(len(level_cols) - 1):
        pair = dfsort.iloc[[i]][[f'level {l}', f'level {l+1}', 'Node']].rename(columns={f'level {l}': 'parent_id', f'level {l+1}': 'child_id'})
        pair['level'] = l
        pairs.append(pair)
dfsort2 = pd.concat(pairs, ignore_index=True)
# drop the pairs that run past the end of a shorter branch
dfsort2 = dfsort2[dfsort2['parent_id'].notna() & dfsort2['child_id'].notna()].copy()
# the ids were floats wherever a branch had gaps; cast back (assumes integer ids)
dfsort2 = dfsort2.astype({'parent_id': int, 'child_id': int})
# 'Query Node' is the path from the top-level part down to this child
dfsort2['Query Node'] = dfsort2.apply(lambda x: ",".join(x['Node'][:x['level'] + 2]), axis=1)
del dfsort2['Node']
dfsort2 = dfsort2.drop_duplicates()
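As a cross-check on the merge-based approach, the same ordering can also be produced with a plain depth-first walk over the direct parent-child relations. This is a sketch of an alternative technique, not the method above; the function name ordered_edges and the sample table are hypothetical, and it assumes one row per direct relation with children visited in the order they appear in the table.

```python
import pandas as pd

def ordered_edges(edges: pd.DataFrame, root) -> pd.DataFrame:
    """List parent-child pairs in depth-first order, starting at `root`."""
    # Map each parent to its children, keeping the table's row order.
    kids = {}
    for parent, child in zip(edges['parent_id'], edges['child_id']):
        kids.setdefault(parent, []).append(child)

    rows = []

    def walk(parent, path):
        for child in kids.get(parent, []):
            if child in path:
                continue  # skip cycles instead of recursing forever
            new_path = path + [child]
            rows.append((parent, child, len(path) - 1, ",".join(map(str, new_path))))
            walk(child, new_path)

    walk(root, [root])
    return pd.DataFrame(rows, columns=['parent_id', 'child_id', 'level', 'Query Node'])

# Hypothetical sample table of direct relations.
edges = pd.DataFrame({
    'parent_id': [1000, 1000, 1000, 1000, 2010, 2010, 2010, 3011, 3000, 3000],
    'child_id':  [2010, 3000, 3102, 3111, 3011, 3102, 4001, 3050, 3011, 3033],
})
result = ordered_edges(edges, 1000)
```

A shared sub-assembly such as 3011 is emitted once under every parent that uses it, and the Query Node column tells the repeated rows apart. The walk is recursive, so a very deep tree could hit Python's recursion limit.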