Join 2 dataframes and create a parent-child relationship?



I have 2 dataframes, a parent and a child, and I want to join the two, grouped by parent.

df_parent

parent  parent_value
0   Super Sun             0
1  Alpha Mars             4
2       Pluto             9

df_child

child  value
0         Planet Sun    100
1  one Sun direction    101
2     Ice Pluto Tune    101
3       Life on Mars     99
4         Mars Robot    105
5          Sun Twins    200 
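
To reproduce the problem, the two frames can be rebuilt roughly as follows. This is a minimal sketch: the values are copied from the printed frames above, and the integer dtypes are an assumption.

```python
import pandas as pd

# Values copied from the printed frames above; dtypes are assumed to be plain integers.
df_parent = pd.DataFrame({
    'parent': ['Super Sun', 'Alpha Mars', 'Pluto'],
    'parent_value': [0, 4, 9],
})
df_child = pd.DataFrame({
    'child': ['Planet Sun', 'one Sun direction', 'Ice Pluto Tune',
              'Life on Mars', 'Mars Robot', 'Sun Twins'],
    'value': [100, 101, 101, 99, 105, 200],
})

print(df_parent)
print(df_child)
```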

I want the output ordered according to order = ['Sun', 'Pluto', 'Mars']

Sun
-children
Pluto
-children
Mars
-children

I want to find the children by keyword; see parent_dict

parent_dict = {'Super Sun': 'Sun',
'Alpha Mars': 'Mars',
'Pluto': 'Pluto'}

Expected output

child         value
0   Super Sun             0 # parent
1   Planet Sun          100 # child  
2   one Sun direction   101 # child   
3   Sun Twins           200 # child  
4   Pluto                 9 # parent
5   Ice Pluto Tune      101 # child       
6   Alpha Mars            4 # parent
7   Life on Mars         99 # child    
8   Mars Robot          105 # child    

So far I have tried iterating over the main list and both dfs, but I am not getting the expected output. This is my code:

output_df = pd.DataFrame()
for o in order:
    key = o
    for j, row in df_parent.iterrows():
        if key in row[0]:
            output_df.at[j, 'parent'] = key
            output_df.at[j, 'value'] = row[1]
            for k, row1 in df_child.iterrows():
                if key in row1[0]:
                    output_df.at[j, 'parent'] = key
                    output_df.at[j, 'value'] = row[1]
print(output_df)

Output:

parent  value
0    Sun    0.0
2  Pluto    9.0
1   Mars    4.0

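After some preparation, you can append one dataframe to the other. First create a keyword column in both df_parent and df_child, to be used for sorting later. For this you can use np.select: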

import numpy as np  # pd.np is no longer available in current pandas
import pandas as pd
order = ['Sun', 'Pluto', 'Mars']
# one boolean mask per keyword; np.select picks the first keyword whose mask is True
condlist_parent = [df_parent['parent'].str.contains(word) for word in order]
df_parent['keyword'] = np.select(condlist=condlist_parent, choicelist=order, default=None)
condlist_child = [df_child['child'].str.contains(word) for word in order]
df_child['keyword'] = np.select(condlist=condlist_child, choicelist=order, default=None)

For example, df_parent becomes:

parent  parent_value keyword
0   Super Sun             0     Sun
1  Alpha Mars             4    Mars
2       Pluto             9   Pluto
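
As a small illustration of how np.select assigns the keyword, here is a sketch on a standalone Series. The value 'Venus' is made up to show the fallback to default, and regex=False is an added assumption so that each keyword is matched as a plain substring.

```python
import numpy as np
import pandas as pd

order = ['Sun', 'Pluto', 'Mars']
# 'Venus' is a hypothetical non-matching value, not part of the original data.
names = pd.Series(['Super Sun', 'Alpha Mars', 'Pluto', 'Venus'])

# One boolean mask per keyword, in the same order as `order`.
condlist = [names.str.contains(word, regex=False).to_numpy() for word in order]

# For each row, np.select takes the choice belonging to the first mask that is True
# and falls back to `default` when no mask matches.
keyword = np.select(condlist, order, default=None)
print(keyword.tolist())
```

Rows that match none of the keywords get None, which becomes a missing value once the column is converted to a Categorical.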

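Now you can append the dataframes and use pd.Categorical to sort the result according to the list order. The rename makes the columns match the expected output and is also needed for the append to work as intended (the columns should have the same names in both dataframes). Note that DataFrame.append was removed in pandas 2.0; pd.concat does the same job.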

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df_all = pd.concat([df_parent.rename(columns={'parent': 'child', 'parent_value': 'value'}),
                    df_child],
                   ignore_index=True)
# make keyword an ordered categorical so that it sorts in the order of the list order
df_all['keyword'] = pd.Categorical(df_all['keyword'], ordered=True, categories=order)
# stable sort by keyword (keeps each parent before its children),
# then reset the index and drop the helper column
df_output = (df_all.sort_values('keyword', kind='mergesort')
                   .reset_index(drop=True)
                   .drop(columns='keyword'))  # the last two calls are cosmetic

The output is then:

child  value
0          Super Sun      0
1         Planet Sun    100
2  one Sun direction    101
3          Sun Twins    200
4              Pluto      9
5     Ice Pluto Tune    101
6         Alpha Mars      4
7       Life on Mars     99
8         Mars Robot    105

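Note: each parent ends up before its children after sorting by keyword because df_child is appended to df_parent, and not the other way round. This relies on the sort keeping rows with the same keyword in their original order.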
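
To see what the ordered Categorical changes, here is a minimal sketch on a toy frame. The 'label' column and its values are illustrative only, and kind='mergesort' is used so that ties keep their original order.

```python
import pandas as pd

order = ['Sun', 'Pluto', 'Mars']
# 'label' is an illustrative column used only to track the rows.
df = pd.DataFrame({'keyword': ['Mars', 'Sun', 'Pluto', 'Sun'],
                   'label': ['a', 'b', 'c', 'd']})

# Plain strings sort alphabetically: Mars, Pluto, Sun, Sun.
alphabetical = df.sort_values('keyword', kind='mergesort')['label'].tolist()

# An ordered categorical sorts by position in `order` instead: Sun, Sun, Pluto, Mars.
df['keyword'] = pd.Categorical(df['keyword'], categories=order, ordered=True)
by_order = df.sort_values('keyword', kind='mergesort')['label'].tolist()

print(alphabetical)  # ['a', 'c', 'b', 'd']
print(by_order)      # ['b', 'd', 'c', 'a']
```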

Here is a solution that iterates over both dataframes, but it seems like a very long-winded process:

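output_df = pd.DataFrame()
c = 0  # next row position in output_df
for o in order:
    key = o
    for j, row in df_parent.iterrows():
        if key in row['parent']:
            # write the matching parent first
            output_df.at[c, 'parent'] = row['parent']
            output_df.at[c, 'value'] = row['parent_value']
            c += 1
            # then every child whose name contains the same keyword
            for k, row1 in df_child.iterrows():
                if key in row1['child']:
                    output_df.at[c, 'parent'] = row1['child']
                    output_df.at[c, 'value'] = row1['value']
                    c += 1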

Output:

parent  value
0          Super Sun    0.0
1         Planet Sun  100.0
2  one Sun direction  101.0
3          Sun Twins  200.0
4              Pluto    9.0
5     Ice Pluto Tune  101.0
6         Alpha Mars    4.0
7       Life on Mars   99.0
8         Mars Robot  105.0
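
The same loop idea can probably be shortened by filtering whole frames per keyword and concatenating once at the end. This is a sketch, not the original poster's code: the frames are rebuilt from the question, regex=False is an added assumption, and the result uses the column name child rather than parent.

```python
import pandas as pd

df_parent = pd.DataFrame({'parent': ['Super Sun', 'Alpha Mars', 'Pluto'],
                          'parent_value': [0, 4, 9]})
df_child = pd.DataFrame({'child': ['Planet Sun', 'one Sun direction', 'Ice Pluto Tune',
                                   'Life on Mars', 'Mars Robot', 'Sun Twins'],
                         'value': [100, 101, 101, 99, 105, 200]})
order = ['Sun', 'Pluto', 'Mars']

pieces = []
for key in order:
    # Rows whose name contains the keyword as a plain substring.
    parents = df_parent[df_parent['parent'].str.contains(key, regex=False)]
    children = df_child[df_child['child'].str.contains(key, regex=False)]
    # Parent rows first, renamed so that both pieces share the same columns.
    pieces.append(parents.rename(columns={'parent': 'child', 'parent_value': 'value'}))
    pieces.append(children)

output_df = pd.concat(pieces, ignore_index=True)
print(output_df)
```

Building the result with a single pd.concat avoids growing the frame cell by cell with .at, and the values keep their integer dtype.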

Consider concatenating the dataframes and sorting by a keyword lookup:

import pandas as pd
order = ['Sun', 'Pluto', 'Mars']
def find_keyword(str_param):
    output = None
    # ITERATE THROUGH LIST AND RETURN MATCHING POSITION
    for i, v in enumerate(order):
        if v in str_param:
            output = i
    return output
# RENAME COLS AND CONCAT DFs
df_combined = pd.concat([df_parent.rename(columns={'parent': 'item', 'parent_value': 'value'}),
                         df_child.rename(columns={'child': 'item'})],
                        ignore_index=True)
# CREATE KEYWORD COL WITH DEFINED FUNCTION
df_combined['keyword'] = df_combined['item'].apply(find_keyword)
# SORT BY KEYWORD AND DROP HELPER COL (PARENTHESES ALLOW THE CHAIN TO SPAN LINES)
df_combined = (df_combined.sort_values(['keyword', 'value'])
                          .drop(columns=['keyword']).reset_index(drop=True))
print(df_combined)
#                 item  value
# 0          Super Sun      0
# 1         Planet Sun    100
# 2  one Sun direction    101
# 3          Sun Twins    200
# 4              Pluto      9
# 5     Ice Pluto Tune    101
# 6         Alpha Mars      4
# 7       Life on Mars     99
# 8         Mars Robot    105
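
A variant that makes use of parent_dict from the question might look like the sketch below. It is an illustration under assumptions: the frames are rebuilt from the question, and the helper column rank and the regex pattern are names introduced here, not part of any answer above.

```python
import re
import pandas as pd

df_parent = pd.DataFrame({'parent': ['Super Sun', 'Alpha Mars', 'Pluto'],
                          'parent_value': [0, 4, 9]})
df_child = pd.DataFrame({'child': ['Planet Sun', 'one Sun direction', 'Ice Pluto Tune',
                                   'Life on Mars', 'Mars Robot', 'Sun Twins'],
                         'value': [100, 101, 101, 99, 105, 200]})
order = ['Sun', 'Pluto', 'Mars']
parent_dict = {'Super Sun': 'Sun', 'Alpha Mars': 'Mars', 'Pluto': 'Pluto'}

# Parents: look the keyword up directly in parent_dict.
parents = df_parent.rename(columns={'parent': 'item', 'parent_value': 'value'})
parents['keyword'] = parents['item'].map(parent_dict)

# Children: extract the first keyword that occurs in the name (NaN if none does).
pattern = '(' + '|'.join(re.escape(k) for k in order) + ')'
children = df_child.rename(columns={'child': 'item'})
children['keyword'] = children['item'].str.extract(pattern, expand=False)

# Parents are concatenated first, so a stable sort keeps them ahead of their children.
combined = pd.concat([parents, children], ignore_index=True)
combined['rank'] = combined['keyword'].map({k: i for i, k in enumerate(order)})
combined = (combined.sort_values('rank', kind='mergesort')
                    .drop(columns=['keyword', 'rank'])
                    .reset_index(drop=True))
print(combined)
```

Unlike sorting on ['keyword', 'value'], this does not depend on each parent having a smaller value than its children; the parent-first order comes from the concatenation order and the stable sort.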
