我有三个网络节点。每个都有一个数据帧。在这些数据帧中,其中一个具有两个ID。另外两个各有一个。示例如下:
df(node0(:
| id1 | id2 | col1| col2|
| :-: | :-: | :-: | :-: |
| 1 | A | h | i |
| 2 | B | j | k |
| 3 | C | l | r |
| 4 | D | v | d |
| 5 | E | x | y |
| 6 | F | w | p |
df1(node1(:
| id1 | col3| col4|
| :-: | :-: | :-: |
| 1 | k | h |
| 2 | d | j |
| 3 | e | l |
df2(node2(:
| id2 | col5| col6|
| :-: | :-: | :-: |
| B | k | h |
| C | d | j |
| D | d | j |
| E | e | l |
*请注意,所有col值都无关紧要,可以是任何值。仅关注ID
现在,我想以一种方式合并这三个数据帧:
将df与df1和df2相交,分别具有ID。即df:id1-df1:id1,df:id2-df2:id2
取决于输入、步骤1的交集或并集结果。以生成每个节点的最终结果。在这种情况下:
交叉:
node0:
| id1 | id2 | col1| col2|
| :-: | :-: | :-: | :-: |
| 2 | B | j | k |
| 3 | C | l | r |
节点1:
| id1 | col3| col4|
| :-: | :-: | :-: |
| 2 | d | j |
| 3 | e | l |
节点2:
| id2 | col5| col6|
| :-: | :-: | :-: |
| B | k | h |
| C | d | j |
UNION
node0:
| id1 | id2 | col1| col2|
| :-: | :-: | :-: | :-: |
| 1 | A | h | i |
| 2 | B | j | k |
| 3 | C | l | r |
| 4 | D | v | d |
| 5 | E | x | y |
节点1:
| id1 | col3| col4|
| :-: | :-: | :-: |
| 1 | k | h |
| 2 | d | j |
| 3 | e | l |
| 4 | | |
| 5 | | |
节点2:
| id2 | col5| col6|
| :-: | :-: | :-: |
| A | | |
| B | k | h |
| C | d | j |
| D | d | j |
| E | e | l |
现在,UNION的结果可能看起来很奇怪。但请记住,始终先将基df和其他两个相交。然后将结果进行交集或并集。
最后,在联合中为这些ID填充None是在消耗,但不是在原始的相应数据帧中。
在某些情况下,基本df(node0(的id可能少于df1或df2。其中的想法是相同的,您在基本df中填充None。
重要提示:无论相交或并集,三个输出总是有相同的行,但可以是不同的列。秩序不容破坏
我尝试了几种没有水果的方法。有时秩序混乱。
示例代码:
import pandas as pd
# INTERSECT or UNION
inputs = "UNION"
if inputs == "INTERSECT":
merge_method = 'inner'
elif inputs == "UNION":
merge_method = 'outer'
else:
raise ValueError("inputs must be "INTERSECT" or "UNION" ")
d = {'id1': [1, 2, 3, 4, 5, 6],
'id2': ['A', 'B', 'C', 'D', 'E', 'F'],
'col1': ['h', 'j', 'l', 'v', 'x', 'w'],
'col2': ['i', 'k', 'r', 'd', 'y', 'p']}
'''
# Case where df has less ids than df1 or df2
d = {'id1': [1, 2],
'id2': ['A', 'B'],
'col1': ['h', 'j'],
'col2': ['i', 'k']}
'''
d1 = {'id1': [1, 2, 3],
'col3': ['k', 'd', 'e'],
'col4': ['h', 'j', 'l']}
d2 = {'id2': ['B', 'C', 'D', 'E'],
'col5': ['k', 'd', 'd', 'e'],
'col6': ['h', 'j', 'j', 'l']}
df = pd.DataFrame(d)
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
# Expend base df if it has less ids than df1 or df2
df0 = pd.merge(df, df1, how='outer', on='id1')[df.columns.tolist()]
df0 = pd.merge(df0, df2, how='outer', on='id2')[df.columns.tolist()]
# Intersect df with df1 and df2 with there respective ID. i.e. df: id1 - df1: id1, df: id2 - df2: id2
int_df1 = pd.merge(df0, df1, how='inner', on='id1')[['id1', 'id2']]
int_df2 = pd.merge(df0, df2, how='inner', on='id2')[['id1', 'id2']]
print("df and df1 intersection: n{}ndf and df2 intersection: n{}n".format(int_df1, int_df2))
# Depend on the input, Intersect or Union results from step 1. to produce final results for EACH NODES
merge_df = pd.merge(int_df1, int_df2, how=merge_method, on=['id1', 'id2'])
df_out = pd.merge(merge_df, df0, how='inner', on=['id1', 'id2'])
df1_out = pd.merge(merge_df['id1'], df1, how=merge_method, on='id1')
df2_out = pd.merge(merge_df['id2'], df2, how=merge_method, on='id2')
# Filling None in the union for these IDs are expending but not in original respective dataframe
mask = (~df_out['id1'].isin(df['id1'])) | (~df_out['id2'].isin(df['id2']))
df_out.loc[mask] = df_out.loc[mask].fillna("None")
mask = ~df1_out['id1'].isin(df1['id1'])
df1_out.loc[mask] = df1_out.loc[mask].fillna("None")
mask = ~df2_out['id2'].isin(df2['id2'])
df2_out.loc[mask] = df2_out.loc[mask].fillna("None")
print("merge_df: n{}nnode0: n{}nnode1: n{}nnode2: n{}n".format(merge_df, df_out, df1_out, df2_out))
INTERSECT的输出:
df and df1 intersection:
id1 id2
0 1 A
1 2 B
2 3 C
df and df2 intersection:
id1 id2
0 2 B
1 3 C
2 4 D
3 5 E
merge_df:
id1 id2
0 2 B
1 3 C
node0:
id1 id2 col1 col2
0 2 B j k
1 3 C l r
node1:
id1 col3 col4
0 2 d j
1 3 e l
node2:
id2 col5 col6
0 B k h
1 C d j
UNION的输出:
df and df1 intersection:
id1 id2
0 1 A
1 2 B
2 3 C
df and df2 intersection:
id1 id2
0 2 B
1 3 C
2 4 D
3 5 E
merge_df:
id1 id2
0 1 A
1 2 B
2 3 C
3 4 D
4 5 E
node0:
id1 id2 col1 col2
0 1 A h i
1 2 B j k
2 3 C l r
3 4 D v d
4 5 E x y
node1:
id1 col3 col4
0 1 k h
1 2 d j
2 3 e l
3 4 None None
4 5 None None
node2:
id2 col5 col6
0 A None None
1 B k h
2 C d j
3 D d j
4 E e l