我有一个如下所示的数据框架。此数据框包含名称和地址信息。
l_name f_name m_name city_state zip
smith sam auburn,wa 98910
smith sam c auburn,wa 98910
smith sam durham,nc 27659
smith sam eddie auburn,wa 98910
smith sam auburn,wa 98910
:
:
aaron joe chris ocala,fl 58910
aaron joe ocala,fl 58910
aaron joe durham,nc 27659
aaron joe ocala,fl 58910
aaron joe a ocala,fl 58910
我试图创建一个代码,可以组/标记他们到同一组,如果他们的l_name, f_name, city_state,和zip列都匹配。
这是我希望的输出数据帧。
l_name f_name m_name city_state zip tag
smith sam auburn,wa 98910 1
smith sam c auburn,wa 98910 1
smith sam durham,nc 27659 0
smith sam eddie auburn,wa 98910 1
smith sam auburn,wa 98910 1
clair may seattle,wa 98092 2
clair may a seattle,wa 98092 2
clair may seattle,wa 98092 2
:
:
aaron joe chris ocala,fl 58910 n
aaron joe ocala,fl 58910 n
aaron joe durham,nc 27659 0
aaron joe ocala,fl 58910 n
aaron joe a ocala,fl 58910 n
如果该人不属于任何组,它将给其分配一个0。Sam Smith符合条件,所以他们都在同一组,即1。因此,标签/组2将是may clair, joe aaron将是n组,这取决于数据框中存储了多少个名字。
我想知道是否有任何好的方法,方法或建议我创建一个代码,可以输出我想要的结果上面?
谢谢!
你可以用>1计数并获得组号(和+1),然后对于计数为1的组,设置tag = 0。然后连接。
df1 = df[df.groupby(['l_name', 'f_name', 'city_state', 'zip'])['m_name'].transform('count')>1].copy()
df1['tag'] = df1.groupby(['l_name', 'f_name', 'city_state', 'zip']).ngroup() + 1
df2 = df[df.groupby(['l_name', 'f_name', 'city_state', 'zip'])['m_name'].transform('count')==1].copy()
df2['tag'] = 0
df_final = pd.concat([df1,df2])
df_final
l_name f_name m_name city_state zip tag
0 smith sam x auburn,wa 98910 2
1 smith sam c auburn,wa 98910 2
3 smith sam eddie auburn,wa 98910 2
4 smith sam x auburn,wa 98910 2
5 aaron joe chris ocala,fl 58910 1
6 aaron joe x ocala,fl 58910 1
8 aaron joe x ocala,fl 58910 1
9 aaron joe a ocala,fl 58910 1
2 smith sam x durham,nc 27659 0
7 aaron joe x durham,nc 27659 0
您可以使用ngroup
的组合,然后计算其值:
# create a ID variable with a unique identifier for each group
groupvar = df.groupby(['l_name', 'f_name', 'city_state', 'zip']).ngroup(ascending=False) + 1
# assign it to the data
df = df.assign(tag=groupvar)
df
l_name f_name m_name city_state zip tag
0 smith sam auburn,wa 98910 2
1 smith sam c auburn,wa 98910 2
2 smith sam durham,nc 27659 1
3 smith sam eddie auburn,wa 98910 2
4 smith sam auburn,wa 98910 2
5 clair may seattle,wa 98092 3
6 clair may a seattle,wa 98092 3
7 clair may seattle,wa 98092 3
8 aaron joe chris ocala,fl 58910 4
9 aaron joe ocala,fl 58910 4
10 aaron joe durham,nc 27659 5
11 aaron joe ocala,fl 58910 4
12 aaron joe a ocala,fl 58910 4
# determine which IDs occur only once and assign 0 to them
df.loc[df['tag'].value_counts()[df['tag']].values == 1, 'tag'] = 0
df
l_name f_name m_name city_state zip tag
0 smith sam auburn,wa 98910 2
1 smith sam c auburn,wa 98910 2
2 smith sam durham,nc 27659 0
3 smith sam eddie auburn,wa 98910 2
4 smith sam auburn,wa 98910 2
5 clair may seattle,wa 98092 3
6 clair may a seattle,wa 98092 3
7 clair may seattle,wa 98092 3
8 aaron joe chris ocala,fl 58910 4
9 aaron joe ocala,fl 58910 4
10 aaron joe durham,nc 27659 0
11 aaron joe ocala,fl 58910 4
12 aaron joe a ocala,fl 58910 4