Pandas: assign/delegate unique values in a DataFrame column to users from a list



I have a DataFrame of email addresses and their domains. I also have a list of users, 1-5:

users = [1, 2, 3, 4, 5]

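I need to assign each unique domain to a user id, and I need to make sure that multiple occurrences of the same domain are always assigned to the same user id. A domain that appears only once can go to any user id, as long as the domains are distributed somewhat evenly across the users.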

My DataFrame:

   email                  first_name    last_name     domain
0  krusty@gmail.com       Herschel      Krustofsky    gmail.com       
1  bob@hotmail.com        Robert        Terwilliger   hotmail.com     
2  h.simpson@email.com    Homer         Simpson       email.com       
3  bsimpson@gmail.com     Bart          Simpson       gmail.com       
4  moe@moestavern.com     Moe           Szyslak       moestavern.com   
5  marge@simpson.net      Marge         Simpson       simpson.net     
6  lisa.simpson@sax.com   Lisa          Simpson       sax.com         
7  itchy@hotmail.com      Itchy         And           hotmail.com     
8  scratchy@work.net      Scratchy      Show          work.net        
9  maggie@hotmail.com     Maggie        Simpson       hotmail.com     
10 skinner@teacher.net    Seymour       Skinner       teacher.net     

The result I want:

   email                  first_name    last_name     domain           user_id
0  krusty@gmail.com       Herschel      Krustofsky    gmail.com        1
1  bob@hotmail.com        Robert        Terwilliger   hotmail.com      2
2  h.simpson@email.com    Homer         Simpson       email.com        3
3  bsimpson@gmail.com     Bart          Simpson       gmail.com        1
4  moe@moestavern.com     Moe           Szyslak       moestavern.com   4
5  marge@simpson.net      Marge         Simpson       simpson.net      5
6  lisa.simpson@sax.com   Lisa          Simpson       sax.com          1
7  itchy@hotmail.com      Itchy         And           hotmail.com      2
8  scratchy@work.net      Scratchy      Show          work.net         3
9  maggie@hotmail.com     Maggie        Simpson       hotmail.com      2
10 skinner@teacher.net    Seymour       Skinner       teacher.net      4

Incrementing through the user ids may not be the best approach, since in my example user 5 seems to end up with comparatively little?

First, get the unique domains as a DataFrame:

unique = pd.DataFrame(df['domain'].drop_duplicates().reset_index(drop=True))
           domain
0       gmail.com
1     hotmail.com
2       email.com
3  moestavern.com
4     simpson.net
5         sax.com
6        work.net
7     teacher.net

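Then, using numpy and the list of users, you can assign one of the 5 users to each domain. np.resize repeats the array of ids cyclically until there is one entry per unique domain: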

IDs = np.array([1, 2, 3, 4, 5])
unique['user_id'] = np.resize(IDs, unique.shape[0])
           domain  user_id
0       gmail.com        1
1     hotmail.com        2
2       email.com        3
3  moestavern.com        4
4     simpson.net        5
5         sax.com        1
6        work.net        2
7     teacher.net        3
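
As a more compact variant of the same round-robin idea, here is a minimal sketch that skips the intermediate `unique` frame. It assumes `pd.factorize` is acceptable for numbering the domains in order of first appearance; the `codes` and `users` names are only illustrative.

```python
import numpy as np
import pandas as pd

# Only the domain column from the question is needed for the assignment
df = pd.DataFrame({
    "domain": ["gmail.com", "hotmail.com", "email.com", "gmail.com",
               "moestavern.com", "simpson.net", "sax.com", "hotmail.com",
               "work.net", "hotmail.com", "teacher.net"]
})
users = np.array([1, 2, 3, 4, 5])

# factorize labels each distinct domain 0, 1, 2, ... in order of first appearance,
# so repeated domains always share the same code
codes, _ = pd.factorize(df["domain"])

# Wrap the codes around the user list: domain k goes to users[k % 5]
df["user_id"] = users[codes % len(users)]
print(df)
```

On the sample data this should give the same user ids as the np.resize approach, because both walk the domains in order of first appearance.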

You can then merge on this to get the id for each row:

df.merge(unique, on='domain')
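
Note that merge returns a new DataFrame rather than modifying df, so the result has to be assigned back. A small sketch on a cut-down version of the data, assuming you also want a left join and a sanity check that each domain appears only once in the lookup table (the `how` and `validate` arguments are optional additions, not part of the line above):

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["krusty@gmail.com", "bob@hotmail.com", "bsimpson@gmail.com"],
    "domain": ["gmail.com", "hotmail.com", "gmail.com"],
})
unique = pd.DataFrame({
    "domain": ["gmail.com", "hotmail.com"],
    "user_id": [1, 2],
})

# how="left" keeps every row of df, in its original order;
# validate="many_to_one" raises if a domain is duplicated in `unique`
df = df.merge(unique, on="domain", how="left", validate="many_to_one")
print(df)
```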

Or use a dictionary with replace:

ids = {unique.loc[i, 'domain']:unique.loc[i, 'user_id'] for i in range(len(unique))}
df['user_id'] = df['domain'].replace(ids)

   email                  first_name    last_name     domain           user_id
0  krusty@gmail.com       Herschel      Krustofsky    gmail.com        1
1  bob@hotmail.com        Robert        Terwilliger   hotmail.com      2
2  h.simpson@email.com    Homer         Simpson       email.com        3
3  bsimpson@gmail.com     Bart          Simpson       gmail.com        1
4  moe@moestavern.com     Moe           Szyslak       moestavern.com   4
5  marge@simpson.net      Marge         Simpson       simpson.net      5
6  lisa.simpson@sax.com   Lisa          Simpson       sax.com          1
7  itchy@hotmail.com      Itchy         And           hotmail.com      2
8  scratchy@work.net      Scratchy      Show          work.net         2
9  maggie@hotmail.com     Maggie        Simpson       hotmail.com      2
10 skinner@teacher.net    Seymour       Skinner       teacher.net      3

(This doesn't exactly match your example, so let me know if I've missed something.)
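
If you prefer a plain lookup over replace, a possible alternative is Series.map with a dict built from the two columns. This is only a sketch on a shortened version of the data. One behavioural difference to be aware of: map gives NaN for a domain missing from the dict, while replace leaves the original string in place.

```python
import pandas as pd

unique = pd.DataFrame({
    "domain": ["gmail.com", "hotmail.com", "email.com"],
    "user_id": [1, 2, 3],
})
df = pd.DataFrame({
    "domain": ["gmail.com", "hotmail.com", "email.com", "gmail.com"],
})

# Build the domain -> user_id lookup without looping over positions
ids = dict(zip(unique["domain"], unique["user_id"]))

# Look up each row's domain in the dict
df["user_id"] = df["domain"].map(ids)
print(df)
```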

Full code:

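import numpy as np
import pandas as pd
# One row per distinct domain, in order of first appearance
unique = pd.DataFrame(df['domain'].drop_duplicates().reset_index(drop=True))
IDs = np.array([1, 2, 3, 4, 5])
# Repeat the ids cyclically so every unique domain gets one
unique['user_id'] = np.resize(IDs, unique.shape[0])
# Map each row's domain to its assigned user id
ids = {unique.loc[i, 'domain']:unique.loc[i, 'user_id'] for i in range(len(unique))}
df['user_id'] = df['domain'].replace(ids)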
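
The round-robin above balances the number of domains per user, not the number of rows. If the concern in the question is about rows per user, one possible approach (not what the code above does) is a greedy assignment: take the domains from most to least frequent and give each one to whichever user currently has the fewest rows. A minimal sketch, where `load` and `assignment` are made-up names for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "domain": ["gmail.com", "hotmail.com", "email.com", "gmail.com",
               "moestavern.com", "simpson.net", "sax.com", "hotmail.com",
               "work.net", "hotmail.com", "teacher.net"]
})
users = [1, 2, 3, 4, 5]

# Rows per domain, most frequent first
counts = df["domain"].value_counts()

load = {u: 0 for u in users}   # rows assigned to each user so far
assignment = {}                # domain -> user id

for domain, n in counts.items():
    # Pick the user with the fewest rows; ties go to the earliest user in the list
    u = min(load, key=load.get)
    assignment[domain] = u
    load[u] += n

df["user_id"] = df["domain"].map(assignment)
print(load)
```

On the sample data this leaves one user with 3 rows and the other four with 2 each, compared with 3, 4, 2, 1 and 1 rows for users 1 to 5 under the round-robin. Which user gets which single-occurrence domain depends on how value_counts orders ties, so the exact ids may differ between pandas versions.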
