I have a DataFrame of email addresses and their domains, and a list of users numbered 1 to 5:
users = [1, 2, 3, 4, 5]
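I need to assign each unique domain to a user ID, and repeated occurrences of the same domain must always go to the same user ID. Any individual domain can go to any user ID, as long as the domains are spread somewhat evenly across the users.
My DataFrame: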
email first_name last_name domain
0 krusty@gmail.com Herschel Krustofsky gmail.com
1 bob@hotmail.com Robert Terwilliger hotmail.com
2 h.simpson@email.com Homer Simpson email.com
3 bsimpson@gmail.com Bart Simpson gmail.com
4 moe@moestavern.com Moe Szyslak moestavern.com
5 marge@simpson.net Marge Simpson simpson.net
6 lisa.simpson@sax.com Lisa Simpson sax.com
7 itchy@hotmail.com Itchy And hotmail.com
8 scratchy@work.net Scratchy Show work.net
9 maggie@hotmail.com Maggie Simpson hotmail.com
10 skinner@teacher.net Seymour Skinner teacher.net
My desired result:
email first_name last_name domain user_id
0 krusty@gmail.com Herschel Krustofsky gmail.com 1
1 bob@hotmail.com Robert Terwilliger hotmail.com 2
2 h.simpson@email.com Homer Simpson email.com 3
3 bsimpson@gmail.com Bart Simpson gmail.com 1
4 moe@moestavern.com Moe Szyslak moestavern.com 4
5 marge@simpson.net Marge Simpson simpson.net 5
6 lisa.simpson@sax.com Lisa Simpson sax.com 1
7 itchy@hotmail.com Itchy And hotmail.com 2
8 scratchy@work.net Scratchy Show work.net 3
9 maggie@hotmail.com Maggie Simpson hotmail.com 2
10 skinner@teacher.net Seymour Skinner teacher.net 4
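Simply incrementing the user ID may not be the best approach, since user 5 in my example seems to get few rows compared with the others?
First, get the unique domains as a DataFrame: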
unique = pd.DataFrame(df['domain'].drop_duplicates().reset_index(drop=True))
domain
0 gmail.com
1 hotmail.com
2 email.com
3 moestavern.com
4 simpson.net
5 sax.com
6 work.net
7 teacher.net
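Then, using numpy and the list of users, you can assign one of the 5 users to each domain (np.resize repeats the IDs cyclically until every domain has one):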
IDs = np.array([1, 2, 3, 4, 5])
unique['user_id'] = np.resize(IDs, unique.shape[0])
domain user_id
0 gmail.com 1
1 hotmail.com 2
2 email.com 3
3 moestavern.com 4
4 simpson.net 5
5 sax.com 1
6 work.net 2
7 teacher.net 3
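To see why this gives a round-robin assignment, here is a minimal self-contained sketch of the same step. The domain list is hard-coded as a stand-in for the `unique` frame built above, and `assigned` is just an illustrative name:

```python
import numpy as np
import pandas as pd

# Hard-coded stand-in for the unique-domain frame from the previous step.
unique = pd.DataFrame({"domain": ["gmail.com", "hotmail.com", "email.com",
                                  "moestavern.com", "simpson.net", "sax.com",
                                  "work.net", "teacher.net"]})
IDs = np.array([1, 2, 3, 4, 5])

# np.resize starts over from the beginning of IDs once it runs out,
# so the 6th domain gets user 1 again, the 7th gets user 2, and so on.
unique["user_id"] = np.resize(IDs, len(unique))
assigned = unique["user_id"].tolist()
print(assigned)  # [1, 2, 3, 4, 5, 1, 2, 3]
```

With 8 domains and 5 users, users 1 to 3 each get two domains and users 4 and 5 get one each.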
You can then merge this back into the original DataFrame to get the ID for each row:
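# merge returns a new DataFrame; how='left' keeps every row of df in its original order
df.merge(unique, on='domain', how='left')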
Or build a dictionary and use it with replace:
ids = dict(zip(unique['domain'], unique['user_id']))
df['user_id'] = df['domain'].replace(ids)
email first_name last_name domain user_id
0 krusty@gmail.com Herschel Krustofsky gmail.com 1
1 bob@hotmail.com Robert Terwilliger hotmail.com 2
2 h.simpson@email.com Homer Simpson email.com 3
3 bsimpson@gmail.com Bart Simpson gmail.com 1
4 moe@moestavern.com Moe Szyslak moestavern.com 4
5 marge@simpson.net Marge Simpson simpson.net 5
6 lisa.simpson@sax.com Lisa Simpson sax.com 1
7 itchy@hotmail.com Itchy And hotmail.com 2
8 scratchy@work.net Scratchy Show work.net 2
9 maggie@hotmail.com Maggie Simpson hotmail.com 2
10 skinner@teacher.net Seymour Skinner teacher.net 3
(This doesn't exactly match your example, so let me know if I've missed something.)
Full code:
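As an aside, the same round-robin assignment can probably be done without the intermediate frame by numbering the domains with `pd.factorize`. This is a sketch of an alternative, not part of the approach above; the sample `df` holds only the domain column from the question:

```python
import pandas as pd

users = [1, 2, 3, 4, 5]
df = pd.DataFrame({"domain": ["gmail.com", "hotmail.com", "email.com", "gmail.com",
                              "moestavern.com", "simpson.net", "sax.com",
                              "hotmail.com", "work.net", "hotmail.com", "teacher.net"]})

# factorize numbers each distinct domain 0, 1, 2, ... in order of first appearance,
# so every row with the same domain gets the same code.
codes, _ = pd.factorize(df["domain"])

# Wrap the codes around the user list so the IDs cycle through 1..5.
df["user_id"] = [users[c % len(users)] for c in codes]
print(df["user_id"].tolist())  # [1, 2, 3, 1, 4, 5, 1, 2, 2, 2, 3]
```

Indexing into `users` rather than computing `code % 5 + 1` keeps this working if the user IDs are not consecutive integers.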
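import numpy as np
import pandas as pd
# One row per distinct domain, in order of first appearance
unique = pd.DataFrame(df['domain'].drop_duplicates().reset_index(drop=True))
# Cycle through the user IDs until every domain has one
IDs = np.array([1, 2, 3, 4, 5])
unique['user_id'] = np.resize(IDs, unique.shape[0])
# Look up each row's domain to get its user ID
ids = dict(zip(unique['domain'], unique['user_id']))
df['user_id'] = df['domain'].replace(ids)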
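Note that cycling through the IDs evens out the number of domains per user, not the number of rows, so a user who draws a frequent domain can end up with more rows. If row counts matter, one possible variation (a sketch, not part of the answer above) is a greedy assignment: take domains from most to least frequent and give each to whichever user has the fewest rows so far. `assign_balanced` is a hypothetical helper name:

```python
import pandas as pd

def assign_balanced(domains: pd.Series, users: list) -> pd.Series:
    """Map each domain to one user, greedily balancing rows per user."""
    load = {u: 0 for u in users}   # rows assigned to each user so far
    mapping = {}                   # domain -> user
    # value_counts() lists domains from most to least frequent.
    for domain, n_rows in domains.value_counts().items():
        user = min(users, key=lambda u: load[u])  # least-loaded user; ties go to the earliest
        mapping[domain] = user
        load[user] += n_rows
    return domains.map(mapping)

users = [1, 2, 3, 4, 5]
df = pd.DataFrame({"domain": ["gmail.com", "hotmail.com", "email.com", "gmail.com",
                              "moestavern.com", "simpson.net", "sax.com",
                              "hotmail.com", "work.net", "hotmail.com", "teacher.net"]})
df["user_id"] = assign_balanced(df["domain"], users)
rows_per_user = df["user_id"].value_counts()
print(rows_per_user.sort_index())
```

On the sample data this gives one user three rows and the other four users two rows each. Which single-row domain lands on which user depends on how `value_counts` orders ties, so the exact mapping may vary.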