Create the same id for the same names in different dataframes in pandas



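I have a dataset with unique names. Another dataset contains several rows with the same names as the first dataset.

I want to create a column of unique ids in the first dataset, and a column in the second dataset that carries the same id for every row whose name matches a name in the first dataset.

For example: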

Dataframe 1:

player_id  Name
1          John Dosh
2          Michael Deesh
3          Julia Roberts

Dataframe 2:

player_id  Name
1          John Dosh
1          John Dosh
2          Michael Deesh
2          Michael Deesh
2          Michael Deesh
3          Julia Roberts
3          Julia Roberts

What I want to do is run Deep Feature Synthesis with Featuretools on the two dataframes, so that I can do something like this:

entity_set = ft.EntitySet("basketball_players")
entity_set.add_dataframe(
    dataframe_name="players_set",
    dataframe=players_set,
    index='name'
)
entity_set.add_dataframe(
    dataframe_name="season_stats",
    dataframe=season_stats,
    index='season_stats_id'
)

entity_set.add_relationship("players_set", "player_id", "season_stats", "player_id")

This should solve your problem:

import pandas as pd

# First frame: one row per unique name.
df1 = pd.DataFrame([
    'John Dosh',
    'Michael Deesh',
    'Julia Roberts'], columns=['Name'])
# Second frame: the same names, repeated.
df2 = pd.DataFrame([
    ['John Dosh'],
    ['John Dosh'],
    ['Michael Deesh'],
    ['Michael Deesh'],
    ['Michael Deesh'],
    ['Julia Roberts'],
    ['Julia Roberts']], columns=['Name'])
print('inputs:', '\n')
print(df1)
print(df2)
# Turn the 0-based row index into a 1-based 'id' column.
df1 = df1.reset_index().rename(columns={'index':'id'}).assign(id=df1.index + 1)
# Look up each name's id in df1 and move 'id' to the front.
df2 = df2.join(df1.set_index('Name'), on='Name')[['id'] + list(df2.columns)]
print('\noutputs:', '\n')
print(df1)
print(df2)

Input/output:

inputs:

            Name
0      John Dosh
1  Michael Deesh
2  Julia Roberts
            Name
0      John Dosh
1      John Dosh
2  Michael Deesh
3  Michael Deesh
4  Michael Deesh
5  Julia Roberts
6  Julia Roberts

outputs:

   id           Name
0   1      John Dosh
1   2  Michael Deesh
2   3  Julia Roberts
   id           Name
0   1      John Dosh
1   1      John Dosh
2   2  Michael Deesh
3   2  Michael Deesh
4   2  Michael Deesh
5   3  Julia Roberts
6   3  Julia Roberts
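
The same lookup can also be written with `Series.map`, which makes the name-to-id mapping explicit. This is only a minimal sketch, not part of the original answer: the helper name `add_player_ids` and its parameters are hypothetical, and it assumes every name in the first frame is unique.

```python
import pandas as pd


def add_player_ids(players, stats, name_col="Name", id_col="player_id"):
    """Return copies of both frames with a shared integer id per name.

    Hypothetical helper: assumes every name in `players` is unique.
    """
    players = players.copy()
    stats = stats.copy()
    # Ids start at 1 and follow the row order of `players`.
    players.insert(0, id_col, list(range(1, len(players) + 1)))
    # Build a name -> id lookup and apply it to the second frame.
    lookup = players.set_index(name_col)[id_col]
    stats.insert(0, id_col, stats[name_col].map(lookup))
    return players, stats


players = pd.DataFrame({"Name": ["John Dosh", "Michael Deesh", "Julia Roberts"]})
stats = pd.DataFrame(
    {"Name": ["John Dosh", "John Dosh", "Michael Deesh", "Julia Roberts"]}
)
players_with_id, stats_with_id = add_player_ids(players, stats)
print(players_with_id)
print(stats_with_id)
```

Names in the second frame that are missing from the first would come back as `NaN` here, so the id column would become floating point in that case.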

Update:

Another solution that gives the same result:

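# Run this on the original df1 and df2, before either has an 'id' column.
df1 = df1.assign(id=list(range(1, len(df1) + 1)))[['id'] + list(df1.columns)]
# merge joins on the shared 'Name' column by default.
df2 = df2.merge(df1)[['id'] + list(df2.columns)]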
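
One thing to keep in mind: `merge` defaults to an inner join, so rows in the second frame whose name does not appear in the first would be dropped silently. If that matters for your data, a stricter variant might look like the sketch below. It assumes you would rather keep such rows and list them; the name "Jane Doe" is made up purely for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"id": [1, 2, 3], "Name": ["John Dosh", "Michael Deesh", "Julia Roberts"]}
)
# "Jane Doe" is a made-up name that does not appear in df1.
df2 = pd.DataFrame({"Name": ["John Dosh", "Jane Doe", "Julia Roberts"]})

merged = df2.merge(
    df1,
    on="Name",
    how="left",              # keep every row of df2, matched or not
    validate="many_to_one",  # raise if a name is duplicated in df1
    indicator=True,          # add a '_merge' column describing each row's match
)
# Rows marked 'left_only' had no matching name in df1.
unmatched = merged.loc[merged["_merge"] == "left_only", "Name"].tolist()
print(unmatched)
```

Unmatched rows get `NaN` in the `id` column, so they would need to be handled before the column is used as a relationship key.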
