I'm new to Python and need some help.
Suppose I have a dataset that looks like this:
Serial Number Source
0 AB100 Donatelle
1 AB200 Qure
2 AB100 Donatelle
3 AB200 Qure
4 AB100 Grand Avenue
5 AB200 Eagle Services
6 AB300 Donatelle
7 AB300 Donatelle
8 AB100 Qure
9 AB100 Eagle Services
I need to add a column, like this:
Serial Number Source SN Data Sources
0 AB100 Donatelle Donatelle, Grand Avenue, Qure, Eagle Services
1 AB200 Qure Qure, Eagle Services
2 AB100 Donatelle Donatelle, Grand Avenue, Qure, Eagle Services
3 AB200 Qure Qure, Eagle Services
4 AB100 Grand Avenue Donatelle, Grand Avenue, Qure, Eagle Services
5 AB200 Eagle Services Qure, Eagle Services
6 AB300 Donatelle Donatelle
7 AB300 Donatelle Donatelle
8 AB100 Qure Donatelle, Grand Avenue, Qure, Eagle Services
9 AB100 Eagle Services Donatelle, Grand Avenue, Qure, Eagle Services
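My knowledge is still limited, so please bear with me.
I'm working with a DataFrame of about 40k rows, and I need to generate a column that lists, for each row, all the distinct sources recorded for that row's serial number.
Can anyone help? Thanks.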
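Use groupby() and agg() with value_counts(), then merge(). The left merge keeps the rows in their original Serial Number order. Note that value_counts() sorts by frequency, so the elements of Source come out in top-to-bottom order of appearance for this sample, but that is not guaranteed in general: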
import pandas as pd

df = pd.DataFrame(
    {'Serial Number': ['AB100', 'AB200', 'AB100', 'AB200', 'AB100', 'AB200', 'AB300', 'AB300', 'AB100', 'AB100'],
     'Source': ['Donatelle', 'Qure', 'Donatelle', 'Qure', 'Grand Avenue', 'Eagle Services', 'Donatelle', 'Donatelle',
                'Qure', 'Eagle Services']})
# Build one comma-separated string of sources per serial number,
# then left-merge it back onto every row and rename the new column.
df = df.merge(df.groupby('Serial Number').agg(lambda x: ', '.join(x.value_counts().keys())),
              how='left', on='Serial Number', suffixes=('', '2')).rename(columns={'Source2': 'SN Data Sources'})
print(df)
Prints:
Serial Number Source SN Data Sources
0 AB100 Donatelle Donatelle, Grand Avenue, Qure, Eagle Services
1 AB200 Qure Qure, Eagle Services
2 AB100 Donatelle Donatelle, Grand Avenue, Qure, Eagle Services
3 AB200 Qure Qure, Eagle Services
4 AB100 Grand Avenue Donatelle, Grand Avenue, Qure, Eagle Services
5 AB200 Eagle Services Qure, Eagle Services
6 AB300 Donatelle Donatelle
7 AB300 Donatelle Donatelle
8 AB100 Qure Donatelle, Grand Avenue, Qure, Eagle Services
9 AB100 Eagle Services Donatelle, Grand Avenue, Qure, Eagle Services
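Since value_counts() orders values by frequency, the listed order only matches order of appearance when the counts happen to agree with it. If first-appearance order is what matters, a minimal sketch of an alternative (not the approach above, and assuming a plain string column) is groupby().transform() with Series.unique(), which returns values in the order they first occur:

```python
import pandas as pd

df = pd.DataFrame({
    "Serial Number": ["AB100", "AB200", "AB100", "AB200", "AB100",
                      "AB200", "AB300", "AB300", "AB100", "AB100"],
    "Source": ["Donatelle", "Qure", "Donatelle", "Qure", "Grand Avenue",
               "Eagle Services", "Donatelle", "Donatelle", "Qure", "Eagle Services"],
})

# transform() broadcasts the single joined string of each group
# back onto every row of that group, so no merge is needed.
df["SN Data Sources"] = (
    df.groupby("Serial Number")["Source"]
      .transform(lambda s: ", ".join(s.unique()))
)
print(df)
```

This also avoids the merge and rename steps, which may be convenient on a 40k-row DataFrame.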
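You can group by the "Serial Number" column and apply list to the "Source" column.
Next, create a dictionary from the grouped df and convert it to a df.
Finally, merge the dfs together and clean up the columns.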
import pandas as pd

data = {
    "Serial Number": ["AB100", "AB200", "AB100", "AB200", "AB100", "AB200"],
    "Source": ["Donatelle", "Qure", "Grand Avenue", "Eagle Services", "Qure", "Grand Avenue"]
}
df = pd.DataFrame(data)
# One list of sources per serial number
grouped_df = df.groupby("Serial Number")["Source"].apply(list).reset_index()
mapping = grouped_df.set_index("Serial Number")["Source"].to_dict()
# Expand each list back into one row per (serial number, source) pair
mapping_df = pd.DataFrame.from_dict(mapping, orient="index").unstack().reset_index()
final_df = pd.merge(
    grouped_df,
    mapping_df,
    left_on="Serial Number",
    right_on="level_1"
).rename(columns={0: "Source", "Source": "SN Data Sources"})[["Serial Number", "Source", "SN Data Sources"]]
print(final_df)
Serial Number Source SN Data Sources
0 AB100 Donatelle [Donatelle, Grand Avenue, Qure]
1 AB100 Grand Avenue [Donatelle, Grand Avenue, Qure]
2 AB100 Qure [Donatelle, Grand Avenue, Qure]
3 AB200 Qure [Qure, Eagle Services, Grand Avenue]
4 AB200 Eagle Services [Qure, Eagle Services, Grand Avenue]
5 AB200 Grand Avenue [Qure, Eagle Services, Grand Avenue]
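Note that the result above is grouped by serial number rather than kept in the original row order, and the lists would contain repeats if a source occurred more than once for a serial number. As a rough sketch under those assumptions (the name sources_by_sn is only illustrative), the grouped lists can be de-duplicated and looked up per row with Series.map() instead of rebuilding and merging a second DataFrame:

```python
import pandas as pd

data = {
    "Serial Number": ["AB100", "AB200", "AB100", "AB200", "AB100", "AB200"],
    "Source": ["Donatelle", "Qure", "Grand Avenue", "Eagle Services", "Qure", "Grand Avenue"],
}
df = pd.DataFrame(data)

# One list per serial number; dict.fromkeys() drops repeats
# while keeping the order of first appearance.
sources_by_sn = (
    df.groupby("Serial Number")["Source"]
      .apply(lambda s: list(dict.fromkeys(s)))
)

# Look up each row's serial number in the grouped Series;
# the original rows and their order are left untouched.
df["SN Data Sources"] = df["Serial Number"].map(sources_by_sn)
print(df)
```

To get a comma-separated string instead of a list, the lambda could return ", ".join(dict.fromkeys(s)).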