How to create a column in Python that lists each unique value of an item in a pandas dataset



I'm new to Python and need some help.

Suppose I have a dataset that looks like this:

    Serial Number         Source
0          AB100          Donatelle
1          AB200          Qure
2          AB100          Donatelle
3          AB200          Qure
4          AB100          Grand Avenue
5          AB200          Eagle Services
6          AB300          Donatelle
7          AB300          Donatelle
8          AB100          Qure
9          AB100          Eagle Services

I need to add a column, like this:

    Serial Number         Source           SN Data Sources
0          AB100          Donatelle        Donatelle, Grand Avenue, Qure, Eagle Services
1          AB200          Qure             Qure, Eagle Services
2          AB100          Donatelle        Donatelle, Grand Avenue, Qure, Eagle Services
3          AB200          Qure             Qure, Eagle Services
4          AB100          Grand Avenue     Donatelle, Grand Avenue, Qure, Eagle Services
5          AB200          Eagle Services   Qure, Eagle Services
6          AB300          Donatelle        Donatelle
7          AB300          Donatelle        Donatelle
8          AB100          Qure             Donatelle, Grand Avenue, Qure, Eagle Services
9          AB100          Eagle Services   Donatelle, Grand Avenue, Qure, Eagle Services

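My knowledge is still limited, so please bear with me.

I'm working with a 40k-row DataFrame, and I need to generate a column that contains, for each row, all the distinct sources for that row's serial number.

Can anyone help me? Thanks.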

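Use groupby(), agg(), and value_counts(), then merge(). The left merge keeps the rows in their original order. For this data, the sources in each group also come out in top-to-bottom order of first appearance, but note that value_counts() sorts by frequency (most common first), so that ordering is not guaranteed in general: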

import pandas as pd

df = pd.DataFrame(
    {'Serial Number': ['AB100', 'AB200', 'AB100', 'AB200', 'AB100', 'AB200', 'AB300', 'AB300', 'AB100', 'AB100'],
     'Source': ['Donatelle', 'Qure', 'Donatelle', 'Qure', 'Grand Avenue', 'Eagle Services', 'Donatelle', 'Donatelle',
                'Qure', 'Eagle Services']})
# Join the distinct sources of each serial number into one string,
# then merge that per-group string back onto every row.
df = df.merge(df.groupby('Serial Number').agg(lambda x: ', '.join(x.value_counts().keys())),
              how='left', on='Serial Number', suffixes=('', '2')).rename(columns={'Source2': 'SN Data Sources'})
print(df)

Prints:

  Serial Number          Source                                SN Data Sources
0         AB100       Donatelle  Donatelle, Grand Avenue, Qure, Eagle Services
1         AB200            Qure                           Qure, Eagle Services
2         AB100       Donatelle  Donatelle, Grand Avenue, Qure, Eagle Services
3         AB200            Qure                           Qure, Eagle Services
4         AB100    Grand Avenue  Donatelle, Grand Avenue, Qure, Eagle Services
5         AB200  Eagle Services                           Qure, Eagle Services
6         AB300       Donatelle                                      Donatelle
7         AB300       Donatelle                                      Donatelle
8         AB100            Qure  Donatelle, Grand Avenue, Qure, Eagle Services
9         AB100  Eagle Services  Donatelle, Grand Avenue, Qure, Eagle Services
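
If first-appearance order matters regardless of how often each source occurs, a variant without merge() may be worth considering. The following is a minimal sketch, not the answerer's method: it swaps value_counts() for Series.unique(), which returns distinct values in order of first appearance, and uses groupby().transform() to broadcast each group's string back to its rows. The helper name join_unique is hypothetical, and the sketch assumes every Source value is a string with no missing values.

```python
import pandas as pd

df = pd.DataFrame({
    "Serial Number": ["AB100", "AB200", "AB100", "AB200", "AB100",
                      "AB200", "AB300", "AB300", "AB100", "AB100"],
    "Source": ["Donatelle", "Qure", "Donatelle", "Qure", "Grand Avenue",
               "Eagle Services", "Donatelle", "Donatelle", "Qure", "Eagle Services"],
})


def join_unique(sources: pd.Series) -> str:
    """Hypothetical helper: join a group's distinct values, first seen first."""
    # Series.unique() keeps order of first appearance (it does not sort).
    return ", ".join(sources.unique())


# transform() broadcasts the single string returned for each group
# to every row of that group, so the row order of df is unchanged.
df["SN Data Sources"] = df.groupby("Serial Number")["Source"].transform(join_unique)
print(df)
```

Because transform() returns a result aligned with the original index, no merge or column renaming is needed afterwards.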

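You could group by the "Serial Number" column and apply list to the "Source" column.

Next, create a dictionary from the grouped df and convert it into a df.

Finally, merge the dfs together and clean up the columns.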

import pandas as pd

data = {
    "Serial Number": ["AB100", "AB200", "AB100", "AB200", "AB100", "AB200"],
    "Source": ["Donatelle", "Qure", "Grand Avenue", "Eagle Services", "Qure", "Grand Avenue"]
}
df = pd.DataFrame(data)
# One list of sources per serial number
grouped_df = df.groupby("Serial Number")["Source"].apply(list).reset_index()
mapping = grouped_df.set_index("Serial Number")["Source"].to_dict()
# Expand the dictionary back into one row per (serial number, source) pair
mapping_df = pd.DataFrame.from_dict(mapping, orient="index").unstack().reset_index()
final_df = pd.merge(
    grouped_df,
    mapping_df,
    left_on="Serial Number",
    right_on="level_1"
).rename(columns={0: "Source", "Source": "SN Data Sources"})[["Serial Number", "Source", "SN Data Sources"]]
print(final_df)

Output:
  Serial Number          Source                       SN Data Sources
0         AB100       Donatelle       [Donatelle, Grand Avenue, Qure]
1         AB100    Grand Avenue       [Donatelle, Grand Avenue, Qure]
2         AB100            Qure       [Donatelle, Grand Avenue, Qure]
3         AB200            Qure  [Qure, Eagle Services, Grand Avenue]
4         AB200  Eagle Services  [Qure, Eagle Services, Grand Avenue]
5         AB200    Grand Avenue  [Qure, Eagle Services, Grand Avenue]
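
The group-then-look-up idea above can also be sketched more directly. This is an illustrative variant rather than the answerer's code: it keeps the per-serial-number lists in a Series and uses Series.map() on the original frame instead of building a dictionary, a second DataFrame, and a merge. It also deduplicates each list with unique(), which is an added assumption about what the asker wants. The name sources_by_sn is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Serial Number": ["AB100", "AB200", "AB100", "AB200", "AB100", "AB200"],
    "Source": ["Donatelle", "Qure", "Grand Avenue", "Eagle Services", "Qure", "Grand Avenue"],
})

# One list of distinct sources per serial number, in order of first appearance.
sources_by_sn = df.groupby("Serial Number")["Source"].apply(lambda s: list(s.unique()))

# Look up each row's serial number in that Series (indexed by serial number).
# The rows of df stay in their original order.
df["SN Data Sources"] = df["Serial Number"].map(sources_by_sn)
print(df)
```

Unlike the merge-based version, this keeps the original row order rather than sorting rows by serial number, which may matter for a 40k-row frame that is processed further.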
