Removing duplicate values when grouping by a column in a pandas DataFrame



Given the input DataFrame

Desired output

I can get close with groupby, using df_ = (df.groupby("entity_label", sort=True)["entity_text"].apply(tuple).reset_index(name="entity_text")), but duplicates still remain in the output tuples.

Before applying tuple, you can use SeriesGroupBy.unique() to get the unique values of entity_text within each group, as follows:

(df.groupby("entity_label", sort=False)["entity_text"]
.unique()
.apply(tuple)
.reset_index(name="entity_text")
)

Result:

entity_label                                                      entity_text
0    job_title  (Full Stack Developer, Senior Data Scientist, Python Developer)
1      country                                     (India, Malaysia, Australia)
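
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same unique() approach. The question's input data is not shown here, so the sample DataFrame below is an assumption made up to match the result above; the variable name result is also just illustrative.

```python
import pandas as pd

# Assumed sample data (the original input is not shown in the question).
df = pd.DataFrame({
    "entity_label": ["job_title", "job_title", "job_title", "job_title",
                     "country", "country", "country", "country"],
    "entity_text": ["Full Stack Developer", "Senior Data Scientist",
                    "Python Developer", "Python Developer",
                    "India", "Malaysia", "Australia", "Australia"],
})

result = (
    df.groupby("entity_label", sort=False)["entity_text"]
    .unique()        # distinct values per group, in order of first appearance
    .apply(tuple)    # convert each group's array of values to a tuple
    .reset_index(name="entity_text")
)
print(result)
```

With sort=False the groups keep the order in which they first appear in the input; sort=True, as in the question, would order them by label instead.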

Try this:

import pandas as pd
df = pd.DataFrame({'entity_label':["job_title", "job_title","job_title","job_title", "country", "country", "country", "country", "country"],
                   'entity_text':["full stack developer", "senior data scientiest","python developer","python developer", "Inida", "Malaysia", "India", "Australia", "Australia"],})
# Drop exact duplicate rows first so each value is joined only once
df.drop_duplicates(inplace=True)
# Replace every row's text with the comma-joined values of its group
df['entity_text'] = df.groupby('entity_label')['entity_text'].transform(lambda x: ','.join(x))
# All rows within a group are now identical: keep one per group and renumber the index
df.drop_duplicates().reset_index(drop=True)

Output:

entity_label    entity_text
0   job_title   full stack developer,senior data scientiest,py...
1   country     Inida,Malaysia,India,Australia
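
Note that this gives comma-joined strings rather than tuples. If tuples are needed, as in the question, one possible variant is to drop duplicate rows first and then apply tuple per group. This is only a sketch: the sample data is assumed for illustration and the name result is hypothetical.

```python
import pandas as pd

# Assumed sample data for illustration only.
df = pd.DataFrame({
    "entity_label": ["job_title", "job_title", "job_title",
                     "country", "country", "country"],
    "entity_text": ["full stack developer", "python developer", "python developer",
                    "India", "Australia", "Australia"],
})

result = (
    # Keep the first occurrence of each (label, text) pair.
    df.drop_duplicates(subset=["entity_label", "entity_text"])
    .groupby("entity_label", sort=False)["entity_text"]
    .apply(tuple)    # same aggregation as in the question
    .reset_index(name="entity_text")
)
print(result)
```

Because the duplicates are removed before grouping, the tuple step is the same one used in the question and the original DataFrame is left unmodified.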
