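Given an input DataFrame with entity_label and entity_text columns,
I need one row per entity_label, with entity_text holding a tuple of that label's values.
I can get close with groupby:
df_ = (df.groupby("entity_label", sort=True)["entity_text"].apply(tuple).reset_index(name="entity_text"))
but duplicates still remain in the output tuples.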
Before converting each group to a tuple, you can use SeriesGroupBy.unique() to get the unique values of entity_text, like this:
(df.groupby("entity_label", sort=False)["entity_text"]
.unique()
.apply(tuple)
.reset_index(name="entity_text")
)
Result:
entity_label entity_text
0 job_title (Full Stack Developer, Senior Data Scientist, Python Developer)
1 country (India, Malaysia, Australia)
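For anyone who wants to run this end to end, here is a minimal sketch. The sample frame is made up for illustration, since the question's actual input is not shown. It compares the plain tuple aggregation from the question with the unique() version.

```python
import pandas as pd

# Hypothetical sample data; the question's real input frame is not shown.
df = pd.DataFrame({
    "entity_label": ["job_title", "job_title", "job_title",
                     "country", "country", "country"],
    "entity_text": ["Full Stack Developer", "Python Developer", "Python Developer",
                    "India", "Malaysia", "India"],
})

# Plain tuple aggregation (as in the question) keeps repeated values.
with_dupes = df.groupby("entity_label", sort=False)["entity_text"].apply(tuple)

# unique() removes repeats within each group, in order of first appearance,
# and the resulting arrays are then converted to tuples.
deduped = (
    df.groupby("entity_label", sort=False)["entity_text"]
    .unique()
    .apply(tuple)
    .reset_index(name="entity_text")
)

print(with_dupes)
print(deduped)
```

Passing sort=False keeps the groups in the order they first appear in the frame. Use sort=True if you want the labels sorted alphabetically instead.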
Try this:
import pandas as pd
df = pd.DataFrame({'entity_label':["job_title", "job_title","job_title","job_title", "country", "country", "country", "country", "country"],
'entity_text':["full stack developer", "senior data scientiest","python developer","python developer", "Inida", "Malaysia", "India", "Australia", "Australia"],})
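# Drop exact duplicate (entity_label, entity_text) rows.
df.drop_duplicates(inplace=True)
# Overwrite each row's text with the comma-joined texts of its whole group.
df['entity_text'] = df.groupby('entity_label')['entity_text'].transform(lambda x: ','.join(x))
# All rows within a group are now identical, so keep one row per label.
df.drop_duplicates().reset_index(drop=True)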
Output:
entity_label entity_text
0 job_title full stack developer,senior data scientiest,py...
1 country Inida,Malaysia,India,Australia
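Note that this gives comma-joined strings rather than the tuples the question asks for. If tuples are required, a possible variant of the same idea (drop duplicate rows first, then aggregate) is sketched below. The sample data is hypothetical and shortened for illustration.

```python
import pandas as pd

# Hypothetical sample data, not the frame from the question.
df = pd.DataFrame({
    "entity_label": ["job_title", "job_title", "job_title",
                     "country", "country", "country"],
    "entity_text": ["full stack developer", "python developer", "python developer",
                    "India", "Malaysia", "India"],
})

result = (
    # Remove repeated (label, text) pairs before grouping.
    df.drop_duplicates(subset=["entity_label", "entity_text"])
    # sort=False keeps labels in order of first appearance.
    .groupby("entity_label", sort=False)["entity_text"]
    # Collect each group's remaining values into a tuple.
    .apply(tuple)
    .reset_index(name="entity_text")
)

print(result)
```

Because the duplicates are removed on a copy, the original df is left unchanged here.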