Python Pandas: remove duplicate rows based on the same id and the same date (keep only the first row)



I have a DataFrame that looks like this:

                              id                 seen  year  month  day dayname
f907942e330ac3653f8a9bd655770872  2021-06-02 16:34:56  2021      6    1  Monday
042b60106231fa8a8e43dd750432d5bc  2021-06-02 16:13:29  2021      6    1  Monday

You can use pd.to_datetime + dt.normalize() to group by id and by the date (without the time) of the seen column, then use GroupBy.first() to get the first entry of each group, as follows:

# Optionally convert to datetime if not already in datetime format
df['seen'] = pd.to_datetime(df['seen'])
df.groupby(['id', df['seen'].dt.normalize()], as_index=False, sort=False).first()

Input data:

(A few rows were added for more thorough testing):

df
id                 seen  year  month  day    dayname
0  f907942e330ac3653f8a9bd655770872  2021-06-02 16:34:56  2021      6    2     Monday
1  f907942e330ac3653f8a9bd655770872  2021-06-02 17:54:56  2021      6    2     Monday
2  042b60106231fa8a8e43dd750432d5bc  2021-06-02 16:13:29  2021      6    2     Monday
3  f907942e330ac3653f8a9bd655770872  2021-06-04 16:22:56  2021      6    4  Wednesday
4  f907942e330ac3653f8a9bd655770872  2021-06-04 17:43:56  2021      6    4  Wednesday

Output:

id                 seen  year  month  day    dayname
0  f907942e330ac3653f8a9bd655770872  2021-06-02 16:34:56  2021      6    2     Monday
1  042b60106231fa8a8e43dd750432d5bc  2021-06-02 16:13:29  2021      6    2     Monday
2  f907942e330ac3653f8a9bd655770872  2021-06-04 16:22:56  2021      6    4  Wednesday
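A self-contained sketch of the same idea, in case it helps to reproduce the result end to end. It rebuilds only the id and seen columns from the sample above and uses DataFrame.duplicated on a temporary date column instead of groupby. The helper column name seen_date is an assumption made for illustration, not something from the original post. One difference worth knowing: GroupBy.first() takes the first non-null value of each column within a group, whereas this mask keeps the first row of each group unchanged.

```python
import pandas as pd

# Sample rows copied from the input data above (id and seen only)
df = pd.DataFrame({
    'id': ['f907942e330ac3653f8a9bd655770872',
           'f907942e330ac3653f8a9bd655770872',
           '042b60106231fa8a8e43dd750432d5bc',
           'f907942e330ac3653f8a9bd655770872',
           'f907942e330ac3653f8a9bd655770872'],
    'seen': ['2021-06-02 16:34:56', '2021-06-02 17:54:56',
             '2021-06-02 16:13:29', '2021-06-04 16:22:56',
             '2021-06-04 17:43:56'],
})
df['seen'] = pd.to_datetime(df['seen'])

# seen_date is a hypothetical helper column holding the date at midnight
keys = df.assign(seen_date=df['seen'].dt.normalize())

# True for every row whose (id, date) pair already appeared earlier
is_repeat = keys.duplicated(subset=['id', 'seen_date'], keep='first')

# Keep only the first row per (id, date); the helper column is never stored
result = df[~is_repeat].reset_index(drop=True)
print(result)
```

Because the helper column lives only on the temporary copy, the grouping key never shares a name with the existing seen column.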

You can also try:

# Your DataFrame:
import pandas as pd

df = pd.DataFrame({'id': ['f907942e330ac3653f8a9bd655770872', '042b60106231fa8a8e43dd750432d5bc'],
                   'seen': ['2021-06-02 16:34:56', '2021-06-02 16:13:29'],
                   'year': ['2021', '2021'],
                   'month': [6, 6], 'day': [1, 1], 'dayname': ['Monday', 'Monday']})

# Use drop_duplicates (by default it keeps the first row of each duplicate group)
df_nodups = df.drop_duplicates(subset=['id', 'year', 'month', 'day'])
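Note that drop_duplicates keeps the first row in the frame's current order, which is not necessarily the earliest timestamp. If "first" should mean the earliest seen value per id and day, one possible sketch is to sort by seen before dropping duplicates and then restore the original order. The shortened ids 'a' and 'b' and the deliberately out-of-order rows are made-up illustration data, not taken from the question.

```python
import pandas as pd

# Illustration data: the two rows for id 'a' are intentionally not in time order
df = pd.DataFrame({
    'id': ['a', 'a', 'b'],
    'seen': ['2021-06-02 17:54:56', '2021-06-02 16:34:56', '2021-06-02 16:13:29'],
    'year': [2021, 2021, 2021],
    'month': [6, 6, 6],
    'day': [2, 2, 2],
})
df['seen'] = pd.to_datetime(df['seen'])

df_nodups = (
    df.sort_values('seen')  # earliest timestamps come first
      .drop_duplicates(subset=['id', 'year', 'month', 'day'], keep='first')
      .sort_index()         # restore the original row order
)
print(df_nodups)
```

If the rows are already in chronological order, the sort is unnecessary and the plain drop_duplicates call gives the same result.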