数据帧条件删除重复行



从我的数据集的这个示例中,我需要删除重复的行,这些行在除"parent_state";。将被移除的重复行必须是在"N"中具有NaN的行;parent_ state";列,并保留具有"0"的行;parent_ state";值不同于NaN。在本例中,需要删除的行是第4行,索引为7789。我该怎么做?我还不知道该怎么做。


           stop_id         stop_name                    parent_station              trip_id                 arrival_time   departure_time     stop_sequence   route_id      trip_headsign
7022       87413385        Gare de Yvetot               StopArea:OCE87413385        OCESN003100F140147152   05:49:00       05:50:00           2.0             OCE1506035    3100.0
3518       87411017        Gare de Rouen-Rive-Droite    StopArea:OCE87411017        OCESN003100F140147152   06:12:00       06:15:00           3.0             OCE1506035    3100.0
8040       87413013        Gare de Le Havre             StopArea:OCE87413013        OCESN003100F140147152   05:20:00       05:20:00           0.0             OCE1506035    3100.0
7789       87413013        Gare de Le Havre             NaN                         OCESN003100F140147152   05:20:00       05:20:00           0.0             OCE1506035    3100.0
7197       87413344        Gare de Bréauté-Beuzeville   NaN                         OCESN003100F140147152   05:35:00       05:36:00           1.0             OCE1506035    3100.0

您可以使用布尔掩码:

out = df[~df.drop('parent_station', axis=1).duplicated(keep=False) | pd.notna(df['parent_station'])]

输出:

        stop_id                   stop_name        parent_station  
index                                                               
7022   87413385              Gare de Yvetot  StopArea:OCE87413385   
3518   87411017   Gare de Rouen-Rive-Droite  StopArea:OCE87411017   
8040   87413013            Gare de Le Havre  StopArea:OCE87413013   
7197   87413344  Gare de Bréauté-Beuzeville                   NaN   
                     trip_id arrival_time departure_time  stop_sequence  
index                                                                     
7022   OCESN003100F140147152     05:49:00       05:50:00            2.0   
3518   OCESN003100F140147152     06:12:00       06:15:00            3.0   
8040   OCESN003100F140147152     05:20:00       05:20:00            0.0   
7197   OCESN003100F140147152     05:35:00       05:36:00            1.0   
         route_id  trip_headsign  
index                             
7022   OCE1506035         3100.0  
3518   OCE1506035         3100.0  
8040   OCE1506035         3100.0  
7197   OCE1506035         3100.0  

使用drop_duplicates:

cols = df.columns[df.columns != 'parent_station']
out = df[~(df.duplicated(cols, keep=False) & df['parent_station'].isna())]
print(out)
# Output
       stop_id                   stop_name        parent_station                trip_id arrival_time departure_time  stop_sequence    route_id  trip_headsign
7022  87413385              Gare de Yvetot  StopArea:OCE87413385  OCESN003100F140147152     05:49:00       05:50:00            2.0  OCE1506035         3100.0
3518  87411017   Gare de Rouen-Rive-Droite  StopArea:OCE87411017  OCESN003100F140147152     06:12:00       06:15:00            3.0  OCE1506035         3100.0
8040  87413013            Gare de Le Havre  StopArea:OCE87413013  OCESN003100F140147152     05:20:00       05:20:00            0.0  OCE1506035         3100.0
7197  87413344  Gare de Bréauté-Beuzeville                   NaN  OCESN003100F140147152     05:35:00       05:36:00            1.0  OCE1506035         3100.0

尽管它比上面的解释稍长。

ss = df.columns.drop('parent_station')
keep_rows = df[(df.duplicated(subset=ss, keep=False)) & (~df.parent_station.isna())]
non_duplicates = df[~(df.duplicated(subset=ss, keep=False))]
df = pd.concat([non_duplicates, keep_rows])

在这里,我明确区分了根本没有双精度的行(non_duplicates(和要保留的重复行(keep_rows(。

首先是sort_values,然后是drop_duplicates。排序时,您可以选择将NaN值保留在第一位或最后一位。默认为"last"。我们在drop_duplicates中有类似的名为keep的参数。

参考文献:https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.htmlhttps://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html

output_df = df.sort_values(['parent_station'], na_position='last').drop_duplicates(['stop_id', 'stop_name'], keep='first')

非常简单的解决方案!我希望这就是你想要的:

import pandas as pd
import numpy as np
df = pd.DataFrame({'stop_id': ['87413385', '87411017', '87413013', '87413013', '87413344'],
                   'stop_name': ['Gare de Yvetot', 'Gare de Rouen-Rive-Droite', 'Gare de Le Havre', 'Gare de Le Havre', 'Gare de Bréauté-Beuzeville'],
                   'parent_station': ['StopArea:OCE87413385', 'StopArea:OCE87411017', 'StopArea:OCE87413013', np.NaN, np.NaN]})
is_duplacted = df.duplicated(subset=['stop_id', 'stop_name'])
is_nan = df['parent_station'].isna()
print(df[~(is_duplacted & is_nan)])

#     stop_id                   stop_name        parent_station
# 0  87413385              Gare de Yvetot  StopArea:OCE87413385
# 1  87411017   Gare de Rouen-Rive-Droite  StopArea:OCE87411017
# 2  87413013            Gare de Le Havre  StopArea:OCE87413013
# 4  87413344  Gare de Bréauté-Beuzeville                   NaN

最新更新