From this sample of my dataset, I need to remove the duplicated rows that are identical in every column except "parent_station". The duplicate that must be removed is the one with NaN in the "parent_station" column, keeping the row whose "parent_station" value is not NaN. In this example, the row to delete is the 4th row, with index 7789. How can I do this? I don't know how to approach it yet.
stop_id stop_name parent_station trip_id arrival_time departure_time stop_sequence route_id trip_headsign
7022 87413385 Gare de Yvetot StopArea:OCE87413385 OCESN003100F140147152 05:49:00 05:50:00 2.0 OCE1506035 3100.0
3518 87411017 Gare de Rouen-Rive-Droite StopArea:OCE87411017 OCESN003100F140147152 06:12:00 06:15:00 3.0 OCE1506035 3100.0
8040 87413013 Gare de Le Havre StopArea:OCE87413013 OCESN003100F140147152 05:20:00 05:20:00 0.0 OCE1506035 3100.0
7789 87413013 Gare de Le Havre NaN OCESN003100F140147152 05:20:00 05:20:00 0.0 OCE1506035 3100.0
7197 87413344 Gare de Bréauté-Beuzeville NaN OCESN003100F140147152 05:35:00 05:36:00 1.0 OCE1506035 3100.0
You can use a boolean mask:
out = df[~df.drop('parent_station', axis=1).duplicated(keep=False) | pd.notna(df['parent_station'])]
Output:
stop_id stop_name parent_station
index
7022 87413385 Gare de Yvetot StopArea:OCE87413385
3518 87411017 Gare de Rouen-Rive-Droite StopArea:OCE87411017
8040 87413013 Gare de Le Havre StopArea:OCE87413013
7197 87413344 Gare de Bréauté-Beuzeville NaN
trip_id arrival_time departure_time stop_sequence
index
7022 OCESN003100F140147152 05:49:00 05:50:00 2.0
3518 OCESN003100F140147152 06:12:00 06:15:00 3.0
8040 OCESN003100F140147152 05:20:00 05:20:00 0.0
7197 OCESN003100F140147152 05:35:00 05:36:00 1.0
route_id trip_headsign
index
7022 OCE1506035 3100.0
3518 OCE1506035 3100.0
8040 OCE1506035 3100.0
7197 OCE1506035 3100.0
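To try the boolean-mask answer on its own, here is a minimal, self-contained sketch. As an assumption to keep it short, the DataFrame below reproduces only the three columns that drive the duplicate check; the remaining columns are identical across the duplicated rows and do not affect the result:

```python
import numpy as np
import pandas as pd

# Reduced version of the sample data (only the columns relevant to the check).
df = pd.DataFrame(
    {"stop_id": ["87413385", "87411017", "87413013", "87413013", "87413344"],
     "stop_name": ["Gare de Yvetot", "Gare de Rouen-Rive-Droite",
                   "Gare de Le Havre", "Gare de Le Havre",
                   "Gare de Bréauté-Beuzeville"],
     "parent_station": ["StopArea:OCE87413385", "StopArea:OCE87411017",
                        "StopArea:OCE87413013", np.nan, np.nan]},
    index=[7022, 3518, 8040, 7789, 7197])

# Keep a row if it is NOT part of a duplicate group (ignoring parent_station),
# OR if its parent_station is not NaN. Row 7789 fails both conditions.
out = df[~df.drop("parent_station", axis=1).duplicated(keep=False)
         | pd.notna(df["parent_station"])]
print(out)
```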
Alternatively, using duplicated on a subset of columns:
cols = df.columns[df.columns != 'parent_station']
out = df[~(df.duplicated(cols, keep=False) & df['parent_station'].isna())]
print(out)
# Output
stop_id stop_name parent_station trip_id arrival_time departure_time stop_sequence route_id trip_headsign
7022 87413385 Gare de Yvetot StopArea:OCE87413385 OCESN003100F140147152 05:49:00 05:50:00 2.0 OCE1506035 3100.0
3518 87411017 Gare de Rouen-Rive-Droite StopArea:OCE87411017 OCESN003100F140147152 06:12:00 06:15:00 3.0 OCE1506035 3100.0
8040 87413013 Gare de Le Havre StopArea:OCE87413013 OCESN003100F140147152 05:20:00 05:20:00 0.0 OCE1506035 3100.0
7197 87413344 Gare de Bréauté-Beuzeville NaN OCESN003100F140147152 05:35:00 05:36:00 1.0 OCE1506035 3100.0
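This variant can also be run in isolation. A minimal sketch follows (assumption: only the two identifying columns plus parent_station are reproduced, since the other columns are identical across the duplicated rows):

```python
import numpy as np
import pandas as pd

# Reduced version of the sample data (only the columns relevant to the check).
df = pd.DataFrame(
    {"stop_id": ["87413385", "87411017", "87413013", "87413013", "87413344"],
     "stop_name": ["Gare de Yvetot", "Gare de Rouen-Rive-Droite",
                   "Gare de Le Havre", "Gare de Le Havre",
                   "Gare de Bréauté-Beuzeville"],
     "parent_station": ["StopArea:OCE87413385", "StopArea:OCE87411017",
                        "StopArea:OCE87413013", np.nan, np.nan]},
    index=[7022, 3518, 8040, 7789, 7197])

# All columns except parent_station define what counts as a duplicate.
cols = df.columns[df.columns != "parent_station"]

# Drop a row only when it is duplicated on every other column AND its
# parent_station is NaN.
out = df[~(df.duplicated(cols, keep=False) & df["parent_station"].isna())]
print(out)
```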
Here is another approach, although it is slightly longer than the ones above:
ss = df.columns.drop('parent_station')
keep_rows = df[(df.duplicated(subset=ss, keep=False)) & (~df.parent_station.isna())]
non_duplicates = df[~(df.duplicated(subset=ss, keep=False))]
df = pd.concat([non_duplicates, keep_rows])
Here I explicitly separate the rows that have no duplicates at all (non_duplicates) from the duplicated rows that should be kept (keep_rows).
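The split-and-concat steps above can be run end to end as follows (a minimal sketch; as an assumption, only the three relevant columns of the sample data are reproduced):

```python
import numpy as np
import pandas as pd

# Reduced version of the sample data (only the columns relevant to the check).
df = pd.DataFrame(
    {"stop_id": ["87413385", "87411017", "87413013", "87413013", "87413344"],
     "stop_name": ["Gare de Yvetot", "Gare de Rouen-Rive-Droite",
                   "Gare de Le Havre", "Gare de Le Havre",
                   "Gare de Bréauté-Beuzeville"],
     "parent_station": ["StopArea:OCE87413385", "StopArea:OCE87411017",
                        "StopArea:OCE87413013", np.nan, np.nan]},
    index=[7022, 3518, 8040, 7789, 7197])

# Columns used to decide whether two rows are duplicates of each other.
ss = df.columns.drop("parent_station")

# Duplicated rows worth keeping: those with a real parent_station.
keep_rows = df[df.duplicated(subset=ss, keep=False) & ~df["parent_station"].isna()]

# Rows that were never duplicated in the first place.
non_duplicates = df[~df.duplicated(subset=ss, keep=False)]

result = pd.concat([non_duplicates, keep_rows])
print(result)
```

Note that concatenating the two pieces changes the row order relative to the original frame; if the original order matters, a `result.sort_index()` (or reindexing against `df.index`) can restore it.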
First sort_values, then drop_duplicates. When sorting, you can choose whether NaN values go first or last via na_position; the default is 'last'. drop_duplicates has a similarly-purposed parameter named keep.
References:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
output_df = df.sort_values(['parent_station'], na_position='last').drop_duplicates(['stop_id', 'stop_name'], keep='first')
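For a runnable version of this sort-then-dedupe one-liner, here is a minimal sketch (assumption: only the three relevant columns of the sample are reproduced). The trick is that after sorting with na_position='last', the row with a real parent_station comes before its NaN duplicate, so keep='first' retains the right one:

```python
import numpy as np
import pandas as pd

# Reduced version of the sample data (only the columns relevant to the check).
df = pd.DataFrame(
    {"stop_id": ["87413385", "87411017", "87413013", "87413013", "87413344"],
     "stop_name": ["Gare de Yvetot", "Gare de Rouen-Rive-Droite",
                   "Gare de Le Havre", "Gare de Le Havre",
                   "Gare de Bréauté-Beuzeville"],
     "parent_station": ["StopArea:OCE87413385", "StopArea:OCE87411017",
                        "StopArea:OCE87413013", np.nan, np.nan]},
    index=[7022, 3518, 8040, 7789, 7197])

# NaN parent_station rows sort last, so keep='first' keeps the non-NaN row
# of each (stop_id, stop_name) group.
output_df = (df.sort_values("parent_station", na_position="last")
               .drop_duplicates(["stop_id", "stop_name"], keep="first"))
print(output_df)
```

Note that the result comes back in the sorted order, not the original row order.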
A very simple solution! I hope this is what you are looking for:
import pandas as pd
import numpy as np
df = pd.DataFrame({'stop_id': ['87413385', '87411017', '87413013', '87413013', '87413344'],
'stop_name': ['Gare de Yvetot', 'Gare de Rouen-Rive-Droite', 'Gare de Le Havre', 'Gare de Le Havre', 'Gare de Bréauté-Beuzeville'],
'parent_station': ['StopArea:OCE87413385', 'StopArea:OCE87411017', 'StopArea:OCE87413013', np.nan, np.nan]})
is_duplicated = df.duplicated(subset=['stop_id', 'stop_name'])
is_nan = df['parent_station'].isna()
print(df[~(is_duplicated & is_nan)])
# stop_id stop_name parent_station
# 0 87413385 Gare de Yvetot StopArea:OCE87413385
# 1 87411017 Gare de Rouen-Rive-Droite StopArea:OCE87411017
# 2 87413013 Gare de Le Havre StopArea:OCE87413013
# 4 87413344 Gare de Bréauté-Beuzeville NaN