熊猫drop_duplicates并保留最接近参考时间的值



我有一个具有以下结构的pd.DataFrame

$ df.head()
target             refTime        name   latitude   longitude  
0          5.0 2020-05-31 22:48:00        YMLT -41.529999  147.190002   
1          6.0 2020-05-31 22:51:00        YWGT -36.419998  146.300003   
2          6.0 2020-05-31 22:56:00        YMAY -36.060001  146.929993   
3          5.0 2020-05-31 22:47:00        SUMU -34.830002  -56.000000   
4          3.0 2020-05-31 22:46:00        FACT -33.990002   18.600000   
5          5.0 2020-05-31 23:00:00        SGES -25.450001  -54.849998   
6          5.0 2020-05-31 23:00:00        SGAS -25.250000  -57.520000
7          5.0 2020-05-31 22:59:00        SUMU -34.830002  -56.000000      
8          8.0 2020-05-31 23:00:00        NFFN -17.750000  177.449997   
9          7.0 2020-05-31 23:00:00        SBPS -16.430000  -39.080002   
10         7.0 2020-05-31 22:50:00        NSTU -14.330000 -170.720001  

这是2020年5月31日各个站点记录的天气数据。可能有重复的(站(名称,例如SUMU.但是,每个副本都有不同的refTime

问题:如何删除重复的台站记录,同时保留最接近"完整"小时的观测值(在本例中为23:00:00小时(?对于这个特定示例,我最终会删除第 3 行。

让我们尝试assigndt.round

这个想法是按它们与小时的距离对值进行排序,然后我们将对值进行排序并drop_duplicates保留第一个实例。

import pandas as pd
import numpy as np
df2 = (
df.assign(
hour_diff=(df["refTime"].dt.round("H") - df["refTime"]) / np.timedelta64(1, "m")
)
.sort_values("hour_diff")
.drop_duplicates(subset=["name"], keep="first")
.drop("hour_diff", axis=1)
)
print(df2)
target             refTime  name   latitude   longitude
5      5.0 2020-05-31 23:00:00  SGES -25.450001  -54.849998
6      5.0 2020-05-31 23:00:00  SGAS -25.250000  -57.520000
8      8.0 2020-05-31 23:00:00  NFFN -17.750000  177.449997
9      7.0 2020-05-31 23:00:00  SBPS -16.430000  -39.080002
7      5.0 2020-05-31 22:59:00  SUMU -34.830002  -56.000000
2      6.0 2020-05-31 22:56:00  YMAY -36.060001  146.929993
1      6.0 2020-05-31 22:51:00  YWGT -36.419998  146.300003
10     7.0 2020-05-31 22:50:00  NSTU -14.330000 -170.720001
0      5.0 2020-05-31 22:48:00  YMLT -41.529999  147.190002
4      3.0 2020-05-31 22:46:00  FACT -33.990002   18.600000

距离度量将如下所示:

df.assign(
hour_diff=(df["refTime"].dt.round("H") - df["refTime"]) / np.timedelta64(1, "m")
)
target             refTime  name   latitude   longitude  hour_diff
0      5.0 2020-05-31 22:48:00  YMLT -41.529999  147.190002       12.0
1      6.0 2020-05-31 22:51:00  YWGT -36.419998  146.300003        9.0
2      6.0 2020-05-31 22:56:00  YMAY -36.060001  146.929993        4.0
3      5.0 2020-05-31 22:47:00  SUMU -34.830002  -56.000000       13.0 # we drop this
4      3.0 2020-05-31 22:46:00  FACT -33.990002   18.600000       14.0
5      5.0 2020-05-31 23:00:00  SGES -25.450001  -54.849998        0.0
6      5.0 2020-05-31 23:00:00  SGAS -25.250000  -57.520000        0.0
7      5.0 2020-05-31 22:59:00  SUMU -34.830002  -56.000000        1.0 # we keep this one
8      8.0 2020-05-31 23:00:00  NFFN -17.750000  177.449997        0.0
9      7.0 2020-05-31 23:00:00  SBPS -16.430000  -39.080002        0.0
10     7.0 2020-05-31 22:50:00  NSTU -14.330000 -170.720001       10.0

对舍入值使用Series.dt.round,减去ref_time,按Series.abs创建绝对值,并对最小值的索引使用DataFrameGroupBy.idxmin,因此可以按DataFrame.loc选择:

idx = (df["refTime"].dt.round("H").sub(df["refTime"]).abs()
.groupby(df['name'], sort=False).idxmin())
df = df.loc[idx]
print (df)
target             refTime  name   latitude   longitude
0      5.0 2020-05-31 22:48:00  YMLT -41.529999  147.190002
1      6.0 2020-05-31 22:51:00  YWGT -36.419998  146.300003
2      6.0 2020-05-31 22:56:00  YMAY -36.060001  146.929993
7      5.0 2020-05-31 22:59:00  SUMU -34.830002  -56.000000
4      3.0 2020-05-31 22:46:00  FACT -33.990002   18.600000
5      5.0 2020-05-31 23:00:00  SGES -25.450001  -54.849998
6      5.0 2020-05-31 23:00:00  SGAS -25.250000  -57.520000
8      8.0 2020-05-31 23:00:00  NFFN -17.750000  177.449997
9      7.0 2020-05-31 23:00:00  SBPS -16.430000  -39.080002
10     7.0 2020-05-31 22:50:00  NSTU -14.330000 -170.720001

详情

print (df["refTime"].dt.round("H").sub(df["refTime"]))
0    00:12:00
1    00:09:00
2    00:04:00
3    00:13:00
4    00:14:00
5    00:00:00
6    00:00:00
7    00:01:00
8    00:00:00
9    00:00:00
10   00:10:00
Name: refTime, dtype: timedelta64[ns]

print (idx)
name
YMLT     0
YWGT     1
YMAY     2
SUMU     7
FACT     4
SGES     5
SGAS     6
NFFN     8
SBPS     9
NSTU    10
Name: refTime, dtype: int64

我们可以使用 pandas .drop_duplicates(( 函数来删除所有重复的另一行的行。 在您的情况下调用 df.drop_duplicates((

最新更新