I have a pd.DataFrame with the following structure:
print(df)
target refTime name latitude longitude
0 5.0 2020-05-31 22:48:00 YMLT -41.529999 147.190002
1 6.0 2020-05-31 22:51:00 YWGT -36.419998 146.300003
2 6.0 2020-05-31 22:56:00 YMAY -36.060001 146.929993
3 5.0 2020-05-31 22:47:00 SUMU -34.830002 -56.000000
4 3.0 2020-05-31 22:46:00 FACT -33.990002 18.600000
5 5.0 2020-05-31 23:00:00 SGES -25.450001 -54.849998
6 5.0 2020-05-31 23:00:00 SGAS -25.250000 -57.520000
7 5.0 2020-05-31 22:59:00 SUMU -34.830002 -56.000000
8 8.0 2020-05-31 23:00:00 NFFN -17.750000 177.449997
9 7.0 2020-05-31 23:00:00 SBPS -16.430000 -39.080002
10 7.0 2020-05-31 22:50:00 NSTU -14.330000 -170.720001
This is weather data recorded at various stations on 2020-05-31. There may be duplicate stations (name), e.g. SUMU, but each duplicate has a different refTime.

Question: how can I drop the duplicate station records while keeping the observation closest to the "full" hour (in this case, 23:00:00)? For this particular example, I would end up dropping row 3.
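For reference, the frame can be rebuilt from the listing above so the snippets in the answers run as-is (refTime must be a datetime64 column for the .dt accessor to work):

```python
import pandas as pd

# Reconstruction of the sample frame from the listing above.
df = pd.DataFrame({
    "target": [5.0, 6.0, 6.0, 5.0, 3.0, 5.0, 5.0, 5.0, 8.0, 7.0, 7.0],
    "refTime": pd.to_datetime([
        "2020-05-31 22:48:00", "2020-05-31 22:51:00", "2020-05-31 22:56:00",
        "2020-05-31 22:47:00", "2020-05-31 22:46:00", "2020-05-31 23:00:00",
        "2020-05-31 23:00:00", "2020-05-31 22:59:00", "2020-05-31 23:00:00",
        "2020-05-31 23:00:00", "2020-05-31 22:50:00",
    ]),
    "name": ["YMLT", "YWGT", "YMAY", "SUMU", "FACT",
             "SGES", "SGAS", "SUMU", "NFFN", "SBPS", "NSTU"],
    "latitude": [-41.529999, -36.419998, -36.060001, -34.830002, -33.990002,
                 -25.450001, -25.250000, -34.830002, -17.750000, -16.430000,
                 -14.330000],
    "longitude": [147.190002, 146.300003, 146.929993, -56.000000, 18.600000,
                  -54.849998, -57.520000, -56.000000, 177.449997, -39.080002,
                  -170.720001],
})
```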
Let's try assign with dt.round. The idea is to compute each row's distance from the nearest full hour, sort by that distance, and then drop_duplicates keeping the first (closest) instance per station.
import pandas as pd
import numpy as np

df2 = (
    df.assign(
        # minutes between refTime and the nearest full hour; .abs() guards
        # against times past the half hour, which round down and would
        # otherwise give a negative difference
        hour_diff=((df["refTime"].dt.round("H") - df["refTime"]) / np.timedelta64(1, "m")).abs()
    )
    .sort_values("hour_diff")                        # closest to the hour first
    .drop_duplicates(subset=["name"], keep="first")  # one row per station
    .drop("hour_diff", axis=1)
)
print(df2)
print(df2)
target refTime name latitude longitude
5 5.0 2020-05-31 23:00:00 SGES -25.450001 -54.849998
6 5.0 2020-05-31 23:00:00 SGAS -25.250000 -57.520000
8 8.0 2020-05-31 23:00:00 NFFN -17.750000 177.449997
9 7.0 2020-05-31 23:00:00 SBPS -16.430000 -39.080002
7 5.0 2020-05-31 22:59:00 SUMU -34.830002 -56.000000
2 6.0 2020-05-31 22:56:00 YMAY -36.060001 146.929993
1 6.0 2020-05-31 22:51:00 YWGT -36.419998 146.300003
10 7.0 2020-05-31 22:50:00 NSTU -14.330000 -170.720001
0 5.0 2020-05-31 22:48:00 YMLT -41.529999 147.190002
4 3.0 2020-05-31 22:46:00 FACT -33.990002 18.600000
The distance metric (in minutes) looks like this:
df.assign(
    hour_diff=((df["refTime"].dt.round("H") - df["refTime"]) / np.timedelta64(1, "m")).abs()
)
target refTime name latitude longitude hour_diff
0 5.0 2020-05-31 22:48:00 YMLT -41.529999 147.190002 12.0
1 6.0 2020-05-31 22:51:00 YWGT -36.419998 146.300003 9.0
2 6.0 2020-05-31 22:56:00 YMAY -36.060001 146.929993 4.0
3 5.0 2020-05-31 22:47:00 SUMU -34.830002 -56.000000 13.0 # we drop this
4 3.0 2020-05-31 22:46:00 FACT -33.990002 18.600000 14.0
5 5.0 2020-05-31 23:00:00 SGES -25.450001 -54.849998 0.0
6 5.0 2020-05-31 23:00:00 SGAS -25.250000 -57.520000 0.0
7 5.0 2020-05-31 22:59:00 SUMU -34.830002 -56.000000 1.0 # we keep this one
8 8.0 2020-05-31 23:00:00 NFFN -17.750000 177.449997 0.0
9 7.0 2020-05-31 23:00:00 SBPS -16.430000 -39.080002 0.0
10 7.0 2020-05-31 22:50:00 NSTU -14.330000 -170.720001 10.0
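One caveat with this metric: dt.round rounds to the nearest hour, so an observation past the half hour (none appear in this sample) would round down and produce a negative difference, sorting before the exact matches. Wrapping the metric in .abs() normalizes it; a minimal sketch:

```python
import pandas as pd
import numpy as np

# dt.round rounds to the *nearest* hour, so 23:10 rounds down to 23:00
# and the raw difference goes negative; .abs() fixes the ordering.
t = pd.Series(pd.to_datetime(["2020-05-31 23:10:00", "2020-05-31 23:00:00"]))
diff = (t.dt.round("H") - t) / np.timedelta64(1, "m")
print(diff.tolist())        # [-10.0, 0.0]
print(diff.abs().tolist())  # [10.0, 0.0]
```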
Round the timestamps with Series.dt.round, subtract refTime with Series.sub, take absolute values with Series.abs, and get the index of each group's minimum with DataFrameGroupBy.idxmin, so the rows can be selected with DataFrame.loc:
idx = (df["refTime"].dt.round("H").sub(df["refTime"]).abs()
.groupby(df['name'], sort=False).idxmin())
df = df.loc[idx]
print (df)
target refTime name latitude longitude
0 5.0 2020-05-31 22:48:00 YMLT -41.529999 147.190002
1 6.0 2020-05-31 22:51:00 YWGT -36.419998 146.300003
2 6.0 2020-05-31 22:56:00 YMAY -36.060001 146.929993
7 5.0 2020-05-31 22:59:00 SUMU -34.830002 -56.000000
4 3.0 2020-05-31 22:46:00 FACT -33.990002 18.600000
5 5.0 2020-05-31 23:00:00 SGES -25.450001 -54.849998
6 5.0 2020-05-31 23:00:00 SGAS -25.250000 -57.520000
8 8.0 2020-05-31 23:00:00 NFFN -17.750000 177.449997
9 7.0 2020-05-31 23:00:00 SBPS -16.430000 -39.080002
10 7.0 2020-05-31 22:50:00 NSTU -14.330000 -170.720001
Details:
print (df["refTime"].dt.round("H").sub(df["refTime"]))
0 00:12:00
1 00:09:00
2 00:04:00
3 00:13:00
4 00:14:00
5 00:00:00
6 00:00:00
7 00:01:00
8 00:00:00
9 00:00:00
10 00:10:00
Name: refTime, dtype: timedelta64[ns]
print(idx)
name
YMLT 0
YWGT 1
YMAY 2
SUMU 7
FACT 4
SGES 5
SGAS 6
NFFN 8
SBPS 9
NSTU 10
Name: refTime, dtype: int64
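An equivalent formulation uses GroupBy.transform instead of idxmin: build a boolean mask against the per-station minimum distance. A small self-contained sketch (on a reduced frame, not the question's full data):

```python
import pandas as pd

# Minimal frame modeled on the question's data: two SUMU rows, one SGES row.
df = pd.DataFrame({
    "refTime": pd.to_datetime(["2020-05-31 22:47:00",
                               "2020-05-31 22:59:00",
                               "2020-05-31 23:00:00"]),
    "name": ["SUMU", "SUMU", "SGES"],
})

# Keep rows whose distance to the hour equals the per-station minimum.
# Unlike idxmin, a tie would keep every tied row, not just one.
dist = df["refTime"].dt.round("H").sub(df["refTime"]).abs()
out = df[dist == dist.groupby(df["name"]).transform("min")]
print(out)  # keeps index 1 (SUMU @ 22:59) and index 2 (SGES @ 23:00)
```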
We can use the pandas .drop_duplicates() function to drop any row that duplicates another row. In your case, call df.drop_duplicates().
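Note that on its own, drop_duplicates() keeps rows by position, not by proximity to the full hour, so it only solves this question when combined with a distance-based sort like the ones in the other answers. A small illustration:

```python
import pandas as pd

# Two observations for the same station (subset of the question's data).
df = pd.DataFrame({
    "refTime": pd.to_datetime(["2020-05-31 22:47:00", "2020-05-31 22:59:00"]),
    "name": ["SUMU", "SUMU"],
})

# Plain drop_duplicates keeps the first row *by position* (22:47), which is
# not the one closest to 23:00 -- a distance-based sort has to come first.
naive = df.drop_duplicates(subset=["name"])
dist = df["refTime"].dt.round("H").sub(df["refTime"]).abs()
sorted_keep = df.loc[dist.sort_values().index].drop_duplicates(subset=["name"])
print(naive["refTime"].iloc[0])        # 2020-05-31 22:47:00
print(sorted_keep["refTime"].iloc[0])  # 2020-05-31 22:59:00
```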