I have a .csv file:
time,open,high,low,close,Extremum,Fib 1,Fib 2,Fib 3,l100,LS3,SS3,Volume,Volume MA
with a lot of rows, such as:
2022-04-08T02:00:00+02:00,43.431,43.44,43.431,43.44,44.669,43.58332033414956,43.28818411430672,43.11250779297169,42.91223678664976,,,78.07,
The rows are duplicated, roughly 4 copies each, and the copies differ in the Extremum column. Rows like these:
2022-04-07 17:10:25,41.622,41.625,41.622,41.625,43.6,42.38191401399852,42.05078384304666,41.85368255081341,41.6289870776675,41.007714285714286,,6.99,571.0029999999954
2022-04-07 17:10:25,41.622,41.625,41.622,41.625,41.589,42.64812186602502,42.93603848979882,43.10741743252131,43.30278942722496,,,6.99,571.0029999999954
2022-04-07 17:10:25,41.622,41.625,41.622,41.625,43.6,42.38191401399852,42.05078384304666,41.85368255081341,41.6289870776675,41.007714285714286,,6.99,571.0029999999954
2022-04-07 17:10:25,41.622,41.625,41.622,41.625,43.6,42.38191401399852,42.05078384304666,41.85368255081341,41.6289870776675,41.007714285714286,,6.99,571.0029999999954
It is sorted by 'time', axis=0 (that is column A, or column 0 in a spreadsheet):
csvData.sort_values(by=["time"],axis=0,ascending=True,inplace=True,na_position='first')
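The time 17:10:25 has 4 copies. How do I throw out the one that doesn't match?
Here we have: 41.589, 43.6, 43.6, 43.6. 41.589 is wrong and needs to go, and of the remaining 3 copies only 1 should be kept (that's the drop). duplicate can do that, but it won't give me all 4 copies to work with. keep can only be set 3 ways: keep='first', keep='last' or keep=False, and there is no keep=True, which is what I would need. I need all 4 copies returned so I can check which of the 4 is the bad one before I unique_seen them down to just 1, in this case the correct 43.6. Does anyone know how to do this? I've seen some ideas on Stack, but couldn't understand them well enough to apply them to my case, so I'm kindly asking for help.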
You can use duplicated twice, in two different modes: keep=False and another mode of your choice. Then compute a boolean mask from the two results.
Assuming this example dataset:
  date col  other
0    a   a      0
1    a   a      1
2    a   X      2  # unique
3    a   a      3
4    b   Y      4  # unique
5    b   b      5
6    b   b      6
7    b   b      7
You can use:
m1 = df.duplicated(subset=['date','col'])
m2 = df.duplicated(subset=['date','col'], keep=False)
df2 = df[m1!=m2]
Output:
  date col  other
0    a   a      0
5    b   b      5
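For anyone who wants to run this, here is a minimal self-contained sketch of the same steps. The DataFrame construction is my own reconstruction of the example table above; the two duplicated calls and the mask comparison are the ones from the answer.

```python
import pandas as pd

# Reconstruction of the example dataset shown above.
df = pd.DataFrame({
    "date":  list("aaaabbbb"),
    "col":   ["a", "a", "X", "a", "Y", "b", "b", "b"],
    "other": range(8),
})

# Default keep='first': True for every repeat except the first occurrence.
m1 = df.duplicated(subset=["date", "col"])
# keep=False: True for every member of a group that occurs more than once.
m2 = df.duplicated(subset=["date", "col"], keep=False)

# The masks differ only on the first row of each duplicated group.
df2 = df[m1 != m2]
print(df2)
```

Rows that are unique on (date, col) are False in both masks, so they are dropped together with the extra repeats.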
Intermediates:
  date col  other     m1     m2  m1!=m2
0    a   a      0  False   True    True
1    a   a      1   True   True   False
2    a   X      2  False  False   False
3    a   a      3   True   True   False
4    b   Y      4  False  False   False
5    b   b      5  False   True    True
6    b   b      6   True   True   False
7    b   b      7   True   True   False
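Applied to the CSV from the question, the same idea might look like the sketch below. It uses only the time, close and Extremum values quoted in the question, and it assumes that ['time', 'Extremum'] is the right subset to compare on. The last two lines are an extra of mine, not part of the answer: they assume a timestamp that occurs only once should be kept, which the plain m1 != m2 mask would otherwise drop.

```python
import pandas as pd

# Values copied from the rows quoted in the question (other columns omitted).
df = pd.DataFrame({
    "time": ["2022-04-07 17:10:25"] * 4 + ["2022-04-08T02:00:00+02:00"],
    "close": [41.625, 41.625, 41.625, 41.625, 43.44],
    "Extremum": [43.6, 41.589, 43.6, 43.6, 44.669],
})

key = ["time", "Extremum"]  # assumption: copies are compared on these columns
m1 = df.duplicated(subset=key)              # repeats after the first occurrence
m2 = df.duplicated(subset=key, keep=False)  # every member of a repeated group

# Keeps one 43.6 row; the odd 41.589 row is unique on `key`, so it is dropped.
cleaned = df[m1 != m2]

# Assumption: timestamps that appear only once should survive as well.
single = ~df.duplicated(subset=["time"], keep=False)
cleaned_keep_single = df[(m1 != m2) | single]
print(cleaned_keep_single)
```

This only works when the bad value is the odd one out. If a wrong Extremum value appeared twice at the same timestamp, it would form its own duplicated group and one copy of it would be kept too.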