How do I keep a specific duplicate out of a set of duplicates?



I have a .csv file with this header:

time,open,high,low,close,Extremum,Fib 1,Fib 2,Fib 3,l100,LS3,SS3,Volume,Volume MA

and a large number of rows such as:

2022-04-08T02:00:00+02:00,43.431,43.44,43.431,43.44,44.669,43.58332033414956,43.28818411430672,43.11250779297169,42.91223678664976,,,78.07,

The rows are duplicated, roughly 4 copies of each, and the copies differ in the Extremum column. For example:

2022-04-07 17:10:25,41.622,41.625,41.622,41.625,43.6,42.38191401399852,42.05078384304666,41.85368255081341,41.6289870776675,41.007714285714286,,6.99,571.0029999999954
2022-04-07 17:10:25,41.622,41.625,41.622,41.625,41.589,42.64812186602502,42.93603848979882,43.10741743252131,43.30278942722496,,,6.99,571.0029999999954
2022-04-07 17:10:25,41.622,41.625,41.622,41.625,43.6,42.38191401399852,42.05078384304666,41.85368255081341,41.6289870776675,41.007714285714286,,6.99,571.0029999999954
2022-04-07 17:10:25,41.622,41.625,41.622,41.625,43.6,42.38191401399852,42.05078384304666,41.85368255081341,41.6289870776675,41.007714285714286,,6.99,571.0029999999954

The data is sorted by 'time' with axis=0 (that is column A, or column 0 in the spreadsheet):

csvData.sort_values(by=["time"],axis=0,ascending=True,inplace=True,na_position='first')

The time 17:10:25 has 4 copies. How do I throw away the one that does not match?

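Here the Extremum values are 41.589, 43.6, 43.6, 43.6. The 41.589 is wrong and has to go, and of the remaining 3 copies only 1 should be kept. drop_duplicates can do that last part, but it does not hand me all 4 copies to work with. It can only be set in 3 ways: keep='first', keep='last' or keep=False, and what I would need is a keep=True, which does not exist. I need all 4 copies returned so I can check which of them is bad before I reduce them to a single row, in this case the correct 43.6.

Does anyone know how to do this? I have seen a few ideas on Stack Overflow but did not understand them well enough to apply them to my case, so I am kindly asking for help.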

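You can call duplicated twice, in two different modes: once with keep=False and once with another mode of your choice. Then combine the two results into a single boolean mask.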

Assuming this example dataset:

  date col  other
0    a   a      0
1    a   a      1
2    a   X      2   # unique
3    a   a      3
4    b   Y      4   # unique
5    b   b      5
6    b   b      6
7    b   b      7

You can use:

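# True for every repeat after the first occurrence of a (date, col) pair
m1 = df.duplicated(subset=['date','col'])
# True for every row whose (date, col) pair occurs more than once
m2 = df.duplicated(subset=['date','col'], keep=False)
# the two masks differ only on the first row of each duplicated group
df2 = df[m1!=m2]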

Output:

  date col  other
0    a   a      0
5    b   b      5
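
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The DataFrame is simply rebuilt from the example table above; nothing else is assumed.

```python
import pandas as pd

# Rebuild the example dataset shown above.
df = pd.DataFrame({
    'date':  ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'col':   ['a', 'a', 'X', 'a', 'Y', 'b', 'b', 'b'],
    'other': range(8),
})

# Default keep='first': flags every repeat except the first occurrence.
m1 = df.duplicated(subset=['date', 'col'])
# keep=False: flags every member of a group that has more than one row.
m2 = df.duplicated(subset=['date', 'col'], keep=False)

# The masks disagree only on the first row of each duplicated group.
df2 = df[m1 != m2]
print(df2)
```

Note that rows 2 and 4, which have no duplicate at all, are dropped by this mask as well.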

Intermediates:

  date col  other     m1     m2  m1!=m2
0    a   a      0  False   True    True
1    a   a      1   True   True   False
2    a   X      2  False  False   False
3    a   a      3   True   True   False
4    b   Y      4  False  False   False
5    b   b      5  False   True    True
6    b   b      6   True   True   False
7    b   b      7   True   True   False
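
Applied to the data in the question, the same idea might look like the sketch below. It uses only the time, Extremum and Volume values of the four 17:10:25 rows shown in the question, and it assumes that the "good" value is the one that repeats within a timestamp; the variable names all_copies and kept are made up for illustration.

```python
import pandas as pd

# The four copies from the question, reduced to three columns for brevity.
csvData = pd.DataFrame({
    'time':     ['2022-04-07 17:10:25'] * 4,
    'Extremum': [43.6, 41.589, 43.6, 43.6],
    'Volume':   [6.99, 6.99, 6.99, 6.99],
})

# Step 1: get ALL copies of a repeated timestamp, to inspect them.
all_copies = csvData[csvData.duplicated(subset=['time'], keep=False)]
print(all_copies)

# Step 2: keep one row per repeated (time, Extremum) pair.
# Assumption: a value that occurs only once within its timestamp is the bad one.
m1 = csvData.duplicated(subset=['time', 'Extremum'])
m2 = csvData.duplicated(subset=['time', 'Extremum'], keep=False)
kept = csvData[m1 != m2]
print(kept)
```

A timestamp that occurs only once in the file would also be dropped by the step 2 mask, so this only fits if every row really has copies, as described in the question.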
