I have a large dataset that looks like this…
Pos GOIcp HKGcp ID
A2 5.49 24.92 1_pDLS_Pdyn
A3 26.80 25.71 1_pDLS_Pdyn
A5 26.83 25.44 16_pDLS_Pdyn
A6 27.03 25.53 16_pDLS_Pdyn
A7 26.78 25.28 16_pDLS_Pdyn
A9 26.91 25.97 6_pDMS_Pdyn
A10 26.65 25.98 6_pDMS_Pdyn
A11 26.15 25.60 6_pDMS_Pdyn
A13 22.93 25.50 1_pDLS_Penk
A14 22.79 25.42 1_pDLS_Penk
A15 22.76 25.29 1_pDLS_Penk
A17 21.94 24.54 16_pDLS_Penk
A18 21.67 24.46 16_pDLS_Penk
A19 22.54 25.21 16_pDLS_Penk
A22 23.15 25.17 6_pDMS_Penk
A23 22.92 25.02 6_pDMS_Penk
C1 26.25 25.58 2_pDLS_Pdyn
C2 26.95 25.99 2_pDLS_Pdyn
C3 26.82 26.06 2_pDLS_Pdyn
C5 27.22 25.55 17_pDLS_Pdyn
C6 29.25 25.61 17_pDLS_Pdyn
C7 27.27 25.71 17_pDLS_Pdyn
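First, I want to group the data by the ID column. Then, within each ID, I want to remove any row whose value in the second column (GOIcp) differs from the other rows of that ID by more than 1.5. However, if all of the rows for an ID differ from one another by more than 1.5, those rows should be kept.
To explain better: rows 1 and 2 would be kept, because that ID has only 2 rows and the points are far apart. The last 3 rows, however, share an ID in which one data point differs from the other 2 by more than 1.5, so the row with 29.25 should be removed from the data frame.
I hope this makes sense. Any help would be great!
I have tried writing some for loops to do this, but other than deleting rows manually, I do not know how to go about it.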
Edit: the output would look like this…
Pos GOIcp HKGcp ID
A2 5.49 24.92 1_pDLS_Pdyn
A3 26.80 25.71 1_pDLS_Pdyn
A5 26.83 25.44 16_pDLS_Pdyn
A6 27.03 25.53 16_pDLS_Pdyn
A7 26.78 25.28 16_pDLS_Pdyn
A9 26.91 25.97 6_pDMS_Pdyn
A10 26.65 25.98 6_pDMS_Pdyn
A11 26.15 25.60 6_pDMS_Pdyn
A13 22.93 25.50 1_pDLS_Penk
A14 22.79 25.42 1_pDLS_Penk
A15 22.76 25.29 1_pDLS_Penk
A17 21.94 24.54 16_pDLS_Penk
A18 21.67 24.46 16_pDLS_Penk
A19 22.54 25.21 16_pDLS_Penk
A22 23.15 25.17 6_pDMS_Penk
A23 22.92 25.02 6_pDMS_Penk
C1 26.25 25.58 2_pDLS_Pdyn
C2 26.95 25.99 2_pDLS_Pdyn
C3 26.82 26.06 2_pDLS_Pdyn
C5 27.22 25.55 17_pDLS_Pdyn
C7 27.27 25.71 17_pDLS_Pdyn
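I wish it were as simple as deleting row C6 (by the Pos column), but this is a large data frame and I have only provided a sample.
Edit: here is a reconstruction of my sample data above…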
df1 <- structure(list(Pos = c("A2", "A3", "A5", "A6", "A7", "A9", "A10", "A11", "A13", "A14", "A15", "A17", "A18", "A19", "A22", "A23",
"C1", "C2", "C3", "C5", "C6", "C7"), GOIcp = c(5.49, 26.8, 26.83, 27.03, 26.78, 26.91, 26.65, 26.15, 22.93, 22.79, 22.76, 21.94,
21.67, 22.54, 23.15, 22.92, 26.25, 26.95, 26.82, 27.22, 29.25,
27.27), HKGcp = c(24.92, 25.71, 25.44, 25.53, 25.28, 25.97, 25.98,
25.6, 25.5, 25.42, 25.29, 24.54, 24.46, 25.21, 25.17, 25.02,
25.58, 25.99, 26.06, 25.55, 25.61, 25.71), ID = c("1_pDLS_Pdyn",
"1_pDLS_Pdyn", "16_pDLS_Pdyn", "16_pDLS_Pdyn", "16_pDLS_Pdyn",
"6_pDMS_Pdyn", "6_pDMS_Pdyn", "6_pDMS_Pdyn", "1_pDLS_Penk", "1_pDLS_Penk", "1_pDLS_Penk", "16_pDLS_Penk", "16_pDLS_Penk", "16_pDLS_Penk","6_pDMS_Penk", "6_pDMS_Penk", "2_pDLS_Pdyn", "2_pDLS_Pdyn", "2_pDLS_Pdyn", "17_pDLS_Pdyn", "17_pDLS_Pdyn", "17_pDLS_Pdyn")), class = "data.frame", row.names = c(NA,-22L))
Not tested on your data, since you did not provide it as code (please provide the output of dput()), but something like this might help.
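library(dplyr)
df1 %>%
  group_by(ID) %>%
  arrange(GOIcp, .by_group = TRUE) %>%
  # keep whole groups where every gap > 1.5, else rows within 1.5 of a sorted neighbour
  # (the NA comparisons at group edges are treated as FALSE by filter())
  filter(all(diff(GOIcp) > 1.5) |
           (GOIcp - lag(GOIcp) <= 1.5) | (lead(GOIcp) - GOIcp <= 1.5)) %>%
  ungroup()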
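If you would rather not depend on dplyr, or you need the rows to stay in their original order, the same rule can be sketched in base R. This is only a sketch under one reading of the question: a row counts as "close" when it is within 1.5 of its nearest neighbour after sorting the group. `keep_close`, `tol`, `toy` and `kept` are names I made up for illustration; the values are the 1_pDLS_Pdyn and 17_pDLS_Pdyn rows from the sample.

```r
# Hypothetical helper (not from the original post): for one group's values,
# return TRUE for every row that should be kept.
keep_close <- function(x, tol = 1.5) {
  n <- length(x)
  if (n < 2) return(rep(TRUE, n))      # a lone row has nothing to compare against
  ord  <- order(x)
  gaps <- diff(x[ord])                 # gaps between sorted neighbours
  if (all(gaps > tol)) return(rep(TRUE, n))  # every point is isolated: keep them all
  near_prev <- c(FALSE, gaps <= tol)   # within tol of the next-smaller value
  near_next <- c(gaps <= tol, FALSE)   # within tol of the next-larger value
  keep <- logical(n)
  keep[ord] <- near_prev | near_next   # map the flags back to the original row order
  keep
}

# Two IDs taken from the sample data in the question.
toy <- data.frame(
  Pos   = c("A2", "A3", "C5", "C6", "C7"),
  GOIcp = c(5.49, 26.80, 27.22, 29.25, 27.27),
  ID    = c("1_pDLS_Pdyn", "1_pDLS_Pdyn",
            "17_pDLS_Pdyn", "17_pDLS_Pdyn", "17_pDLS_Pdyn")
)

# ave() applies keep_close per ID and returns 0/1 flags in the original row order.
kept <- toy[as.logical(ave(toy$GOIcp, toy$ID, FUN = keep_close)), ]
print(kept)
```

On the full sample you would apply the same `ave()` line to `df1`. Note that with this reading a group such as 1, 2, 10, 20 would lose both 10 and 20, so adjust the rule if that is not what you want.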