How do I remove rows in R, grouped by ID, based on differences within a specific column?


I have a large dataset that looks like this…

Pos  GOIcp  HKGcp        ID
A2   5.49   24.92   1_pDLS_Pdyn
A3   26.80  25.71   1_pDLS_Pdyn
A5   26.83  25.44   16_pDLS_Pdyn
A6   27.03  25.53   16_pDLS_Pdyn
A7   26.78  25.28   16_pDLS_Pdyn
A9   26.91  25.97   6_pDMS_Pdyn
A10  26.65  25.98   6_pDMS_Pdyn
A11  26.15  25.60   6_pDMS_Pdyn
A13  22.93  25.50   1_pDLS_Penk
A14  22.79  25.42   1_pDLS_Penk
A15  22.76  25.29   1_pDLS_Penk
A17  21.94  24.54   16_pDLS_Penk
A18  21.67  24.46   16_pDLS_Penk
A19  22.54  25.21   16_pDLS_Penk
A22  23.15  25.17   6_pDMS_Penk
A23  22.92  25.02   6_pDMS_Penk
C1   26.25  25.58   2_pDLS_Pdyn
C2   26.95  25.99   2_pDLS_Pdyn
C3   26.82  26.06   2_pDLS_Pdyn
C5   27.22  25.55   17_pDLS_Pdyn
C6   29.25  25.61   17_pDLS_Pdyn
C7   27.27  25.71   17_pDLS_Pdyn

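First, I want to group the data by the ID column. Then, within each ID, I want to remove any row whose value in the second column (GOIcp) differs by more than 1.5 from the other rows for that ID. However, if all of the differences for that ID are greater than 1.5, those rows should be kept.

To explain this better: rows 1 and 2 would be kept, because that ID has only 2 rows and the points are far apart. In the last 3 rows, however, one data point for that ID differs from the other 2 by more than 1.5, so the row containing 29.25 should be removed from the data frame.

I hope this makes sense. Any help would be great!

I have tried writing some for loops to do this, but apart from deleting rows manually I don't know how to go about it.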
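The rule can be illustrated on the two groups mentioned above. This is only a minimal base R sketch under one reading of the rule: a value is kept if it lies within 1.5 of at least one other value in its group, or if no value in the group has such a neighbour. The helper name `keep_close` and its `tol` argument are hypothetical and not part of the original question.

```r
# Hypothetical helper: returns TRUE for the values to keep within one ID group.
keep_close <- function(x, tol = 1.5) {
  # Groups with fewer than two values have nothing to compare against.
  if (length(x) < 2) return(rep(TRUE, length(x)))
  # Pairwise absolute differences; ignore each value's distance to itself.
  d <- abs(outer(x, x, "-"))
  diag(d) <- Inf
  has_neighbour <- apply(d, 1, min) <= tol
  # If no value has a close neighbour, keep the whole group.
  if (!any(has_neighbour)) rep(TRUE, length(x)) else has_neighbour
}

keep_close(c(5.49, 26.80))          # ID 1_pDLS_Pdyn:  TRUE TRUE
keep_close(c(27.22, 29.25, 27.27))  # ID 17_pDLS_Pdyn: TRUE FALSE TRUE
```

Under this reading, both rows of the first ID survive because neither has a close neighbour, while 29.25 is the only value dropped from the second.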

Edit: the output should look like this…

Pos  GOIcp  HKGcp        ID
A2   5.49   24.92   1_pDLS_Pdyn
A3   26.80  25.71   1_pDLS_Pdyn
A5   26.83  25.44   16_pDLS_Pdyn
A6   27.03  25.53   16_pDLS_Pdyn
A7   26.78  25.28   16_pDLS_Pdyn
A9   26.91  25.97   6_pDMS_Pdyn
A10  26.65  25.98   6_pDMS_Pdyn
A11  26.15  25.60   6_pDMS_Pdyn
A13  22.93  25.50   1_pDLS_Penk
A14  22.79  25.42   1_pDLS_Penk
A15  22.76  25.29   1_pDLS_Penk
A17  21.94  24.54   16_pDLS_Penk
A18  21.67  24.46   16_pDLS_Penk
A19  22.54  25.21   16_pDLS_Penk
A22  23.15  25.17   6_pDMS_Penk
A23  22.92  25.02   6_pDMS_Penk
C1   26.25  25.58   2_pDLS_Pdyn
C2   26.95  25.99   2_pDLS_Pdyn
C3   26.82  26.06   2_pDLS_Pdyn
C5   27.22  25.55   17_pDLS_Pdyn
C7   27.27  25.71   17_pDLS_Pdyn

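I wish it were as simple as deleting row C6 (by the Pos column), but this is a large data frame and I have only provided a sample.

Edit: here is a reconstruction of my sample data above…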

df1 <- structure(list(
  Pos = c("A2", "A3", "A5", "A6", "A7", "A9", "A10", "A11", "A13", "A14",
          "A15", "A17", "A18", "A19", "A22", "A23", "C1", "C2", "C3", "C5",
          "C6", "C7"),
  GOIcp = c(5.49, 26.8, 26.83, 27.03, 26.78, 26.91, 26.65, 26.15, 22.93,
            22.79, 22.76, 21.94, 21.67, 22.54, 23.15, 22.92, 26.25, 26.95,
            26.82, 27.22, 29.25, 27.27),
  HKGcp = c(24.92, 25.71, 25.44, 25.53, 25.28, 25.97, 25.98, 25.6, 25.5,
            25.42, 25.29, 24.54, 24.46, 25.21, 25.17, 25.02, 25.58, 25.99,
            26.06, 25.55, 25.61, 25.71),
  ID = c("1_pDLS_Pdyn", "1_pDLS_Pdyn", "16_pDLS_Pdyn", "16_pDLS_Pdyn",
         "16_pDLS_Pdyn", "6_pDMS_Pdyn", "6_pDMS_Pdyn", "6_pDMS_Pdyn",
         "1_pDLS_Penk", "1_pDLS_Penk", "1_pDLS_Penk", "16_pDLS_Penk",
         "16_pDLS_Penk", "16_pDLS_Penk", "6_pDMS_Penk", "6_pDMS_Penk",
         "2_pDLS_Pdyn", "2_pDLS_Pdyn", "2_pDLS_Pdyn", "17_pDLS_Pdyn",
         "17_pDLS_Pdyn", "17_pDLS_Pdyn")),
  class = "data.frame", row.names = c(NA, -22L))

I have not tested this on your data, because you did not provide it as code (please provide dput() output), but something like this might help.

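library(dplyr)
df %>% group_by(ID) %>%
  arrange(GOIcp, .by_group = TRUE) %>%
  # keep all rows if every sorted gap exceeds 1.5; otherwise keep rows
  # within 1.5 of the previous or next value in their group
  filter(all(diff(GOIcp) > 1.5) |
           c(FALSE, diff(GOIcp) <= 1.5) | c(diff(GOIcp) <= 1.5, FALSE)) %>%
  ungroup()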
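
For readers who prefer to avoid dplyr, the same neighbour-gap idea can be sketched in base R with `ave()`. This is an illustrative alternative, not the answerer's method. It assumes every ID has at least two rows, and `sample_df`, `keep_flag` and `result` are hypothetical names; the data are three of the ID groups copied from the question.

```r
# Small subset of the question's sample data (three of the ID groups).
sample_df <- data.frame(
  Pos   = c("A2", "A3", "C1", "C2", "C3", "C5", "C6", "C7"),
  GOIcp = c(5.49, 26.80, 26.25, 26.95, 26.82, 27.22, 29.25, 27.27),
  HKGcp = c(24.92, 25.71, 25.58, 25.99, 26.06, 25.55, 25.61, 25.71),
  ID    = c(rep("1_pDLS_Pdyn", 2), rep("2_pDLS_Pdyn", 3), rep("17_pDLS_Pdyn", 3))
)

# Per-group flag in the original row order: 1 = keep, 0 = drop.
keep_flag <- ave(sample_df$GOIcp, sample_df$ID, FUN = function(x) {
  # Distance from each value to its nearest other value in the group.
  gap <- sapply(seq_along(x), function(i) min(abs(x[i] - x[-i])))
  near <- gap <= 1.5
  # Keep rows with a close neighbour, or the whole group if none has one.
  as.numeric(near | !any(near))
})

result <- sample_df[keep_flag == 1, ]
as.character(result$Pos)  # "A2" "A3" "C1" "C2" "C3" "C5" "C7"
```

Unlike the sorted pipeline, this keeps the rows in their original order, which can be convenient when the result has to be compared with the input by eye.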
