Remove duplicates in R based on two criteria, using intervals



I am using R to clean and process data. I want to remove duplicates from a matrix; see the example below. I want to remove them based on two criteria, using intervals if possible: if the same row is detected more than once in the table within RT ± 0.1 and m.z ± 0.001, the redundant rows should be removed.

        RT     m.z
1       2.02 326.1988
2       2.03 326.1989
3       2.06 326.1990
4       2.03 331.1533
5       2.03 375.1785
6       2.03 301.2852
7       2.04 301.2852
8       2.06 301.2852
9       2.07 357.2609
10      2.07 308.0327
11      2.08 218.2221
12      2.08 312.3617
13      2.10 473.3453
14      2.15 388.3929

I would like output like this:

        RT     m.z
1       2.02 326.1988
2       
3       2.06 326.1990
4       2.03 331.1533
5       2.03 375.1785
6       2.03 301.2852
7       
8       2.06 301.2852
9       2.07 357.2609
10      2.07 308.0327
11      2.08 218.2221
12      2.08 312.3617
13      2.10 473.3453
14      2.15 388.3929

Any help would be much appreciated.

Thanks in advance.

Here is an approach using dplyr. I am not sure whether it is the most efficient way.

df <- read.table(textConnection("RT     m.z
1       2.02 326.1988
2       2.03 326.1989
3       2.06 326.1990
4       2.03 331.1533
5       2.03 375.1785
6       2.03 301.2852
7       2.04 301.2852
8       2.06 301.2852
9       2.07 357.2609
10      2.07 308.0327
11      2.08 218.2221
12      2.08 312.3617
13      2.10 473.3453
14      2.15 388.3929"))

Now, using the same data you provided:

library(dplyr)
# This calculates the difference in RT and m.z between consecutive rows
# and looks for absolute differences on which we filter further down the chain
df %>% mutate(
  rtdiff = abs(lag(RT) - RT),
  mzdiff = abs(lag(m.z) - m.z)
)  %>%
  # This replaces the NAs in the first row 
  #  with large values so filter does not have to deal with NAs
  mutate(rtdiff = replace(rtdiff, is.na(rtdiff), 999),
         mzdiff = replace(mzdiff, is.na(mzdiff), 999)) %>%
  # Remove the rows that don't meet your condition
  filter(!(rtdiff < 0.02 & mzdiff < 0.0002)) %>%
  # select only the columns you need and lose the rest
  select(RT, m.z)

This gives us:

    RT      m.z
1  2.02 326.1988
2  2.06 326.1990
3  2.03 331.1533
4  2.03 375.1785
5  2.03 301.2852
6  2.06 301.2852
7  2.07 357.2609
8  2.07 308.0327
9  2.08 218.2221
10 2.08 312.3617
11 2.10 473.3453
12 2.15 388.3929
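
The same consecutive-row filter can also be written in base R with diff(), which computes the differences between adjacent rows directly. This is a minimal sketch using the question's example data and the thresholds from the pipeline above (0.02 for RT, 0.0002 for m.z):

```r
# Example data from the question
RT  <- c(2.02, 2.03, 2.06, 2.03, 2.03, 2.03, 2.04, 2.06,
         2.07, 2.07, 2.08, 2.08, 2.10, 2.15)
m.z <- c(326.1988, 326.1989, 326.1990, 331.1533, 375.1785,
         301.2852, 301.2852, 301.2852, 357.2609, 308.0327,
         218.2221, 312.3617, 473.3453, 388.3929)
df <- data.frame(RT, m.z)

# diff() returns differences between consecutive rows; the leading TRUE
# keeps the first row, which has no predecessor to compare against
keep <- c(TRUE, !(abs(diff(df$RT)) < 0.02 & abs(diff(df$m.z)) < 0.0002))
df[keep, ]
```

This drops rows 2 and 7, matching the dplyr output, and avoids the NA-replacement step entirely.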

Hi,

It seems that in my data there are other rows interleaved between the duplicates, so comparing only consecutive rows misses them.

I therefore suggest a small change to Maiasaura's code:

for (i in 1:100) {
  reduced.list.pre.filtering <- reduced.list.pre.filtering %>%
    mutate(
      rtdiff = abs(lag(RT..min., i) - RT..min.),
      mzdiff = abs(lag(Max..m.z, i) - Max..m.z)
    ) %>%
    # Replace the NAs introduced by lag() with large values
    mutate(
      rtdiff = replace(rtdiff, is.na(rtdiff), 999),
      mzdiff = replace(mzdiff, is.na(mzdiff), 999)
    ) %>%
    # setRT and setmz are the user-defined tolerances
    filter(!(rtdiff < setRT & mzdiff < setmz)) %>%
    select(RT..min., Max..m.z)
}

This way, each row is checked against the 100 rows that follow it, not just its immediate neighbour. I hope it helps someone. If you have a better solution, please don't hesitate to share it.
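
As an alternative to looping over lag offsets, the pairwise comparison can be done in one shot with outer(). This is a sketch under the question's example data and hypothetical tolerances setRT and setmz; note it marks a row as a duplicate if ANY earlier row is within both tolerances, which can differ slightly from the sequential filter when a chain of near-duplicates is pruned step by step:

```r
# Hypothetical tolerances, standing in for the user-set setRT / setmz
setRT <- 0.02
setmz <- 0.0002

# Example data from the question
RT  <- c(2.02, 2.03, 2.06, 2.03, 2.03, 2.03, 2.04, 2.06,
         2.07, 2.07, 2.08, 2.08, 2.10, 2.15)
m.z <- c(326.1988, 326.1989, 326.1990, 331.1533, 375.1785,
         301.2852, 301.2852, 301.2852, 357.2609, 308.0327,
         218.2221, 312.3617, 473.3453, 388.3929)
df <- data.frame(RT, m.z)

# outer() builds the full n x n matrix of pairwise differences;
# row i is a duplicate if some earlier row j < i is close on both columns
close <- abs(outer(df$RT, df$RT, "-")) < setRT &
         abs(outer(df$m.z, df$m.z, "-")) < setmz
dup <- sapply(seq_len(nrow(df)),
              function(i) any(close[i, seq_len(i - 1)]))
df[!dup, ]
```

For large tables the n x n matrices can get big, but for a few thousand rows this is typically much faster than 100 passes through dplyr.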
