R - Using dplyr to count flanking regions with coordinates



Hello, I have a data frame such as:

COL1 start end Category
A    30    70  Cat1
A    10    20  Cat2
A    90    300 Cat2  
A    12    26  Cat2
A    72    145 Cat2
B    71    145 Cat2
B    250   350 Cut3
B    355   600 Cat2

So I am looking for code that counts, for each df$Category=="Cat1" row, the number of flanking df$Category=="Cat2" rows, where a flanking region must be < 5 away.

Let's take an example: for each df$COL1 and df$Category, I count the number of flanking Cat2 rows.

Here:

COL1 start end  Category
A    30    70   Cat1

So I am looking for Cat2 with start !< 25 and end !> 75, and when I look at df, I see:

A    10    20  Cat2       <- this one is too far away (-10) from 30
A    90    300 Cat2       <- this one is too far away (+20) from 70
A    72    145 Cat2       <- this one is ok since 72 is just +2 away from 70
A    12    26  Cat2       <- this one is ok since 26 is just -4 away from 30

So I add a count to a table such as:

New_df

COL1 Nb_flanking
A    2
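
As a rough sketch, the check for group A can be written as a vectorised comparison. It assumes that "flanking" means the Cat2 interval overlaps the Cat1 region widened by 5 on each side, with inclusive bounds; the vector names are only for illustration.

```r
# Cat2 rows with COL1 == "A", copied from the table above
cat2_start <- c(10L, 90L, 12L, 72L)
cat2_end   <- c(20L, 300L, 26L, 145L)

# Cat1 region 30-70 widened by 5 on each side (assumed interpretation)
lower <- 30L - 5L
upper <- 70L + 5L

# A Cat2 row counts when its interval overlaps [lower, upper]
is_flanking <- cat2_start <= upper & cat2_end >= lower
is_flanking
#> [1] FALSE FALSE  TRUE  TRUE
sum(is_flanking)
#> [1] 2
```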

Then I do the same for df$COL1 == B:

COL1 start end Category
B    250   350 Cut3

I am looking for Cat2 with start !< 245 and end !> 355, and when I look at df, I see:

B    71    145 Cat2    <- this one is too far away (-105) from 250
B    355   600 Cat2    <- this one is ok since 355 is just +5 away from 350

Then I fill in New_df:

COL1 Nb_flanking
A    2
B    1

And so on…

Here is the data:

structure(list(COL1 = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L), .Label = c("A", "B"), class = "factor"), start = c(30L, 
10L, 90L, 12L, 72L, 71L, 250L, 355L), end = c(70L, 20L, 300L, 
26L, 145L, 145L, 350L, 600L), Category = structure(c(1L, 2L, 
2L, 2L, 2L, 2L, 3L, 2L), .Label = c("Cat1", "Cat2", "Cut3"), class = "factor")), class = "data.frame", row.names = c(NA, 
-8L))

Thank you very much for your help and time.

If you don't mind a data.table approach, this can be done easily with a non-equi join.

Some steps could be simplified, but I kept them for clarity.

library(data.table)
dset <- data.table(structure(
  list(
    COL1 = structure(
      c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
      .Label = c("A", "B"),
      class = "factor"
    ),
    start = c(30L, 10L, 90L, 12L, 72L, 71L, 250L, 355L),
    end = c(70L, 20L, 300L, 26L, 145L, 145L, 350L, 600L),
    Category = structure(
      c(1L, 2L, 2L, 2L, 2L, 2L, 3L, 2L),
      .Label = c("Cat1", "Cat2", "Cut3"),
      class = "factor"
    )
  ),
  class = "data.frame",
  row.names = c(NA, -8L)
))
# Split into the regions to test (everything that is not Cat2) and the Cat2 rows
ds1 <- dset[Category != "Cat2"]
ds2 <- dset[Category == "Cat2"]

# The distance to flank
dto_flank <- 5
ds1[, start := start - dto_flank]
ds1[, end := end + dto_flank]

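# right join between ds1 and ds2: every ds1 row is kept and matched to the
# ds2 rows on the same COL1 whose interval overlaps the widened ds1 interval
# (x.* columns come from ds2, i.* columns from ds1)
rj <- ds2[ds1,
          .(x.start, i.start, x.end, i.end, x.COL1, i.COL1),
          on = .(COL1 = COL1, start <= end, end >= start)]
# unmatched ds1 rows have NA in x.start, so they contribute 0 to the count
New_df <- rj[, .(Nb_flanking = sum(!is.na(x.start))), by = .(COL1 = i.COL1)]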

New_df
#>    COL1 Nb_flanking
#> 1:    A           2
#> 2:    B           1

Created on 2021-02-12 by the reprex package (v0.3.0)
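
The same overlap test can also be sketched with data.table's `foverlaps()`, which is designed for interval joins. This is only an illustrative variant under the same assumption (a Cat2 row counts when it overlaps the region widened by 5); the names `queries`, `targets`, `flank` and `hits` are made up for the example. Groups without any hit are dropped here instead of being reported as 0.

```r
library(data.table)

# Same values as in the question, written out as a plain table
dset <- data.table(
  COL1     = c("A", "A", "A", "A", "A", "B", "B", "B"),
  start    = c(30L, 10L, 90L, 12L, 72L, 71L, 250L, 355L),
  end      = c(70L, 20L, 300L, 26L, 145L, 145L, 350L, 600L),
  Category = c("Cat1", "Cat2", "Cat2", "Cat2", "Cat2", "Cat2", "Cut3", "Cat2")
)

flank <- 5L  # allowed distance on each side

# Regions to test, widened by `flank`, and the Cat2 candidates
queries <- dset[Category != "Cat2",
                .(COL1, start = start - flank, end = end + flank)]
targets <- dset[Category == "Cat2", .(COL1, start, end)]

# foverlaps() needs the second table keyed; the last two key columns
# are treated as the interval
setkey(queries, COL1, start, end)
hits <- foverlaps(targets, queries, type = "any", nomatch = NULL)

# One row per overlapping pair, so counting rows per COL1 gives the result
New_df <- hits[, .(Nb_flanking = .N), by = COL1]
New_df
```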

You can do it like this, although using a for loop is probably not the best solution.

library(tidyverse)
df <- structure(
  list(
    COL1 = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
                     .Label = c("A", "B"), class = "factor"),
    start = c(30L, 10L, 90L, 12L, 72L, 71L, 250L, 355L),
    end = c(70L, 20L, 300L, 26L, 145L, 145L, 350L, 600L),
    Category = structure(c(1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L),
                         .Label = c("Cat1", "Cat2"), class = "factor")
  ),
  class = "data.frame",
  row.names = c(NA, -8L)
)
df <- as_tibble(df)
df1 <- df %>% filter(Category == "Cat1")  # regions to test
df2 <- df %>% filter(Category != "Cat1")  # candidate flanking regions

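# Count the rows of `df` in the same COL1 group whose interval overlaps
# the region widened by 5 on each side (bounds inclusive)
count_function <- function(col1, start, end, df = df2) {
  lower <- start - 5
  upper <- end + 5

  n <- 0

  for (i in seq_len(nrow(df))) {
    y <- df %>% slice(i)

    # only rows from the same group can flank this region
    if (as.character(y$COL1) == col1 && y$start <= upper && y$end >= lower) {
      n <- n + 1
    }
  }

  n
}
df1 %>%
  mutate(Nb_flanking_Cat2 = pmap_dbl(list(as.character(COL1), start, end),
                                     count_function))
#> # A tibble: 2 x 5
#>   COL1  start   end Category Nb_flanking_Cat2
#>   <fct> <int> <int> <fct>               <dbl>
#> 1 A        30    70 Cat1                    2
#> 2 B       250   350 Cat1                    1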

Created on 2021-02-12 by the reprex package (v0.3.0)
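
Since the question asks for dplyr, a loop-free variant might look like the sketch below. It assumes the same interpretation as above (a Cat2 row counts when it overlaps the region widened by 5, inclusive) and uses an ordinary join on COL1 followed by a row-wise comparison; `queries`, `targets`, `flank`, `t_start` and `t_end` are illustrative names, not part of the original code.

```r
library(dplyr)

# Same values as in the question, written out as a plain data frame
df <- data.frame(
  COL1     = c("A", "A", "A", "A", "A", "B", "B", "B"),
  start    = c(30L, 10L, 90L, 12L, 72L, 71L, 250L, 355L),
  end      = c(70L, 20L, 300L, 26L, 145L, 145L, 350L, 600L),
  Category = c("Cat1", "Cat2", "Cat2", "Cat2", "Cat2", "Cat2", "Cut3", "Cat2")
)

flank <- 5  # allowed distance on each side

queries <- df %>% filter(Category != "Cat2")
targets <- df %>%
  filter(Category == "Cat2") %>%
  select(COL1, t_start = start, t_end = end)

# Pair every region with every Cat2 row of the same COL1, then count the
# pairs whose intervals overlap once the region is widened by `flank`.
# left_join() keeps regions without any Cat2 row; they end up with 0.
New_df <- queries %>%
  left_join(targets, by = "COL1") %>%
  group_by(COL1) %>%
  summarise(
    Nb_flanking = sum(t_start <= end + flank & t_end >= start - flank,
                      na.rm = TRUE)
  )

New_df
```

The join materialises every region/Cat2 pair within a group, so for large groups the data.table non-equi join is likely to use less memory.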
