Hello, I have a data frame such as:
COL1 start end Category
A 30 70 Cat1
A 10 20 Cat2
A 90 300 Cat2
A 12 26 Cat2
A 72 145 Cat2
B 71 145 Cat2
B 250 350 Cut3
B 355 600 Cat2
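So here I am looking for code that counts, for each df$Category=="Cat1" row, the number of neighboring df$Category=="Cat2" rows, where the gap to the neighbor must be < 5.
Let's take an example: for each df$COL1 and df$Category, I count the number of flanking Cat2 rows.
Here: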
COL1 start end Category
A 30 70 Cat1
So I am looking for Cat2 rows that reach into the window from 25 (30 - 5) to 75 (70 + 5), and when I look at the df I see:
A 10 20 Cat2 <- this one is too far away (-10) from 30
A 90 300 Cat2 <- this one is too far away (+20) from 70
A 72 145 Cat2 <- this one is ok since 72 is just +2 away from 70
A 12 26 Cat2 <- this one is ok since 26 is just -4 away from 30
So I add a count to a table such as:
New_df
COL1 Nb_flanking
A 2
Then for df$COL1 == B:
COL1 start end Category
B 250 350 Cut3
I am looking for Cat2 rows that reach into the window from 245 (250 - 5) to 355 (350 + 5), and when I look at the df I see:
B 71 145 Cat2 <- this one is too far away (-105) from 250
B 355 600 Cat2 <- this one is ok since 355 is just +5 away from 350
And then I fill New_df:
COL1 Nb_flanking
A 2
B 1
And so on…
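The distances quoted in the example can be reproduced with a small helper. This is only a sketch: `interval_gap` is a hypothetical name that does not appear elsewhere in the thread, and it assumes the gap is measured between the nearest edges of two intervals, with touching or overlapping intervals counting as a gap of 0.

```r
# Hypothetical helper: gap between a query interval [q_start, q_end]
# and one or more other intervals [start, end].
# Returns 0 when the intervals touch or overlap.
interval_gap <- function(q_start, q_end, start, end) {
  pmax(0, start - q_end, q_start - end)
}

# Cat2 rows of group A against the Cat1 row 30-70
gap_a <- interval_gap(30, 70, c(10, 90, 12, 72), c(20, 300, 26, 145))
gap_a
#> [1] 10 20  4  2

# Cat2 rows of group B against the row 250-350
gap_b <- interval_gap(250, 350, c(71, 355), c(145, 600))
gap_b
#> [1] 105   5
```

Counting the gaps that are at most 5 (`sum(gap_a <= 5)` and `sum(gap_b <= 5)`) gives 2 for A and 1 for B. The bound is taken as inclusive here because the B example accepts a gap of exactly 5.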
Here is the data:
structure(list(COL1 = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L), .Label = c("A", "B"), class = "factor"), start = c(30L,
10L, 90L, 12L, 72L, 71L, 250L, 355L), end = c(70L, 20L, 300L,
26L, 145L, 145L, 350L, 600L), Category = structure(c(1L, 2L,
2L, 2L, 2L, 2L, 3L, 2L), .Label = c("Cat1", "Cat2", "Cut3"), class = "factor")), class = "data.frame", row.names = c(NA,
-8L))
Thank you very much for your help and time.
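If you don't mind a data.table approach, this can be done easily with a non-equi join.
Some of the steps could be simplified, but I kept them separate for clarity.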
library(data.table)
dset <- data.table(structure(
  list(
    COL1 = structure(
      c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
      .Label = c("A", "B"),
      class = "factor"
    ),
    start = c(30L, 10L, 90L, 12L, 72L, 71L, 250L, 355L),
    end = c(70L, 20L, 300L, 26L, 145L, 145L, 350L, 600L),
    Category = structure(
      c(1L, 2L, 2L, 2L, 2L, 2L, 3L, 2L),
      .Label = c("Cat1", "Cat2", "Cut3"),
      class = "factor"
    )
  ),
  class = "data.frame",
  row.names = c(NA, -8L)
))
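# Split the query rows (everything that is not Cat2) from the Cat2 rows
ds1 <- dset[Category != "Cat2"]
ds2 <- dset[Category == "Cat2"]
# The flanking distance: widen each query interval by this much on both sides
dto_flank <- 5
ds1[, start := start - dto_flank]
ds1[, end := end + dto_flank]
# Right join of ds2 onto ds1: within the same COL1, keep the Cat2 rows
# whose interval overlaps the widened query interval
rj <- ds2[ds1,
          .(x.start, i.start, x.end, i.end, x.COL1, i.COL1),
          on = .(COL1 = COL1, start <= end, end >= start)]
# Count the matched Cat2 rows per group (unmatched query rows have NA in x.start)
New_df <- rj[, .(Nb_flanking = sum(!is.na(x.start))), .(COL1 = i.COL1)]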
New_df
#>    COL1 Nb_flanking
#> 1:    A           2
#> 2:    B           1
Created on 2021-02-12 by the reprex package (v0.3.0)
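For readers without data.table, the same overlap condition (same COL1, start <= widened end, end >= widened start) can be sketched in base R. This is a minimal illustration rather than a replacement for the join above: `count_flanking`, `queries`, `cat2` and `flank` are names assumed for this sketch, and the data is retyped as plain character and integer columns instead of the original factors.

```r
# Same data as in the question, typed out as a plain data frame
df <- data.frame(
  COL1 = c("A", "A", "A", "A", "A", "B", "B", "B"),
  start = c(30L, 10L, 90L, 12L, 72L, 71L, 250L, 355L),
  end = c(70L, 20L, 300L, 26L, 145L, 145L, 350L, 600L),
  Category = c("Cat1", "Cat2", "Cat2", "Cat2", "Cat2", "Cat2", "Cut3", "Cat2"),
  stringsAsFactors = FALSE
)

flank <- 5                                 # flanking distance (assumed inclusive)
queries <- df[df$Category != "Cat2", ]     # rows to count neighbors for
cat2 <- df[df$Category == "Cat2", ]        # candidate neighbors

# Number of Cat2 rows in the same COL1 group whose interval overlaps
# the query interval widened by `flank` on each side
count_flanking <- function(col1, start, end) {
  sum(cat2$COL1 == col1 &
        cat2$start <= end + flank &
        cat2$end >= start - flank)
}

New_df <- data.frame(
  COL1 = queries$COL1,
  Nb_flanking = mapply(count_flanking, queries$COL1, queries$start,
                       queries$end, USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
New_df
#>   COL1 Nb_flanking
#> 1    A           2
#> 2    B           1
```

This scans every Cat2 row once per query row, so it is fine for small tables; on large data the non-equi join is likely to scale better.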
You can do it like this, although using a for loop is probably not the best solution.
library(tidyverse)
df <- structure(list(COL1 = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L), .Label = c("A", "B"), class = "factor"), start = c(30L,
10L, 90L, 12L, 72L, 71L, 250L, 355L), end = c(70L, 20L, 300L,
26L, 145L, 145L, 350L, 600L), Category = structure(c(1L, 2L,
2L, 2L, 2L, 2L, 1L, 2L), .Label = c("Cat1", "Cat2"), class = "factor")), class = "data.frame", row.names = c(NA,
-8L))
df <- as_tibble(df)
# Query rows (Cat1) and candidate neighbors (everything else)
df1 <- df %>%
  filter(Category == "Cat1")
df2 <- df %>%
  filter(Category != "Cat1")
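count_function <- function(col1, start, end, df = df2) {
  lo <- start - 5  # lower bound of the flanking window
  hi <- end + 5    # upper bound of the flanking window
  n <- 0
  for (i in seq_len(nrow(df))) {
    y <- slice(df, i)
    # Count the row only if it belongs to the same COL1 group and its
    # interval reaches into [lo, hi]; the bounds are inclusive so that
    # a gap of exactly 5 still counts
    if (as.character(y$COL1) == col1 && y$start <= hi && y$end >= lo) {
      n <- n + 1
    }
  }
  n
}
df1 %>%
  mutate(Nb_flanking_Cat2 = pmap_dbl(list(as.character(COL1), start, end), count_function))
#> # A tibble: 2 x 5
#>   COL1  start   end Category Nb_flanking_Cat2
#>   <fct> <int> <int> <fct>               <dbl>
#> 1 A        30    70 Cat1                    2
#> 2 B       250   350 Cat1                    1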
Created on 2021-02-12 by the reprex package (v0.3.0)