Compare values in two columns and recode them in R or awk



>I have a file in the following format; a few lines are shown below.

<N2>    AS  12/13:2:-1000.00,-25.73     13/13:2:-272.09,-12.81
<N2>    AS  6/6:2:-1000.00,-19.88   8/8:2:-211.51,-5.98
<N0>    AS  4/4:0:2:-218.21,-11.95  4/4:2:-208.55,-11.01
<N0>    AS  0/0:2:-1000.00,-16.68   0/0:2:-294.18,-10.45
<N0>    AS  0/1:2:-1000.00,-16.68   0/1:2:-294.18,-10.45
<N0>    AS  1/1:2:-1000.00,-16.68   1/1:2:-294.18,-10.45

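The first element of $3 (elements are separated by ":") needs to be compared with the first element of $4, and both must be recoded using only the values 0 and 1. The logic for all four possible comparison cases in the sample data is as follows: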

When only one value differs between the two elements, change them to 0/0 and 0/1.
When both values differ between the two elements, change them to 0/0 and 1/1.
When both values are the same and non-zero between the two elements, change them to 1/1 and 1/1.
When both values are already coded as 0 and 1, do not change them.
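
The four rules above can be sketched as a small R function that works on a single pair of values. This is only an illustration: `recode_pair` is a hypothetical name that does not appear in the original post, and it assumes that each side holds exactly two values and that they are compared position by position.

```r
# Hypothetical helper (not from the original post): recode one pair of
# genotype strings such as "12/13" and "13/13".
recode_pair <- function(a, b) {
  x <- as.numeric(strsplit(a, "/", fixed = TRUE)[[1]])
  y <- as.numeric(strsplit(b, "/", fixed = TRUE)[[1]])
  stopifnot(length(x) == 2, length(y) == 2)  # assumes exactly two values per side

  # Already coded with 0 and 1 only: leave both strings untouched.
  if (all(c(x, y) %in% 0:1)) return(c(a, b))

  # Count the positions that differ (position-wise comparison is an assumption).
  n_diff <- sum(x != y)
  switch(as.character(n_diff),
         "0" = c("1/1", "1/1"),  # both values are the same
         "1" = c("0/0", "0/1"),  # only one value differs
         "2" = c("0/0", "1/1"))  # both values differ
}

recode_pair("12/13", "13/13")  # "0/0" "0/1"
recode_pair("6/6", "8/8")      # "0/0" "1/1"
recode_pair("4/4", "4/4")      # "1/1" "1/1"
recode_pair("0/1", "0/1")      # "0/1" "0/1"
```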

Applying this logic to the sample data, the first element of $3 is compared with the first element of $4:

12/13 and 13/13 have one value in common (only one of the values separated by "/" differs), so change them to 0/0 and 0/1
6/6 and 8/8: both values separated by "/" differ between $3 and $4, so change them to 0/0 and 1/1
4/4 and 4/4: both values separated by "/" are the same between $3 and $4 and are non-zero, so change them to 1/1 and 1/1

If the values are already coded as 0 and 1, do not change them.

The output for the example above should therefore look like this:

<N2>    AS  0/0:2:-1000.00,-25.73   0/1:2:-272.09,-12.81
<N2>    AS  0/0:2:-1000.00,-19.88   1/1/0:2:-211.51,-5.98
<N0>    AS  1/1:0:2:-218.21,-11.95  1/1:2:-208.55,-11.01
<N0>    AS  0/0:2:-1000.00,-16.68   0/0:2:-294.18,-10.45
<N0>    AS  0/1:2:-1000.00,-16.68   0/1:2:-294.18,-10.45
<N0>    AS  1/1:2:-1000.00,-16.68   1/1:2:-294.18,-10.45

Is there a possible solution in awk or R?

You can do the following in R.

Data:

library(magrittr)    # provides %>%
library(data.table)  # provides fread() and setDF()
df1 <-
  fread("<N2>    AS  12/13:2:-1000.00,-25.73     13/13:2:-272.09,-12.81
<N2>    AS  6/6:2:-1000.00,-19.88   8/8/0:2:-211.51,-5.98
<N0>    AS  4/4:0:2:-218.21,-11.95  4/4:2:-208.55,-11.01
<N0>    AS  0/0:2:-1000.00,-16.68   0/0:2:-294.18,-10.45
<N0>    AS  0/1:2:-1000.00,-16.68   0/1:2:-294.18,-10.45
<N0>    AS  1/1:2:-1000.00,-16.68   1/1:2:-294.18,-10.45", sep=" ", header=FALSE) %>% setDF

Code: load the libraries and create a function that does the work for you:

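library(magrittr)
library(dplyr)
fun1 <- function(df_in) {
  # Extract the leading "a/b" part of every field and split it into two numbers.
  vals <- lapply(df_in, function(x) {
    sub("(\\d+/\\d+).*", "\\1", x, perl = TRUE) %>% strsplit("/") %>% lapply(as.numeric)
  })
  newvals <-
    mapply(function(x, y) {
      # Pairs already coded with 0 and 1 only are returned unchanged.
      if (all(c(x, y) %in% 0:1)) list(paste0(x, collapse = "/"), paste0(y, collapse = "/")) else {
        # u[i] is TRUE when the i-th values of the two columns differ.
        u = -abs(x - y) <= -1;
        return(
          case_when(
            identical(u, c(T, F)) ~ list("0/0", "0/1"),
            identical(u, c(F, T)) ~ list("0/0", "0/1"),
            identical(u, c(T, T)) ~ list("0/0", "1/1"),
            identical(u, c(F, F)) ~ list("1/1", "1/1"),
            TRUE    ~ list("Error", "Error")
          )
        )
      } }, x = vals[[1]], y = vals[[2]])
  # Put the recoded pair in front of the untouched remainder of each field.
  return(
    list(
      paste0(unlist(newvals[1, ]), sub("\\d+/\\d+", "", df_in[[1]])),
      paste0(unlist(newvals[2, ]), sub("\\d+/\\d+", "", df_in[[2]]))
    )
  )
}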

Call the function on the column numbers that need to be changed:

df1[,3:4] %<>% fun1

Result:

#> df1
#    V1 V2                     V3                    V4
#1 <N2> AS  0/0:2:-1000.00,-25.73  0/1:2:-272.09,-12.81
#2 <N2> AS  0/0:2:-1000.00,-19.88 1/1/0:2:-211.51,-5.98
#3 <N0> AS 1/1:0:2:-218.21,-11.95  1/1:2:-208.55,-11.01
#4 <N0> AS  0/0:2:-1000.00,-16.68  0/0:2:-294.18,-10.45
#5 <N0> AS  0/1:2:-1000.00,-16.68  0/1:2:-294.18,-10.45
#6 <N0> AS  1/1:2:-1000.00,-16.68  1/1:2:-294.18,-10.45
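
For readers who would rather avoid the extra packages, the same recoding could probably be done in base R alone. The sketch below is an assumption-laden variant, not part of the original answer: `recode_cols` is a hypothetical name, every field is assumed to start with two numbers separated by "/", and the values are compared position by position.

```r
# Hypothetical base R variant (not from the original answer): recode the
# leading "a/b" part of two character columns.
recode_cols <- function(c3, c4) {
  pat <- "^([0-9]+)/([0-9]+)"
  # Each list element is c(full match, first value, second value).
  m3 <- regmatches(c3, regexec(pat, c3))
  m4 <- regmatches(c4, regexec(pat, c4))
  x <- t(sapply(m3, function(m) as.numeric(m[2:3])))
  y <- t(sapply(m4, function(m) as.numeric(m[2:3])))

  # Number of positions (0, 1 or 2) at which the two columns differ.
  n_diff <- rowSums(x != y)
  new3 <- ifelse(n_diff == 0, "1/1", "0/0")
  new4 <- c("1/1", "0/1", "1/1")[n_diff + 1]

  # Rows that contain only 0 and 1 keep their original coding.
  keep <- rowSums(x > 1 | y > 1) == 0
  new3[keep] <- sapply(m3, `[`, 1)[keep]
  new4[keep] <- sapply(m4, `[`, 1)[keep]

  # Re-attach the untouched remainder of each field.
  list(paste0(new3, sub(pat, "", c3)), paste0(new4, sub(pat, "", c4)))
}

lines <- c("<N2> AS 12/13:2:-1000.00,-25.73 13/13:2:-272.09,-12.81",
           "<N2> AS 6/6:2:-1000.00,-19.88 8/8/0:2:-211.51,-5.98",
           "<N0> AS 4/4:0:2:-218.21,-11.95 4/4:2:-208.55,-11.01",
           "<N0> AS 0/1:2:-1000.00,-16.68 0/1:2:-294.18,-10.45")
df <- read.table(text = lines, header = FALSE, stringsAsFactors = FALSE)
df[, 3:4] <- recode_cols(df$V3, df$V4)
write.table(df, quote = FALSE, row.names = FALSE, col.names = FALSE, sep = "\t")
```

With a real file, `read.table(text = lines, ...)` would be replaced by `read.table("yourfile", ...)`, where the file name is a placeholder.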
