I have a file in the following format; a few lines are shown below.
<N2> AS 12/13:2:-1000.00,-25.73 13/13:2:-272.09,-12.81
<N2> AS 6/6:2:-1000.00,-19.88 8/8:2:-211.51,-5.98
<N0> AS 4/4:0:2:-218.21,-11.95 4/4:2:-208.55,-11.01
<N0> AS 0/0:2:-1000.00,-16.68 0/0:2:-294.18,-10.45
<N0> AS 0/1:2:-1000.00,-16.68 0/1:2:-294.18,-10.45
<N0> AS 1/1:2:-1000.00,-16.68 1/1:2:-294.18,-10.45
The first element of $3 (elements are separated by ":") needs to be compared with the first element of $4, and both need to be recoded using only the values 0 and 1. The logic for all four possible comparison cases in the example data is as follows:
When only one value differs between the two elements, change them to 0/0 and 0/1.
When both values differ between the two elements, change them to 0/0 and 1/1.
When both values are the same and non-zero in the two elements, change them to 1/1 and 1/1.
When both elements are already coded with 0 and 1, do not change them.
Applying this logic to the example data, comparing the first element of $3 with the first element of $4:
12/13 and 13/13 have one value in common (values are separated by "/"), so change them to 0/0 and 0/1.
6/6 and 8/8: both values separated by "/" differ between $3 and $4, so change them to 0/0 and 1/1.
4/4 and 4/4: both values separated by "/" are the same between $3 and $4 and are non-zero, so change them to 1/1 and 1/1.
If the values are already coded as 0 and 1, do not change them.
So the output for the example above would be:
<N2> AS 0/0:2:-1000.00,-25.73 0/1:2:-272.09,-12.81
<N2> AS 0/0:2:-1000.00,-19.88 1/1/0:2:-211.51,-5.98
<N0> AS 1/1:0:2:-218.21,-11.95 1/1:2:-208.55,-11.01
<N0> AS 0/0:2:-1000.00,-16.68 0/0:2:-294.18,-10.45
<N0> AS 0/1:2:-1000.00,-16.68 0/1:2:-294.18,-10.45
<N0> AS 1/1:2:-1000.00,-16.68 1/1:2:-294.18,-10.45
Is there a possible solution in awk or R?
You can do the following in R.
Data:
# Read the example rows directly as a plain data.frame (data.table = FALSE),
# so no pipe or setDF call is needed before the libraries are loaded
df1 <-
data.table::fread("<N2> AS 12/13:2:-1000.00,-25.73 13/13:2:-272.09,-12.81
<N2> AS 6/6:2:-1000.00,-19.88 8/8/0:2:-211.51,-5.98
<N0> AS 4/4:0:2:-218.21,-11.95 4/4:2:-208.55,-11.01
<N0> AS 0/0:2:-1000.00,-16.68 0/0:2:-294.18,-10.45
<N0> AS 0/1:2:-1000.00,-16.68 0/1:2:-294.18,-10.45
<N0> AS 1/1:2:-1000.00,-16.68 1/1:2:-294.18,-10.45", sep = " ", header = FALSE, data.table = FALSE)
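Code: load the library and create a function that does the work for you:
library(magrittr)  # provides %>% and %<>%
fun1 <- function(df_in) {
  # Extract the leading "a/b" genotype of every cell and split it into two numbers
  vals <- lapply(df_in, function(x) {
    sub("(\\d+/\\d+).*", "\\1", x, perl = TRUE) %>% strsplit("/") %>% lapply(as.numeric)
  })
  # Recode each pair of genotypes: x comes from the first column, y from the second
  newvals <- mapply(function(x, y) {
    # Already coded with 0 and 1 only: keep both genotypes unchanged
    if (all(c(x, y) %in% 0:1)) {
      return(list(paste0(x, collapse = "/"), paste0(y, collapse = "/")))
    }
    # Anything that is not a clean pair of numbers is flagged instead of recoded
    if (length(x) != 2 || length(y) != 2 || anyNA(c(x, y))) {
      return(list("Error", "Error"))
    }
    n_diff <- sum(x != y)  # number of positions (before/after "/") that differ
    if (n_diff == 1) {
      list("0/0", "0/1")
    } else if (n_diff == 2) {
      list("0/0", "1/1")
    } else {
      list("1/1", "1/1")
    }
  }, x = vals[[1]], y = vals[[2]])
  # Put the recoded genotype back in front of the remainder of each field
  list(
    paste0(unlist(newvals[1, ]), sub("\\d+/\\d+", "", df_in[[1]])),
    paste0(unlist(newvals[2, ]), sub("\\d+/\\d+", "", df_in[[2]]))
  )
}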
Call the function on the column numbers that need to be changed:
df1[,3:4] %<>% fun1
Result:
#> df1
# V1 V2 V3 V4
#1 <N2> AS 0/0:2:-1000.00,-25.73 0/1:2:-272.09,-12.81
#2 <N2> AS 0/0:2:-1000.00,-19.88 1/1/0:2:-211.51,-5.98
#3 <N0> AS 1/1:0:2:-218.21,-11.95 1/1:2:-208.55,-11.01
#4 <N0> AS 0/0:2:-1000.00,-16.68 0/0:2:-294.18,-10.45
#5 <N0> AS 0/1:2:-1000.00,-16.68 0/1:2:-294.18,-10.45
#6 <N0> AS 1/1:2:-1000.00,-16.68 1/1:2:-294.18,-10.45
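Since the question also asks about awk, here is a minimal sketch of the same recoding logic as a shell function wrapping awk. It is not part of the R answer above. It assumes the fields are separated by whitespace, that the genotype is the leading "a/b" of fields 3 and 4, and that rewriting the line with single spaces between fields is acceptable. The function name `recode_genotypes` is made up for illustration.

```shell
# Hypothetical helper: reads lines in the format above on stdin,
# writes the recoded lines to stdout.
recode_genotypes() {
  awk '
  {
    # Split fields 3 and 4 into the leading "a/b" genotype and the rest.
    # Lines without such a genotype are printed unchanged.
    if (!match($3, /^[0-9]+\/[0-9]+/)) { print; next }
    g3 = substr($3, 1, RLENGTH); rest3 = substr($3, RLENGTH + 1)
    if (!match($4, /^[0-9]+\/[0-9]+/)) { print; next }
    g4 = substr($4, 1, RLENGTH); rest4 = substr($4, RLENGTH + 1)

    # Already coded with 0 and 1 only: leave the line as it is
    if (g3 ~ /^[01]\/[01]$/ && g4 ~ /^[01]\/[01]$/) { print; next }

    # Count the positions (before and after "/") where the genotypes differ
    split(g3, a, "/"); split(g4, b, "/")
    n = (a[1] + 0 != b[1] + 0) + (a[2] + 0 != b[2] + 0)

    if (n == 0)      { new3 = "1/1"; new4 = "1/1" }  # same, non-zero values
    else if (n == 1) { new3 = "0/0"; new4 = "0/1" }  # one value differs
    else             { new3 = "0/0"; new4 = "1/1" }  # both values differ

    # Reattach the untouched remainder of each field
    $3 = new3 rest3
    $4 = new4 rest4
    print
  }'
}

# Usage on a single example line:
printf '%s\n' '<N2> AS 12/13:2:-1000.00,-25.73 13/13:2:-272.09,-12.81' | recode_genotypes
# <N2> AS 0/0:2:-1000.00,-25.73 0/1:2:-272.09,-12.81
```

For a whole file, something like `recode_genotypes < input.txt > output.txt` should work, where both file names are placeholders. Like the R function, the sketch only replaces the leading "a/b" and keeps whatever follows it, so a field such as 8/8/0:2:... becomes 1/1/0:2:....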