I want to compare two columns:
Column A:
IGHV3-21*02,IGHV3-30-5*04
IGHV3-30*18,IGHV3-30-5*01
IGHV5-51*01
IGHV5-76*01
Column B:
IGHV3-21*02
IGHV3-30*18
IGHV5-51*01
IGHV6-51*01
If any item in column A matches any item in column B, it counts as a match (and vice versa).
The expected output would be:
Match column:
TRUE
TRUE
TRUE
FALSE
In R, the simplest approach might be:
df$columnA %in% df$columnB
But this does not take into account that a given position can hold two items, so it returns:
FALSE
FALSE
TRUE
FALSE
Any idea how to use %in% with comma-separated words?
Does this work?
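library(dplyr)
library(tidyr)

df %>%
  # remember the original row so the split items can be regrouped
  mutate(id = row_number()) %>%
  # one row per comma-separated item in columnA
  separate_rows(columnA, sep = ',') %>%
  # compare each item with columnB from the same original row
  mutate(match = columnA == columnB) %>%
  group_by(id) %>%
  # rebuild the item list and flag the row if any item matched
  mutate(columnA = toString(columnA), match = any(match)) %>%
  distinct() %>%
  ungroup() %>%
  select(-id)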
# A tibble: 4 x 3
  columnA                    columnB     match
  <chr>                      <chr>       <lgl>
1 IGHV3-21*02, IGHV3-30-5*04 IGHV3-21*02 TRUE
2 IGHV3-30*18, IGHV3-30-5*01 IGHV3-30*18 TRUE
3 IGHV5-51*01                IGHV5-51*01 TRUE
4 IGHV5-76*01                IGHV6-51*01 FALSE
Data used:
df
                    columnA     columnB
1 IGHV3-21*02,IGHV3-30-5*04 IGHV3-21*02
2 IGHV3-30*18,IGHV3-30-5*01 IGHV3-30*18
3               IGHV5-51*01 IGHV5-51*01
4               IGHV5-76*01 IGHV6-51*01
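The pipeline above compares each item with column B from the same row. For readers who would rather avoid extra packages, here is a minimal base R sketch of that same row-wise comparison. It assumes the data frame and column names shown above, and it is only one possible way to write it, not the answerer's method.

```r
# Data as shown in the answer above
df <- data.frame(
  columnA = c("IGHV3-21*02,IGHV3-30-5*04",
              "IGHV3-30*18,IGHV3-30-5*01",
              "IGHV5-51*01",
              "IGHV5-76*01"),
  columnB = c("IGHV3-21*02", "IGHV3-30*18", "IGHV5-51*01", "IGHV6-51*01"),
  stringsAsFactors = FALSE
)

# Split each columnA entry into its comma-separated items
items <- strsplit(df$columnA, ",", fixed = TRUE)

# Row-wise check: is this row's columnB value among this row's columnA items?
# trimws() guards against stray spaces after the commas.
df$match <- mapply(function(a, b) b %in% trimws(a),
                   items, df$columnB, USE.NAMES = FALSE)

df$match
# [1]  TRUE  TRUE  TRUE FALSE
```

If a match against any value anywhere in column B is wanted instead, the inner test would become `any(trimws(a) %in% df$columnB)`.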
Maybe this will be useful:
library(tidyverse)
# Code
newdf <- df1 %>%
  mutate(id = row_number()) %>%
  # one row per comma-separated item
  separate_rows(V1, sep = ',') %>%
  # items found in df2 get Match = TRUE, all others get NA
  left_join(df2 %>% mutate(Match = TRUE), by = "V1") %>%
  group_by(id) %>%
  # collapse back to one row per original entry
  summarise(V1 = paste0(V1, collapse = ','),
            Val = any(Match, na.rm = TRUE)) %>%
  select(-id)
Output:
# A tibble: 4 x 2
  V1                        Val
  <chr>                     <lgl>
1 IGHV3-21*02,IGHV3-30-5*04 TRUE
2 IGHV3-30*18,IGHV3-30-5*01 TRUE
3 IGHV5-51*01               TRUE
4 IGHV5-76*01               FALSE
Some data used:
#Data1
df1 <- structure(list(V1 = c("IGHV3-21*02,IGHV3-30-5*04", "IGHV3-30*18,IGHV3-30-5*01",
"IGHV5-51*01", "IGHV5-76*01")), class = "data.frame", row.names = c(NA,
-4L))
#Data2
df2 <- structure(list(V1 = c("IGHV3-21*02", "IGHV3-30*18", "IGHV5-51*01",
"IGHV6-51*01")), class = "data.frame", row.names = c(NA, -4L))
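Have a look at base::charmatch. Below is a simple wrapper function. Note that charmatch does partial matching from the start of each string, so it only finds a value of y when it is the first item of an entry in x: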
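`%pin%` <- function(x, y) {
  out <- logical(length(x))
  # for each element of y, the index of the entry in x that starts with it
  # (0 when there is no match or the match is ambiguous)
  p <- unique(charmatch(y, x, 0L))
  out[p[p > 0L]] <- TRUE
  out
}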
Data
x <- c("IGHV3-21*02,IGHV3-30-5*04",
"IGHV3-30*18,IGHV3-30-5*01",
"IGHV5-51*01",
"IGHV5-76*01")
y <- c(
"IGHV3-21*02",
"IGHV3-30*18",
"IGHV5-51*01",
"IGHV6-51*01"
)
Usage
> x %pin% y
[1] TRUE TRUE TRUE FALSE
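Because charmatch matches from the beginning of each string, an item that sits after a comma would not be found this way. The following is a hedged sketch of an alternative that splits each entry first and then tests membership with %in%. The operator name `%anyin%` is made up for this illustration and is not part of base R or of the answer above.

```r
# Hypothetical operator: TRUE when any comma-separated item of x[i] is in table
`%anyin%` <- function(x, table) {
  items <- strsplit(x, ",", fixed = TRUE)
  vapply(items,
         function(parts) any(trimws(parts) %in% table),
         logical(1))
}

x <- c("IGHV3-21*02,IGHV3-30-5*04",
       "IGHV3-30*18,IGHV3-30-5*01",
       "IGHV5-51*01",
       "IGHV5-76*01")
y <- c("IGHV3-21*02", "IGHV3-30*18", "IGHV5-51*01", "IGHV6-51*01")

x %anyin% y
# [1]  TRUE  TRUE  TRUE FALSE

# The matching item may also come second in the entry
"IGHV5-76*01,IGHV6-51*01" %anyin% y
# [1] TRUE
```

The trade-off is that this requires exact equality of whole items, whereas charmatch also accepts partial prefixes.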
You can use tidyr to split the comma-separated entries into separate rows:
df1 <- df %>% separate_rows(columnA,sep=",")
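Splitting is only the first half of the job: the per-item results still have to be folded back into one value per original row. As a rough illustration of that split-then-collapse idea, here is a base R sketch that builds the long format by hand instead of with separate_rows. The names `long`, `id`, `item` and `hit` are placeholders chosen for this example, and the sketch assumes every entry in columnA contains at least one item.

```r
df <- data.frame(
  columnA = c("IGHV3-21*02,IGHV3-30-5*04",
              "IGHV3-30*18,IGHV3-30-5*01",
              "IGHV5-51*01",
              "IGHV5-76*01"),
  columnB = c("IGHV3-21*02", "IGHV3-30*18", "IGHV5-51*01", "IGHV6-51*01"),
  stringsAsFactors = FALSE
)

# Long format: one row per item, with the index of the row it came from
items <- strsplit(df$columnA, ",", fixed = TRUE)
long <- data.frame(
  id   = rep(seq_along(items), lengths(items)),
  item = trimws(unlist(items)),
  stringsAsFactors = FALSE
)

# Flag every item that occurs anywhere in columnB
long$hit <- long$item %in% df$columnB

# Collapse back: a row matches if any of its items was flagged
df$match <- as.vector(tapply(long$hit, long$id, any))

df$match
# [1]  TRUE  TRUE  TRUE FALSE
```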