r - Subsetting data frame rows based on a character vector when "%in%" and "which" do not work



I'm stuck on this simple task of subsetting the rows of a data frame based on a character vector:

# the vector:
vec <- c("8cc7e656.0152.4359.8566.0581c3",    
"b3696374.c6c0.49dd.833e.596e26_D2", 
"f635496c.0046.4ecd.89bc.7a4f33_D2", 
"e1cd3d70.132b.452f.ba10.026721_D2") 
# the dataframe
df <- data.frame(PCC=c("PNNL", "VU", "PNNL", "PNNL", "PNNL", "PNNL", "PNNL", "PNNL", "VU", "PNNL"),
Participant.ID=c("01CO001", "01CO005", "01CO008", "01CO014", "01CO019", 
"05CO002", "05CO003", "11CO051", "11CO052", "11CO053"),
Specimen.Label=c("5a3aa99d-ca10-45f6-939f-12392a_D2", "59891744-2db3-4541-a86a-7f911f_D2", 
"8cc7e656-0152-4359-8566-0581c3", "c9730cb4-b52c-4ca8-9652-4509d0_D2",
"573048dd-2502-40e0-8e8c-c41bb8_D3", "f635496c-0046-4ecd-89bc-7a4f33_D2",
"8fab37a4-cdf9-4ce8-9081-7b9148_D2", "b3696374-c6c0-49dd-833e-596e26_D2", 
"0630ecb0-b664-4e75-bb3c-fb62ee_D2", "e1cd3d70-132b-452f-ba10-026721_D2"))

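I want a data frame containing only the rows defined by exact matches between df$Specimen.Label and vec. A simple df2 <- df[df$Specimen.Label %in% vec,] returns a data frame with 0 rows, and asking for the row indices with vec2 <- which(df$Specimen.Label %in% vec) returns an empty vector of class integer.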

However, grep returns the correct index: for example, grep("e1cd3d70.132b.452f.ba10.026721_D2", df$Specimen.Label) returns 10. So I thought, why not replicate it like this:

ind <- vector("numeric")
for (i in (vec)){
a <- vec[i]
ind[i] <- as.numeric(grep(a, df$Specimen.Label))
a <- NULL
}

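Unfortunately, this returns a vector ind (with length equal to that of vec) filled with NAs instead of the desired row indices, along with a warning that the "number of items to replace is not a multiple of replacement length". What is going on here? Why does grep work when called on its own but fail to return values when used in a loop? Thanks in advance for any productive solutions.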
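On the loop specifically, a likely explanation is that for (i in vec) iterates over the strings themselves rather than over positions, so vec[i] is a lookup by name and yields NA. The sketch below is one possible repair, not the asker's code; the name labels is a stand-in for df$Specimen.Label, and it assumes each pattern matches exactly one label.

```r
vec <- c("8cc7e656.0152.4359.8566.0581c3",
         "b3696374.c6c0.49dd.833e.596e26_D2",
         "f635496c.0046.4ecd.89bc.7a4f33_D2",
         "e1cd3d70.132b.452f.ba10.026721_D2")
# Stand-in for df$Specimen.Label (hypothetical name, same values as in the question)
labels <- c("5a3aa99d-ca10-45f6-939f-12392a_D2", "59891744-2db3-4541-a86a-7f911f_D2",
            "8cc7e656-0152-4359-8566-0581c3", "c9730cb4-b52c-4ca8-9652-4509d0_D2",
            "573048dd-2502-40e0-8e8c-c41bb8_D3", "f635496c-0046-4ecd-89bc-7a4f33_D2",
            "8fab37a4-cdf9-4ce8-9081-7b9148_D2", "b3696374-c6c0-49dd-833e-596e26_D2",
            "0630ecb0-b664-4e75-bb3c-fb62ee_D2", "e1cd3d70-132b-452f-ba10-026721_D2")

# Indexing an unnamed vector by a string finds no such name and gives NA.
lookup_by_name <- vec[vec[1]]

# Iterate over positions instead, and preallocate the result.
ind <- integer(length(vec))
for (i in seq_along(vec)) {
  # [1] keeps a single index; it becomes NA if the pattern matches nothing
  ind[i] <- grep(vec[i], labels)[1]
}
ind
```

Note that this still treats each element of vec as a regular expression, which is why the dots happen to match the hyphens.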

(Just adding my comment as an answer, since it was posted before the other comments.)

The problem is that vec contains dots while df$Specimen.Label contains hyphens, so your first command returns nothing. If you write

df[df$Specimen.Label %in% gsub("\\.", "-", vec),]

you get

#     PCC Participant.ID                    Specimen.Label
# 3  PNNL        01CO008    8cc7e656-0152-4359-8566-0581c3
# 6  PNNL        05CO002 f635496c-0046-4ecd-89bc-7a4f33_D2
# 8  PNNL        11CO051 b3696374-c6c0-49dd-833e-596e26_D2
# 10 PNNL        11CO053 e1cd3d70-132b-452f-ba10-026721_D2

Another base R option is to use the subset function:

subset(df, Specimen.Label %in% gsub("\\.", "-", vec))

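It looks like your problem is that vec contains dots instead of dashes. This code fixes it:

# Replace the dots with dashes (fixed = TRUE treats "." literally)
vec <- gsub('.', '-', vec, fixed = TRUE)
# Compare
df2 <- df[df$Specimen.Label %in% vec,]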
PCC Participant.ID                    Specimen.Label
3  PNNL        01CO008    8cc7e656-0152-4359-8566-0581c3
6  PNNL        05CO002 f635496c-0046-4ecd-89bc-7a4f33_D2
8  PNNL        11CO051 b3696374-c6c0-49dd-833e-596e26_D2
10 PNNL        11CO053 e1cd3d70-132b-452f-ba10-026721_D2
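If row indices are wanted rather than the rows themselves, as in the which call from the question, the same replacement applies. The following is a minimal sketch; labels stands in for df$Specimen.Label and fixed_vec is a hypothetical name for the corrected copy, so that vec itself is left unchanged.

```r
vec <- c("8cc7e656.0152.4359.8566.0581c3",
         "b3696374.c6c0.49dd.833e.596e26_D2",
         "f635496c.0046.4ecd.89bc.7a4f33_D2",
         "e1cd3d70.132b.452f.ba10.026721_D2")
# Stand-in for df$Specimen.Label (hypothetical name, same values as in the question)
labels <- c("5a3aa99d-ca10-45f6-939f-12392a_D2", "59891744-2db3-4541-a86a-7f911f_D2",
            "8cc7e656-0152-4359-8566-0581c3", "c9730cb4-b52c-4ca8-9652-4509d0_D2",
            "573048dd-2502-40e0-8e8c-c41bb8_D3", "f635496c-0046-4ecd-89bc-7a4f33_D2",
            "8fab37a4-cdf9-4ce8-9081-7b9148_D2", "b3696374-c6c0-49dd-833e-596e26_D2",
            "0630ecb0-b664-4e75-bb3c-fb62ee_D2", "e1cd3d70-132b-452f-ba10-026721_D2")

# Corrected copy of vec with literal dots replaced by dashes
fixed_vec <- gsub(".", "-", vec, fixed = TRUE)

# Indices in the order the rows appear in the data
idx_data_order <- which(labels %in% fixed_vec)

# Indices in the order the labels appear in vec (NA for labels not found)
idx_vec_order <- match(fixed_vec, labels)
```

which() follows the row order of the data, while match() follows the order of vec, so the two can differ when vec is not sorted like the data frame.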

The string matching fails because the values in vec are separated by periods, whereas the values in df are separated by dashes.
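This also suggests why grep appeared to work in the question: grep interprets its pattern as a regular expression, where an unescaped . matches any single character, including -. A small sketch of the difference, using one label from the question (the variable names are illustrative only):

```r
pattern <- "e1cd3d70.132b.452f.ba10.026721_D2"   # element of vec (dots)
label   <- "e1cd3d70-132b-452f-ba10-026721_D2"   # value in df$Specimen.Label (dashes)

# Exact comparison, which is what %in% relies on
exact <- pattern %in% label

# Regular-expression match: "." matches any character, so the dashes match
regex_hit <- grepl(pattern, label)

# Literal match: "." must be a real dot, so this does not match
literal_hit <- grepl(pattern, label, fixed = TRUE)

c(exact = exact, regex = regex_hit, literal = literal_hit)
```

Only the regular-expression variant reports a match, which is consistent with %in% returning nothing while grep found row 10.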

Base R solution

If you replace . with -, you can use the [ form of the extract operator together with %in%:

# the vector:
vec <- c("8cc7e656.0152.4359.8566.0581c3",    
"b3696374.c6c0.49dd.833e.596e26_D2", 
"f635496c.0046.4ecd.89bc.7a4f33_D2", 
"e1cd3d70.132b.452f.ba10.026721_D2") 
# the dataframe
df <- data.frame(PCC=c("PNNL", "VU", "PNNL", "PNNL", "PNNL", "PNNL", "PNNL", "PNNL", "VU", "PNNL"),
Participant.ID=c("01CO001", "01CO005", "01CO008", "01CO014", "01CO019", 
"05CO002", "05CO003", "11CO051", "11CO052", "11CO053"),
Specimen.Label=c("5a3aa99d-ca10-45f6-939f-12392a_D2", "59891744-2db3-4541-a86a-7f911f_D2", 
"8cc7e656-0152-4359-8566-0581c3", "c9730cb4-b52c-4ca8-9652-4509d0_D2",
"573048dd-2502-40e0-8e8c-c41bb8_D3", "f635496c-0046-4ecd-89bc-7a4f33_D2",
"8fab37a4-cdf9-4ce8-9081-7b9148_D2", "b3696374-c6c0-49dd-833e-596e26_D2", 
"0630ecb0-b664-4e75-bb3c-fb62ee_D2", "e1cd3d70-132b-452f-ba10-026721_D2"))
vec <- gsub("\\.", "-", vec)
df[df$Specimen.Label %in% vec,]

...and the output:

> df[df$Specimen.Label %in% vec,]
PCC Participant.ID                    Specimen.Label
3  PNNL        01CO008    8cc7e656-0152-4359-8566-0581c3
6  PNNL        05CO002 f635496c-0046-4ecd-89bc-7a4f33_D2
8  PNNL        11CO051 b3696374-c6c0-49dd-833e-596e26_D2
10 PNNL        11CO053 e1cd3d70-132b-452f-ba10-026721_D2

dplyr solution

A solution with dplyr::filter() looks like this:

library(dplyr)
df %>% filter(Specimen.Label %in% vec)
PCC Participant.ID                    Specimen.Label
1 PNNL        01CO008    8cc7e656-0152-4359-8566-0581c3
2 PNNL        05CO002 f635496c-0046-4ecd-89bc-7a4f33_D2
3 PNNL        11CO051 b3696374-c6c0-49dd-833e-596e26_D2
4 PNNL        11CO053 e1cd3d70-132b-452f-ba10-026721_D2
