I am stuck on this simple task of subsetting the rows of a data frame based on a character vector:
# the vector:
vec <- c("8cc7e656.0152.4359.8566.0581c3",
"b3696374.c6c0.49dd.833e.596e26_D2",
"f635496c.0046.4ecd.89bc.7a4f33_D2",
"e1cd3d70.132b.452f.ba10.026721_D2")
# the dataframe
df <- data.frame(PCC=c("PNNL", "VU", "PNNL", "PNNL", "PNNL", "PNNL", "PNNL", "PNNL", "VU", "PNNL"),
Participant.ID=c("01CO001", "01CO005", "01CO008", "01CO014", "01CO019",
"05CO002", "05CO003", "11CO051", "11CO052", "11CO053"),
Specimen.Label=c("5a3aa99d-ca10-45f6-939f-12392a_D2", "59891744-2db3-4541-a86a-7f911f_D2",
"8cc7e656-0152-4359-8566-0581c3", "c9730cb4-b52c-4ca8-9652-4509d0_D2",
"573048dd-2502-40e0-8e8c-c41bb8_D3", "f635496c-0046-4ecd-89bc-7a4f33_D2",
"8fab37a4-cdf9-4ce8-9081-7b9148_D2", "b3696374-c6c0-49dd-833e-596e26_D2",
"0630ecb0-b664-4e75-bb3c-fb62ee_D2", "e1cd3d70-132b-452f-ba10-026721_D2"))
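I would like to get a data frame containing only the rows where df$Specimen.Label exactly matches an element of vec. The simple df2 <- df[df$Specimen.Label %in% vec,] returns a data frame with 0 rows, and asking for the row indices with vec2 <- which(df$Specimen.Label %in% vec) returns an empty integer vector.
However, grep returns the correct index: for example, grep("e1cd3d70.132b.452f.ba10.026721_D2", df$Specimen.Label) returns 10. So I thought, why not repeat that for every element, like this: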
ind <- vector("numeric")
for (i in (vec)){
a <- vec[i]
ind[i] <- as.numeric(grep(a, df$Specimen.Label))
a <- NULL
}
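Unfortunately, this returns a vector ind (with the same length as vec) filled with NAs instead of the desired row indices, along with a warning that the "number of items to replace is not a multiple of replacement length". What is going on here? Why does grep work when called on its own but fail to return a value when used in the loop? Thanks in advance for any helpful solutions.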
(Just adding my comment as an answer, since it was posted before the other comments.)
The problem is that vec contains dots while df$Specimen.Label contains hyphens, so your first command returns nothing. If you write
df[df$Specimen.Label %in% gsub("\\.", "-", vec),]
you get
# PCC Participant.ID Specimen.Label
# 3 PNNL 01CO008 8cc7e656-0152-4359-8566-0581c3
# 6 PNNL 05CO002 f635496c-0046-4ecd-89bc-7a4f33_D2
# 8 PNNL 11CO051 b3696374-c6c0-49dd-833e-596e26_D2
# 10 PNNL 11CO053 e1cd3d70-132b-452f-ba10-026721_D2
Another base R option is to use the function subset:
subset(df, Specimen.Label %in% gsub("\\.", "-", vec))
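As a side note on why grep appeared to work in the question while %in% did not: grep treats its pattern as a regular expression, and in a regular expression an unescaped dot matches any single character, including a hyphen. The following is only a minimal sketch using one label from the question; the variable names are made up for illustration.

```r
label   <- "e1cd3d70-132b-452f-ba10-026721_D2"  # as stored in df$Specimen.Label
pattern <- "e1cd3d70.132b.452f.ba10.026721_D2"  # as stored in vec

# %in% compares whole strings literally, so "." vs "-" is a mismatch
literal_match <- label %in% pattern

# grepl() interprets the pattern as a regex, where "." matches any character
regex_match <- grepl(pattern, label)

# fixed = TRUE switches off regex interpretation, and the match disappears
fixed_match <- grepl(pattern, label, fixed = TRUE)

print(c(literal = literal_match, regex = regex_match, fixed = fixed_match))
```

So the grep hit is a side effect of regex matching, not an exact match.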
It looks like your problem is that vec contains dots instead of dashes. This code fixes that:
#Replace
vec <- gsub('.', '-', vec, fixed = TRUE)
#Compare
df2 <- df[df$Specimen.Label %in% vec,]
PCC Participant.ID Specimen.Label
3 PNNL 01CO008 8cc7e656-0152-4359-8566-0581c3
6 PNNL 05CO002 f635496c-0046-4ecd-89bc-7a4f33_D2
8 PNNL 11CO051 b3696374-c6c0-49dd-833e-596e26_D2
10 PNNL 11CO053 e1cd3d70-132b-452f-ba10-026721_D2
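Regarding the loop in the question, here is a rough sketch of what seems to go wrong and one way to repair it. In for (i in vec), i is an element of vec (a string), not a position, so vec[i] looks up a name that does not exist and yields NA. As far as I can tell, grep is then called with an NA pattern and returns one NA per label, which would explain both the NAs and the length warning. The vector labels below is a stand-in for df$Specimen.Label, and the loop assumes every pattern matches exactly one label.

```r
vec <- c("8cc7e656.0152.4359.8566.0581c3",
         "b3696374.c6c0.49dd.833e.596e26_D2",
         "f635496c.0046.4ecd.89bc.7a4f33_D2",
         "e1cd3d70.132b.452f.ba10.026721_D2")
# stand-in for df$Specimen.Label
labels <- c("5a3aa99d-ca10-45f6-939f-12392a_D2", "59891744-2db3-4541-a86a-7f911f_D2",
            "8cc7e656-0152-4359-8566-0581c3", "c9730cb4-b52c-4ca8-9652-4509d0_D2",
            "573048dd-2502-40e0-8e8c-c41bb8_D3", "f635496c-0046-4ecd-89bc-7a4f33_D2",
            "8fab37a4-cdf9-4ce8-9081-7b9148_D2", "b3696374-c6c0-49dd-833e-596e26_D2",
            "0630ecb0-b664-4e75-bb3c-fb62ee_D2", "e1cd3d70-132b-452f-ba10-026721_D2")

# What the original loop does on its first pass: index vec by a string
a <- vec[vec[1]]   # vec has no names, so this is NA

# Repaired loop: iterate over positions instead of values
ind <- integer(length(vec))
for (i in seq_along(vec)) {
  # assumes exactly one match per pattern; the dots still act as regex wildcards
  ind[i] <- grep(vec[i], labels)
}

# Loop-free alternative: exact matching after replacing the dots
ind2 <- match(gsub(".", "-", vec, fixed = TRUE), labels)

print(ind)
print(ind2)
```

Both give the row indices in the order of vec rather than in row order; wrap them in sort() if row order matters.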
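The string matching fails because the values in vec are separated by periods, while the values in df are separated by dashes.
Base R solution
If you replace . with -, you can use the [ form of the extract operator together with %in%: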
# the vector:
vec <- c("8cc7e656.0152.4359.8566.0581c3",
"b3696374.c6c0.49dd.833e.596e26_D2",
"f635496c.0046.4ecd.89bc.7a4f33_D2",
"e1cd3d70.132b.452f.ba10.026721_D2")
# the dataframe
df <- data.frame(PCC=c("PNNL", "VU", "PNNL", "PNNL", "PNNL", "PNNL", "PNNL", "PNNL", "VU", "PNNL"),
Participant.ID=c("01CO001", "01CO005", "01CO008", "01CO014", "01CO019",
"05CO002", "05CO003", "11CO051", "11CO052", "11CO053"),
Specimen.Label=c("5a3aa99d-ca10-45f6-939f-12392a_D2", "59891744-2db3-4541-a86a-7f911f_D2",
"8cc7e656-0152-4359-8566-0581c3", "c9730cb4-b52c-4ca8-9652-4509d0_D2",
"573048dd-2502-40e0-8e8c-c41bb8_D3", "f635496c-0046-4ecd-89bc-7a4f33_D2",
"8fab37a4-cdf9-4ce8-9081-7b9148_D2", "b3696374-c6c0-49dd-833e-596e26_D2",
"0630ecb0-b664-4e75-bb3c-fb62ee_D2", "e1cd3d70-132b-452f-ba10-026721_D2"))
vec <- gsub("\\.", "-", vec)
df[df$Specimen.Label %in% vec,]
...and the output:
> df[df$Specimen.Label %in% vec,]
PCC Participant.ID Specimen.Label
3 PNNL 01CO008 8cc7e656-0152-4359-8566-0581c3
6 PNNL 05CO002 f635496c-0046-4ecd-89bc-7a4f33_D2
8 PNNL 11CO051 b3696374-c6c0-49dd-833e-596e26_D2
10 PNNL 11CO053 e1cd3d70-132b-452f-ba10-026721_D2
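A small sketch of why the dot has to be escaped (or matched literally) in the replacement step, using a shortened copy of vec. The variable names are only illustrative. chartr() is shown as a further base R alternative that translates single characters without any regex involved.

```r
vec <- c("8cc7e656.0152.4359.8566.0581c3",
         "b3696374.c6c0.49dd.833e.596e26_D2")

# Unescaped "." is a regex wildcard, so every character gets replaced
wrong <- gsub(".", "-", "a.b")

# Three equivalent ways to replace only literal dots
escaped    <- gsub("\\.", "-", vec)              # escaped regex
literal    <- gsub(".", "-", vec, fixed = TRUE)  # no regex at all
translated <- chartr(".", "-", vec)              # character translation

print(wrong)
print(translated)
```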
dplyr solution
A solution with dplyr::filter() looks like this:
library(dplyr)
df %>% filter(Specimen.Label %in% vec)
PCC Participant.ID Specimen.Label
1 PNNL 01CO008 8cc7e656-0152-4359-8566-0581c3
2 PNNL 05CO002 f635496c-0046-4ecd-89bc-7a4f33_D2
3 PNNL 11CO051 b3696374-c6c0-49dd-833e-596e26_D2
4 PNNL 11CO053 e1cd3d70-132b-452f-ba10-026721_D2