Pattern matching character strings when all but the last two characters are the same



I have the following vector:

column_names <- c("6Li", "7Li", "10B", "11B", "7Li.1",
"205Pb", "206Pb", "207Pb", "238U",
"206Pb.1", "238U.1")

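Note that some values simply have ".1" appended to the end. I want to index out all of these strings together with their unsuffixed counterparts, so that only the following is returned.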

#[1] "7Li"     "7Li.1"   "206Pb"   "238U"    "206Pb.1" "238U.1" 

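Assume you don't know the index positions, so you can't simply index the values out with something like column_names[c(2,5,7,9,10,11)]. How can I use pattern matching to extract these values?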

There may be a more elegant solution, but in base R you can try a combination of grep/gsub/paste:

idx <- grep(paste(gsub("\\.1", "", column_names[grep("\\.1", column_names)]), collapse = "|"), column_names)
# [1]  2  5  7  9 10 11
column_names[idx]
# [1] "7Li"     "7Li.1"   "206Pb"   "238U"    "206Pb.1" "238U.1" 
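
One caveat with the one-liner above: the alternation pattern it builds ("7Li|206Pb|238U") is unanchored, so it would also match any name that merely contains a stem (a hypothetical "17Li", for example). A minimal sketch of a stricter variant, which is a different technique from the answer above and assumes the suffix is exactly ".1" at the end of the string, compares stems by exact equality with %in%. The names has_suffix, stems and matched are illustrative only:

```r
column_names <- c("6Li", "7Li", "10B", "11B", "7Li.1",
                  "205Pb", "206Pb", "207Pb", "238U",
                  "206Pb.1", "238U.1")

# TRUE for names that end in a literal ".1"
has_suffix <- grepl("\\.1$", column_names)

# Stems of the suffixed names, e.g. "7Li.1" -> "7Li"
stems <- sub("\\.1$", "", column_names[has_suffix])

# Keep the suffixed names plus any name that equals one of the stems exactly
matched <- column_names[has_suffix | column_names %in% stems]
matched
# [1] "7Li"     "7Li.1"   "206Pb"   "238U"    "206Pb.1" "238U.1"
```

Because %in% tests whole strings, no regular-expression metacharacters in the stems need escaping.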

Use gsub() and duplicated() to find the values that share a duplicated stem:

column_stems <- gsub("\\.1", "", column_names)
dup_idx <- duplicated(column_stems) | duplicated(column_stems, fromLast = TRUE)
column_names[dup_idx]
# "7Li"     "7Li.1"   "206Pb"   "238U"    "206Pb.1" "238U.1" 

To also catch instances ending in .2, .3, and so on, use "\\.\\d+" instead of "\\.1" in gsub().
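
As a rough sketch of that generalisation: the vector below adds a hypothetical "7Li.2" entry that is not in the original question, and the pattern is anchored with $ (an assumption on my part) so that only a trailing ".<digits>" suffix is stripped:

```r
# "7Li.2" is a made-up extra entry, added only to exercise the general pattern
column_names <- c("6Li", "7Li", "10B", "11B", "7Li.1",
                  "205Pb", "206Pb", "207Pb", "238U",
                  "206Pb.1", "238U.1", "7Li.2")

# Strip a trailing ".<digits>" suffix to get each name's stem
column_stems <- sub("\\.\\d+$", "", column_names)

# Flag every name whose stem occurs more than once (scanning from both ends)
dup_idx <- duplicated(column_stems) | duplicated(column_stems, fromLast = TRUE)
dup_names <- column_names[dup_idx]
dup_names
# [1] "7Li"     "7Li.1"   "206Pb"   "238U"    "206Pb.1" "238U.1"  "7Li.2"
```

Combining duplicated() with its fromLast = TRUE counterpart is what marks the first occurrence of each stem as well as the later ones.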

You can use stringr:

library(stringr)
idx <- str_extract(column_names, ".*(?=\\.1)")
column_names[str_detect(column_names, paste(idx[!is.na(idx)], collapse = "|"))]

This returns

#> [1] "7Li"     "7Li.1"   "206Pb"   "238U"    "206Pb.1" "238U.1" 
