Pattern-match strings that are identical except for their last two characters

I have the following vector:

column_names <- c("6Li", "7Li", "10B", "11B", "7Li.1",
"205Pb", "206Pb", "207Pb", "238U",
"206Pb.1", "238U.1")

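Note that some values simply have ".1" tacked onto the end. I want to index out all of these strings together with their unsuffixed counterparts, so that only the following is returned.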

#[1] "7Li"     "7Li.1"   "206Pb"   "238U"    "206Pb.1" "238U.1"

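Assume you do not know the index positions, so you cannot simply index the values out with column_names[c(2,5,7,9,10,11)]. How can I use pattern matching to extract these values?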

There may be a more elegant solution, but in base R you can try a combination of grep/gsub and paste:

idx <- grep(paste(gsub("\\.1", "", column_names[grep("\\.1", column_names)]), collapse = "|"), column_names)
# [1]  2  5  7  9 10 11
column_names[idx]
# [1] "7Li"     "7Li.1"   "206Pb"   "238U"    "206Pb.1" "238U.1"
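One caveat with the alternation pattern above: it is an unanchored regular expression, so a stem such as "7Li" would also match a hypothetical name like "17Li". A minimal sketch of an exact-match variant follows, assuming the suffix is always a literal ".1" at the very end of the name. The names suffixed, stems and result are introduced here only for illustration.

```r
column_names <- c("6Li", "7Li", "10B", "11B", "7Li.1",
                  "205Pb", "206Pb", "207Pb", "238U",
                  "206Pb.1", "238U.1")

# Names that carry the ".1" suffix (anchored at the end of the string)
suffixed <- grep("\\.1$", column_names, value = TRUE)

# Strip the suffix to recover the stems, e.g. "7Li.1" -> "7Li"
stems <- sub("\\.1$", "", suffixed)

# Keep a name if it is a suffixed name or exactly equals one of the stems
result <- column_names[column_names %in% c(stems, suffixed)]
result
```

Because %in% compares whole strings, no regex metacharacters in the stems need escaping, and the original order of column_names is preserved.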

Use gsub() and duplicated() to find the values that share a stem:

column_stems <- gsub("\\.1", "", column_names)
dup_idx <- duplicated(column_stems) | duplicated(column_stems, fromLast = TRUE)
column_names[dup_idx]
# [1] "7Li"     "7Li.1"   "206Pb"   "238U"    "206Pb.1" "238U.1"

To also catch instances ending in .2, .3, and so on, use "\\.\\d+" instead of "\\.1" in gsub().
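As a rough sketch of that generalization, the same gsub()/duplicated() idea can be applied to a vector with mixed numeric suffixes. The input column_names2 below is hypothetical (it is not from the question), and the pattern is additionally anchored with $ so that only a trailing ".<digits>" is removed.

```r
# Hypothetical input: names with ".1" and ".2" suffixes
column_names2 <- c("6Li", "7Li", "7Li.1", "7Li.2", "10B", "238U", "238U.2")

# Remove a trailing ".<digits>" to get the stem of each name
column_stems2 <- sub("\\.\\d+$", "", column_names2)

# Flag every name whose stem occurs more than once
dup_idx2 <- duplicated(column_stems2) | duplicated(column_stems2, fromLast = TRUE)

result2 <- column_names2[dup_idx2]
result2
# [1] "7Li"    "7Li.1"  "7Li.2"  "238U"   "238U.2"
```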

You can use stringr:

library(stringr)
idx <- str_extract(column_names, ".*(?=\\.1)")
column_names[str_detect(column_names, paste(idx[!is.na(idx)], collapse = "|"))]

This returns

#> [1] "7Li"     "7Li.1"   "206Pb"   "238U"    "206Pb.1" "238U.1"
