R - How to extract parts from a string



I have a string named PATTERN:

PATTERN <- "MODEL_Name.model-OUTCOME_any.outcome-IMP_number"

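I would like to use pattern-matching functions (e.g. grep, sub, ...) to obtain a string variable MODEL equal to "Name.model", a string variable OUTCOME equal to "any.outcome", and an integer variable IMP equal to number.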

If MODEL, OUTCOME and IMP were all integers, I could use the function sub to get the values:
PATTERN <- "MODEL_002-OUTCOME_007-IMP_001"
pattern_build <- "MODEL_([0-9]+)-OUTCOME_([0-9]+)-IMP_([0-9]+)"
# Backreferences must be written with a doubled backslash inside an R string
MODEL <- as.integer(sub(pattern_build, "\\1", PATTERN))
OUTCOME <- as.integer(sub(pattern_build, "\\2", PATTERN))
IMP <- as.integer(sub(pattern_build, "\\3", PATTERN))

Do you know how to match the strings contained in the variable PATTERN when they are not integers?

Possible tricky patterns are:

PATTERN <- "MODEL_PS2-OUTCOME_stroke_i-IMP_001"
PATTERN <- "MODEL_linear-model-OUTCOME_stroke_i-IMP_001"

A solution that can also handle the 'tricky' patterns:

PATTERN <- "MODEL_linear-model-OUTCOME_stroke_i-IMP_001"
lst <- strsplit(PATTERN, '([A-Z]+_)')[[1]][2:4]
lst <- sub('-$','',lst)

which gives:

> lst
[1] "linear-model" "stroke_i"     "001"

and, if you need it in a data frame:

df <- as.data.frame.list(lst)
names(df) <- c('MODEL','OUTCOME','IMP')

which gives:

> df
         MODEL  OUTCOME IMP
1 linear-model stroke_i 001

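A minimal-regex approach (shown here for PATTERN <- "MODEL_PS2-OUTCOME_stroke_i-IMP_001"; because it splits on every '-', it does not handle a model name that itself contains a hyphen, such as "linear-model"):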

sapply(strsplit(PATTERN, '-'), function(i) sub('(.*?_){1}', '', i))
#     [,1]      
#[1,] "PS2"     
#[2,] "stroke_i"
#[3,] "001"     
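
If the hyphenated case matters, one possible variation on the split-based idea is to split only at hyphens that are directly followed by a key. This is a minimal sketch, not part of the answer above. It assumes every key is a run of uppercase letters followed by "_" and that no value contains a "-" immediately followed by such a key; the names parts and values are illustrative.

```r
# Assumption: keys look like [A-Z]+_ and values never contain "-KEY_".
PATTERN <- "MODEL_linear-model-OUTCOME_stroke_i-IMP_001"

# Split only at hyphens directly followed by KEY_ (lookahead needs perl = TRUE).
parts <- strsplit(PATTERN, "-(?=[A-Z]+_)", perl = TRUE)[[1]]

# Drop the leading KEY_ prefix from each piece.
values <- sub("^[A-Z]+_", "", parts)
print(values)
# [1] "linear-model" "stroke_i"     "001"
```

The lookahead keeps the key itself out of the delimiter, so the hyphen inside "linear-model" is left alone.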

You can use a pattern with capturing groups that match any characters, as few as possible, between the known delimiting substrings:

MODEL_(.*?)-OUTCOME_(.*?)-IMP_(.*)

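See the regex demo. Note that the last .* is greedy, so it matches everything up to the end of the string.

You can make this pattern more precise so that it only matches the expected characters (for example, to match digits in the last capturing group, use ([0-9]+) instead of (.*)).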
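
As a rough illustration of the stricter pattern, the sketch below anchors the regex and restricts the last group to digits, then checks which inputs match before extracting IMP. The input vector x and the names strict, ok and imp are hypothetical; perl = TRUE is a choice made here, not something the answer requires.

```r
# Stricter variant: anchored, with the last group restricted to digits.
strict <- "^MODEL_(.*?)-OUTCOME_(.*?)-IMP_([0-9]+)$"

# Hypothetical inputs: the second one has a non-numeric IMP part.
x <- c("MODEL_PS2-OUTCOME_stroke_i-IMP_001",
       "MODEL_PS2-OUTCOME_stroke_i-IMP_abc")

# Which strings fit the expected layout?
ok <- grepl(strict, x, perl = TRUE)

# Extract IMP only where the pattern matched; sub() returns the input
# unchanged when there is no match, which would not convert to an integer.
imp <- rep(NA_integer_, length(x))
imp[ok] <- as.integer(sub(strict, "\\3", x[ok], perl = TRUE))

print(ok)
print(imp)
```

Checking the match first avoids the coercion warning that as.integer would raise on the unmatched string.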

The pattern can be used with str_match from the stringr package:

> library(stringr)
> x <- "MODEL_Name.model-OUTCOME_any.outcome-IMP_number"
> res <- str_match(x, "MODEL_(.*?)-OUTCOME_(.*?)-IMP_(.*)")
> res[,2]
[1] "Name.model"
> res[,3]
[1] "any.outcome"
> res[,4]
[1] "number"
> 

A base R solution using the same regex involves regmatches/regexec:

> res <- regmatches(x, regexec("MODEL_(.*?)-OUTCOME_(.*?)-IMP_(.*)", x))[[1]]
> res[2]
[1] "Name.model"
> res[3]
[1] "any.outcome"
> res[4]
[1] "number"
> 
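
For several strings at once, base R also offers strcapture (in the utils package, R 3.4.0 or later), which applies a regex with capturing groups and returns a data frame typed according to a prototype. The following is a minimal sketch under those assumptions; the vector x, the prototype proto and the result name parsed are illustrative, and the last group uses the digit-only variant so that IMP can be converted to an integer.

```r
# Two of the 'tricky' strings from the question.
x <- c("MODEL_PS2-OUTCOME_stroke_i-IMP_001",
       "MODEL_linear-model-OUTCOME_stroke_i-IMP_001")

# The prototype supplies column names and types; IMP is converted with as.integer().
proto <- data.frame(MODEL = character(), OUTCOME = character(), IMP = integer(),
                    stringsAsFactors = FALSE)

# One row per input string, one column per capturing group.
parsed <- utils::strcapture("MODEL_(.*?)-OUTCOME_(.*?)-IMP_([0-9]+)", x, proto,
                            perl = TRUE)
print(parsed)
```

Strings that do not match the pattern yield a row of NA values, so malformed inputs stay visible in the result instead of being dropped.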
