In R regular expressions, how can a regex be evaluated not from the start of the target string, but only from the nth word onward?
For example, suppose one wants to replace every number in a string with the symbol @. One can then use gsub("\\d+", "@", string), as in:
gsub("\\d+", "@", "words before 879 then more words then 1001 again")
The result will be:
[1] "words before @ then more words then @ again"
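Now, continuing that example: using a regular expression, how can only the numbers that appear from the 4th word of the string onward be replaced? The example above would then return "words before 879 then more words then @ again", because 879 is the third word of the target string.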
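As an aside, I found many questions about extracting and locating words, some about matching from the start or from the end, and some about getting the substring that starts at the nth word. But none covers exactly how to ignore the first n words of a string when searching for a pattern, using only a regular expression.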
We can create a proto object for gsubfn whose function counts the words and does the replacement:
library(gsubfn)
gsubfn("\\w+", proto(fun = function(this, x) if(count > 3)
    sub("\\d+", "@", x) else x), str1)
#[1] "words before 879 then more words then @ again"
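One advantage is that the replacement can be applied at any word count, or across several ranges of word counts. For example, suppose we only want to replace in words 4 through 6:
gsubfn("\\w+", proto(fun = function(this, x) if(count %in% 4:6)
    sub("\\d+", "@", x) else x), str1)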
Or a more complex case:
gsubfn("\\w+", proto(fun = function(this, x) if(count %in% c(4:6, 12:15))
    sub("\\d+", "@", x) else x), str2)
#[1] "words before 879 then @ replace not 1001 again and replace @ and @"
Data:
str1 <- "words before 879 then more words then 1001 again"
str2 <- "words before 879 then 50 replace not 1001 again and replace 1003 and 1005"
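For comparison, the same word-index idea can be sketched in base R without gsubfn. This is a different technique from the proto counter above, not a drop-in equivalent: it assumes words are separated by single spaces, whereas "\\w+" matches runs of word characters. The helper name replace_in_words and its parameters are made up for illustration.

```r
# Hypothetical helper (not part of gsubfn): replace the first run of digits
# in each word whose position is listed in `positions`.
# Assumption: words are separated by single spaces.
replace_in_words <- function(x, positions, pattern = "\\d+", replacement = "@") {
  words <- strsplit(x, " ", fixed = TRUE)[[1]]   # split into words
  hit <- seq_along(words) %in% positions         # which word positions to touch
  words[hit] <- sub(pattern, replacement, words[hit])
  paste(words, collapse = " ")
}

str1 <- "words before 879 then more words then 1001 again"
str2 <- "words before 879 then 50 replace not 1001 again and replace 1003 and 1005"

res1 <- replace_in_words(str1, 4:9)             # str1 has 9 words, so 4:9 means "4th onward"
res2 <- replace_in_words(str2, c(4:6, 12:15))   # positions past the last word are ignored
print(res1)
print(res2)
```

Splitting first makes the word positions explicit, at the cost of reassembling the string with single spaces.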
Use the following pattern with perl=TRUE (in an R string, double each backslash):
^\s*(?:\S+\s*){3}(*SKIP)(*FAIL)|\d+
--------------------------------------------------------------------------------
  ^                        the beginning of the string
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (3 times):
--------------------------------------------------------------------------------
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  ){3}                     end of grouping
--------------------------------------------------------------------------------
  (*SKIP)(*FAIL)           skip the match, search for next match
--------------------------------------------------------------------------------
  |                        OR
--------------------------------------------------------------------------------
  \d+                      digits (0-9) (1 or more times (matching
                           the most amount possible))
Code example:
gsub("^\\s*(?:\\S+\\s*){3}(*SKIP)(*FAIL)|\\d+", "@", "words before 879 then more words then 1001 again", perl=TRUE)
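If the number of words to skip varies, the pattern can be built from a parameter. This is a minimal sketch, assuming the input contains at least n whitespace-separated words (shorter inputs are not handled here); the function name replace_after_n_words and its arguments are hypothetical.

```r
# Hypothetical wrapper: ignore the first `n` whitespace-separated words,
# then replace every match of `pattern` in the rest of the string.
# Assumption: `x` contains at least `n` words.
replace_after_n_words <- function(x, n, pattern = "\\d+", replacement = "@") {
  # Same (*SKIP)(*FAIL) construction as in the gsub() call, with the count taken from `n`.
  skip_first_n <- sprintf("^\\s*(?:\\S+\\s*){%d}(*SKIP)(*FAIL)", n)
  gsub(paste0(skip_first_n, "|", pattern), replacement, x, perl = TRUE)
}

s <- "words before 879 then more words then 1001 again"
res3 <- replace_after_n_words(s, 3)   # 879 is word 3, so it is kept
res8 <- replace_after_n_words(s, 8)   # 1001 is word 8, so it is kept as well
print(res3)
print(res8)
```

Because the skipped prefix is consumed by the first alternative, the replacement pattern itself stays unchanged and can be swapped for any other PCRE pattern.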