I have the following:
x <- c("Sao Paulo - Paulista - SP", "Minas Gerais - Mineiro - MG", "Rio de Janeiro - Carioca -RJ")
I want to keep "Paulista", "Mineiro", "Carioca".
I am trying something like this with gsub:
y <- gsub("\\$-*","",x)
But it doesn't work.
Two quick approaches:
x<- c(" Sao Paulo - Paulista - SP", "Minas Gerais - Mineiro - MG", "Rio de Janeiro - Carioca -RJ")
The first is a standard sub solution. If a string has no hyphens, the pattern does not match and the full string is returned unmodified.
trimws(sub("^[^-]*-([^-]*)-.*$", "\\1", x))
# [1] "Paulista" "Mineiro" "Carioca"
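As a minimal sketch of that fallback behaviour, the following adds one input without hyphens. The vector x2 and the string "No hyphens here" are assumptions made up for illustration; they are not part of the original data.

```r
# Hypothetical input: one string that matches, one with no hyphens at all
x2 <- c(" Sao Paulo - Paulista - SP", "No hyphens here")

# Same call as above; the backreference is written "\\1" in R source
res <- trimws(sub("^[^-]*-([^-]*)-.*$", "\\1", x2))

# res[1] is "Paulista"; res[2] is "No hyphens here", returned unchanged
res
```

No error is raised for the second string, so a failed match can go unnoticed unless the result is compared with the input.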
Inside the sub call:
"^[^-]*-([^-]*)-.*$"
^ beginning of each string, avoids mid-string matches
[^-]* matches 0 or more non-hyphen characters
- literal hyphen
([^-]*) matches and stores 0 or more non-hyphen characters
- literal hyphen
.* 0 or more of anything (incl hyphens)
$ end of each string
"\\1" replace everything that matches with the stored substring
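The next approach splits each string on "-" into a list and then indexes the second element. If a string without hyphens is present, this fails with a "subscript out of bounds" error.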
trimws(sapply(strsplit(x, "-"), `[[`, 2))
# [1] "Paulista" "Mineiro" "Carioca"
An example call to strsplit:
strsplit(x[[1]], "-")
# [[1]]
# [1] " Sao Paulo " " Paulista " " SP"
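So the second element is " Paulista " (with extra leading/trailing whitespace). The surrounding sapply always grabs the second element (which is why it errors when a string doesn't match).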
Both solutions use trimws to strip leading and trailing whitespace.
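If an error on non-matching strings is not wanted, one possible variation (a sketch, not part of the original answer) checks the number of pieces before indexing and returns NA instead. The names x2, parts and second_or_na are hypothetical.

```r
# Hypothetical input: one string with hyphens, one without
x2 <- c(" Sao Paulo - Paulista - SP", "No hyphens here")

# fixed = TRUE treats "-" as a literal string rather than a regex
parts <- strsplit(x2, "-", fixed = TRUE)

# Take the second piece when it exists, otherwise a missing value
second_or_na <- vapply(
  parts,
  function(p) if (length(p) >= 2) p[[2]] else NA_character_,
  character(1)
)

# trimws leaves NA as NA
res <- trimws(second_or_na)
res
```

Using vapply with character(1) also guarantees a character vector result, whereas sapply may return a list if the pieces differ in shape.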
We can simply call sub:
x <- c(" Sao Paulo - Paulista - SP",
"Minas Gerais - Mineiro - MG",
"Rio de Janeiro - Carioca -RJ")
sub("^.*-\\s+(.*?)\\s+-.*$", "\\1", x)
[1] "Paulista" "Mineiro" "Carioca"
The idea is to capture whatever occurs between the two dashes in each string.
^.*-\s+ from the start, consume everything up to and including the first dash
(.*?) then match and capture everything up until the second dash
\s+-.*$ consume everything after and including the second dash
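A short sketch to check this pattern against the same three inputs. Inside an R string literal each regex backslash is written twice, so the regex \s appears as "\\s" and the backreference \1 as "\\1". The variable names by_sub and by_split are assumptions for illustration; the cross-check uses the split-and-index approach with base R strsplit.

```r
x <- c(" Sao Paulo - Paulista - SP",
       "Minas Gerais - Mineiro - MG",
       "Rio de Janeiro - Carioca -RJ")

# Regex approach: capture the text between the two dashes,
# with the surrounding whitespace consumed by \s+
by_sub <- sub("^.*-\\s+(.*?)\\s+-.*$", "\\1", x)

# Split approach: second piece after splitting on "-", then trimmed
by_split <- trimws(sapply(strsplit(x, "-"), `[[`, 2))

# Both approaches should agree on these inputs
identical(by_sub, by_split)
```

Because the whitespace is matched outside the capture group, this version needs no trimws call, but it does rely on at least one whitespace character after the first dash and before the second.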