Using gsub to keep the middle word of a phrase separated by dashes in R



I have the following:

x <- c("Sao Paulo - Paulista - SP", "Minas Gerais - Mineiro - MG", "Rio de Janeiro - Carioca -RJ")

I want to keep "Paulista", "Mineiro", "Carioca".

I am trying something like this with gsub:

y <- gsub("\\$-*","",x)

But it doesn't work.

Two quick approaches:

x<- c(" Sao Paulo - Paulista - SP", "Minas Gerais - Mineiro - MG", "Rio de Janeiro - Carioca -RJ")

The first is a standard sub solution; if a string contains no hyphens, it is returned in full, unmodified.

trimws(sub("^[^-]*-([^-]*)-.*$", "\\1", x))
# [1] "Paulista" "Mineiro"  "Carioca" 
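
The fallback behavior described above (a string without hyphens comes back unchanged) can be checked with a small sketch. The vector `z` and its last element are made up for illustration and are not part of the original question.

```r
# Hypothetical input: the last element contains no hyphen at all
z <- c("Sao Paulo - Paulista - SP",
       "Rio de Janeiro - Carioca -RJ",
       "no hyphens here")

# Same pattern as above; "\\1" refers to the captured middle field.
# When the pattern does not match, sub() returns the input untouched.
res <- trimws(sub("^[^-]*-([^-]*)-.*$", "\\1", z))
res
# res is c("Paulista", "Carioca", "no hyphens here")
```

Whether silently passing through unmatched strings is desirable depends on the data; it avoids errors but can hide malformed rows.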

Inside the sub:

"^[^-]*-([^-]*)-.*$"
 ^                   beginning of each string, avoids mid-string matches
  [^-]*              matches 0 or more non-hyphen characters
       -             literal hyphen
        ([^-]*)      matches and stores 0 or more non-hyphen characters
               -     literal hyphen
                .*   0 or more of anything (incl hyphens)
                  $  end of each string
"\\1"                replace everything that matches with the stored substring

The next approach splits each string on "-" into a list and then indexes the second element. If any string has no hyphen, this fails with a "subscript out of bounds" error.

trimws(sapply(strsplit(x, "-"), `[[`, 2))
# [1] "Paulista" "Mineiro"  "Carioca" 

An example call to strsplit:

strsplit(x[[1]], "-")
# [[1]]
# [1] " Sao Paulo " " Paulista "  " SP"        

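So the second element is " Paulista " (with extra leading/trailing whitespace). The surrounding sapply always grabs the second element (which is what causes the error when a string doesn't match).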
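
If the out-of-bounds error is a concern, one possible variant returns NA for strings that split into fewer than two pieces. This is only a sketch: the helper name `second_field` and its `sep` argument are hypothetical and not part of the answer above.

```r
# Hypothetical helper (not from the original answer): returns NA instead of
# raising "subscript out of bounds" when a string has no separator.
second_field <- function(s, sep = "-") {
  parts <- strsplit(s, sep, fixed = TRUE)
  # vapply guarantees a character vector of the same length as s
  vapply(
    parts,
    function(p) if (length(p) >= 2) trimws(p[[2]]) else NA_character_,
    character(1)
  )
}

fields <- second_field(c("Minas Gerais - Mineiro - MG", "no hyphens here"))
fields
# fields is c("Mineiro", NA)
```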

Both solutions use trimws to strip leading and trailing whitespace.

We can just call sub:

x <- c(" Sao Paulo - Paulista - SP",
"Minas Gerais - Mineiro - MG",
"Rio de Janeiro - Carioca -RJ")
sub("^.*-\\s+(.*?)\\s+-.*$", "\\1", x)
# [1] "Paulista" "Mineiro"  "Carioca"

The idea is to capture whatever occurs between the two dashes in each string.

^.*-\s+   from the start, consume everything up to and including the first dash
(.*?)      then match and capture everything up until the second dash
\s+-.*$   consume everything after and including the second dash
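
Because this pattern requires whitespace on both sides of the middle word, a string whose dashes have no surrounding spaces is not matched and comes back unchanged. The sketch below illustrates this; the vector `w` is a made-up input, and `perl = TRUE` is an assumption added here so that `\\s` and the lazy `.*?` are handled by PCRE.

```r
# Hypothetical inputs: the second one has no spaces around its dashes
w <- c("Sao Paulo - Paulista - SP", "A-B-C")

# perl = TRUE is added for this sketch only (not in the answer above)
out <- sub("^.*-\\s+(.*?)\\s+-.*$", "\\1", w, perl = TRUE)
out
# out is c("Paulista", "A-B-C")
```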
