Splitting a character vector by length


I have a character vector like the following:

text <- c(
"My test",
"Test2",
"Tests",
"Dolphin Sentimental S.r.l.", "Tiger Sentiyapa S.r.l.", 
"Effort rate calculates to grant (Debt to Income Rate)", 
"Amount of pensions received mens.", 
"(Grant data) (Pension Received (Monthly Basis))", 
"Effort rate calculates to grant (Debt to Income Rate)", 
"Amount of pensions received mens.", 
"(Grant data) (Pension Received (Monthly Basis))"
)

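If the total number of characters in the whole vector (as shown above) is greater than 100, I want to split it into multiple character vectors, each with a total below 100 characters.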

nRun <- ceiling(sum(nchar(text),na.rm = T)/100)
cutsIter <- ceiling(quantile(1:length(text),probs = seq.int(0,1,(1/nRun))))

New character vector:

text[cutsIter[1]:cutsIter[2]]

Expected result: the first 5 elements should end up in one vector. The 6th and 7th should be together in the next vector, and so on.
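To see why the boundaries fall where they do, it can help to look at the per-element character counts next to their running total. This is only a minimal base R sketch; the names `n`, `totals`, and the data frame columns are illustrative and not part of the question.

```r
text <- c(
  "My test",
  "Test2",
  "Tests",
  "Dolphin Sentimental S.r.l.", "Tiger Sentiyapa S.r.l.",
  "Effort rate calculates to grant (Debt to Income Rate)",
  "Amount of pensions received mens.",
  "(Grant data) (Pension Received (Monthly Basis))",
  "Effort rate calculates to grant (Debt to Income Rate)",
  "Amount of pensions received mens.",
  "(Grant data) (Pension Received (Monthly Basis))"
)

n <- nchar(text)     # characters per element
totals <- cumsum(n)  # running total across the whole vector

# One row per element: its position, its length, and the running total so far
data.frame(element = seq_along(text), nchar = n, running_total = totals)
```

The running total first passes 100 at the sixth element, which is why the first group is expected to end after element 5.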

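You can do it like this. I am sure there are better ways, and this solution can certainly be improved. I chose to write a custom function for it. Note that there is still a problem when only one element remains and its nchar is equal to 100; adjust that case to your liking.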

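out <- c()  # group sizes collected across recursive calls; reset before every run
x <- nchar(text)
fn <- function(x) {
  if (max(cumsum(x)) < 100) {
    # Everything that is left fits into one final group
    return(c(out, length(x)))
  }
  # Largest prefix whose running total stays below 100
  ind <- max(which(cumsum(x) < 100))
  out <<- c(out, ind)
  # Drop the elements just grouped and recurse on the remainder
  fn(x[-seq_len(ind)])
}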
# The result of the function is the indices for us to split the vector
tmp <- fn(nchar(text))
tmp
[1] 5 2 1 2 1
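The recursion above relies on a global `out` that has to be reset before each run. As a sketch of one alternative, the same greedy grouping can be written as a loop that keeps its state local. The name `group_sizes` and its `limit` argument are hypothetical, and placing an element that reaches the limit on its own into a separate group is an assumption about the desired behavior, not something the answer specifies.

```r
# Hypothetical helper (not part of base R or the original answer):
# returns the size of each consecutive group so that every group's
# total stays strictly below `limit` whenever that is possible.
group_sizes <- function(n, limit = 100) {
  if (length(n) == 0) return(integer(0))
  sizes <- integer(0)
  count <- 0L  # number of elements in the current group
  total <- 0   # characters in the current group
  for (len in n) {
    # Close the current group if adding this element would reach the limit
    if (count > 0L && total + len >= limit) {
      sizes <- c(sizes, count)
      count <- 0L
      total <- 0
    }
    count <- count + 1L
    total <- total + len
  }
  c(sizes, count)  # append the last, still open group
}

lens <- c(40, 50, 30, 120, 10)  # made-up lengths for illustration
sizes <- group_sizes(lens)
split(lens, rep(seq_along(sizes), sizes))
```

Called as `group_sizes(nchar(text))`, the result can be passed to `rep()` and `split()` in the same way as `tmp`.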

If we apply this to the vector text:

split(text, rep(seq_len(length(tmp)), tmp))
$`1`
[1] "My test"                    "Test2"                      "Tests"                     
[4] "Dolphin Sentimental S.r.l." "Tiger Sentiyapa S.r.l."    
$`2`
[1] "Effort rate calculates to grant (Debt to Income Rate)"
[2] "Amount of pensions received mens."                    
$`3`
[1] "(Grant data) (Pension Received (Monthly Basis))"
$`4`
[1] "Effort rate calculates to grant (Debt to Income Rate)"
[2] "Amount of pensions received mens."                    
$`5`
[1] "(Grant data) (Pension Received (Monthly Basis))"

Finally, if you want to create each group as a separate vector in the global environment:

split(text, rep(seq_len(length(tmp)), tmp)) |>
setNames(paste0("vec", seq_along(tmp))) |>
list2env(envir = globalenv())
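Writing objects into the global environment with `list2env()` works, but keeping the groups in a named list is often easier to loop over. A small sketch with stand-in data follows; the `groups` list here is made up for illustration and stands for the `split()` result.

```r
# Stand-in for the list returned by split(); the values are made up
groups <- list(c("a", "bb"), "ccc")
names(groups) <- paste0("vec", seq_along(groups))

# Access a single group by name
groups$vec1
groups[["vec2"]]

# Total number of characters per group, without creating global objects
group_totals <- vapply(groups, function(g) sum(nchar(g)), integer(1))
group_totals
```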

There is a handy predefined function, MESS::cumsumbinning(), that you can easily use in scenarios like this:

text <- c(
"My test",
"Test2",
"Tests",
"Dolphin Sentimental S.r.l.", "Tiger Sentiyapa S.r.l.", 
"Effort rate calculates to grant (Debt to Income Rate)", 
"Amount of pensions received mens.", 
"(Grant data) (Pension Received (Monthly Basis))", 
"Effort rate calculates to grant (Debt to Income Rate)", 
"Amount of pensions received mens.", 
"(Grant data) (Pension Received (Monthly Basis))"
)
library(MESS)
split(text, cumsumbinning(nchar(text), 100))
#> $`1`
#> [1] "My test"                    "Test2"                     
#> [3] "Tests"                      "Dolphin Sentimental S.r.l."
#> [5] "Tiger Sentiyapa S.r.l."    
#> 
#> $`2`
#> [1] "Effort rate calculates to grant (Debt to Income Rate)"
#> [2] "Amount of pensions received mens."                    
#> 
#> $`3`
#> [1] "(Grant data) (Pension Received (Monthly Basis))"      
#> [2] "Effort rate calculates to grant (Debt to Income Rate)"
#> 
#> $`4`
#> [1] "Amount of pensions received mens."              
#> [2] "(Grant data) (Pension Received (Monthly Basis))"
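If installing MESS is not an option, the binning shown above can be approximated in base R. The helper below is a hypothetical sketch, not the MESS implementation. It assumes a group may total up to and including `threshold`, which is consistent with the output above, where the third group holds two elements that together reach exactly 100.

```r
# Hypothetical helper, written for illustration only:
# assigns a group id to each element so that the running total
# inside a group does not exceed `threshold` (inclusive limit).
cumsum_groups <- function(n, threshold) {
  groups <- integer(length(n))
  current <- 1L  # id of the group being filled
  total <- 0     # running total inside the current group
  for (i in seq_along(n)) {
    # Start a new group when this element would push the total past the threshold
    if (i > 1L && total + n[i] > threshold) {
      current <- current + 1L
      total <- 0
    }
    total <- total + n[i]
    groups[i] <- current
  }
  groups
}

words <- c("alpha", "beta", "gamma", "delta")  # made-up example data
binned <- split(words, cumsum_groups(nchar(words), 10))
binned
```

With the vector from the question, `split(text, cumsum_groups(nchar(text), 100))` is intended to give the same four groups as the `cumsumbinning()` call.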

Needless to say, if you want to save each of the items above as a separate object, use list2env as follows:

split(text, cumsumbinning(nchar(text), 100)) |>
list2env(envir = .GlobalEnv)

If you want the group totals to stay strictly below the limit, use a threshold of 99 instead:

split(text, cumsumbinning(nchar(text), 99))
$`1`
[1] "My test"                   
[2] "Test2"                     
[3] "Tests"                     
[4] "Dolphin Sentimental S.r.l."
[5] "Tiger Sentiyapa S.r.l."    
$`2`
[1] "Effort rate calculates to grant (Debt to Income Rate)"
[2] "Amount of pensions received mens."                    
$`3`
[1] "(Grant data) (Pension Received (Monthly Basis))"
$`4`
[1] "Effort rate calculates to grant (Debt to Income Rate)"
[2] "Amount of pensions received mens."                    
$`5`
[1] "(Grant data) (Pension Received (Monthly Basis))"
