R: Extract the trailing digits from a character vector



In a data frame, one of the columns holds text data that looks like this:

df <- data.frame("Index" = 1:3, "Content" = c("Happy 2021! word count: 2",
"Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. word count:100",
"Thank you very much for your time. word count: 7"))

The last few characters are always "word count: n". I want to extract n and put it in a new column.

I tried writing a function to do this:

wordCount = function (x) {
digit = -1
while(is.numeric(str_sub(essay$content,digit,-1))){
digit = digit -1
}
str_sub(essay$content,digit,-1)
}

But it doesn't work: is.numeric(str_sub(essay$content, digit, -1)) always returns FALSE, because R treats this column as character.
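To illustrate the point, here is a minimal base R sketch (the variable names `x` and `last_char` are made up for this example). `is.numeric()` checks the storage type of its argument, not whether the text happens to look like a number, so a substring of a character column never passes it. A content check needs something like a regular expression instead.

```r
x <- "Thank you very much for your time. word count: 7"

# Take the last character of the string; the result is still a character vector.
last_char <- substr(x, nchar(x), nchar(x))

is.numeric(last_char)         # FALSE: the type is character, whatever the content
is.character(last_char)       # TRUE
grepl("^[0-9]+$", last_char)  # TRUE: this tests the content, not the type
```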

Does anyone have a better approach?

In base R, you can use:

as.numeric(gsub(".*:", "", df$Content))
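The one-liner works because `.*:` is greedy and strips everything up to the last colon, and `as.numeric()` tolerates the leading space that may be left over. If some rows might lack the "word count:" suffix, a stricter variant could look like the sketch below. The sample vector `content` and the name `pattern` are assumptions for illustration, not part of the question's data.

```r
# Hypothetical sample: one row has an extra colon, one has no word count at all.
content <- c("Happy 2021! word count: 2",
             "note: see 10:30. word count:100",
             "no count here")

# Anchor on the literal suffix and capture only the digits at the very end.
pattern <- "^.*word count:[[:space:]]*([0-9]+)[[:space:]]*$"

matched <- grepl(pattern, content)
word_count <- rep(NA_real_, length(content))
word_count[matched] <- as.numeric(sub(pattern, "\\1", content[matched]))

word_count
# [1]   2 100  NA
```

Rows that do not match stay NA instead of triggering a coercion warning.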

I would use stringi::stri_extract_last_regex to get the last run of digits matching a numeric regex pattern. This should work:

library(stringi)
df$word_count = as.numeric(stri_extract_last_regex(df$Content, "[0-9]+"))
df["word_count"]
#   word_count
# 1          2
# 2        100
# 3          7
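Taking the last match matters here: the first row also contains "2021", which a first-match extraction would return. If adding a package is not an option, roughly the same idea can be sketched in base R with `gregexpr()` and `regmatches()`. The helper name `last_number` is made up for this example.

```r
# Hypothetical helper: return the last run of digits in each string, or NA if none.
last_number <- function(x) {
  all_matches <- regmatches(x, gregexpr("[0-9]+", x))
  vapply(all_matches,
         function(m) if (length(m) > 0) as.numeric(m[length(m)]) else NA_real_,
         numeric(1))
}

last_number(c("Happy 2021! word count: 2",
              "Thank you very much for your time. word count: 7"))
# [1] 2 7
```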
