I have a data frame in which one column holds text data, like this:
df <- data.frame("Index" = 1:3, "Content" = c("Happy 2021! word count: 2",
"Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. word count:100",
"Thank you very much for your time. word count: 7"))
The last few characters are always "word count: n". I want to extract n and put it in a new column.
I tried writing a function to do this:
wordCount = function (x) {
digit = -1
while(is.numeric(str_sub(essay$content,digit,-1))){
digit = digit -1
}
str_sub(essay$content,digit,-1)
}
But it doesn't work, because is.numeric(str_sub(essay$content,digit,-1)) always returns FALSE, since R treats this column as character.
Does anyone have a better approach?
In base R, you can use:
as.numeric(gsub(".*:", "", df$Content))
I would use stringi::stri_extract_last_regex to get the last run of digits in each string that matches a numeric regex pattern. This should work:
library(stringi)
df$word_count = as.numeric(stri_extract_last_regex(df$Content, "[0-9]+"))
df["word_count"]
# word_count
# 1 2
# 2 100
# 3 7
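Both answers above assume every row really does end in the count. If some rows might lack the label, a stricter base R variant could anchor the pattern on the literal "word count:" text and leave NA where it is missing. This is only a sketch: the sample data is shortened, the third row is a hypothetical added to show the no-match case, and it assumes the label is always spelled exactly "word count:".

```r
# Hypothetical sample data; the third row has no trailing count on purpose
df <- data.frame(
  Index = 1:3,
  Content = c("Happy 2021! word count: 2",
              "Today is a good day. word count:100",
              "Note: no count here"),
  stringsAsFactors = FALSE
)

# Match only digits that follow a trailing "word count:" label
pattern <- ".*word count:[[:space:]]*([0-9]+)[[:space:]]*$"
matched <- grepl(pattern, df$Content)

# Start with NA everywhere, then fill in the rows that matched
df$word_count <- NA_real_
df$word_count[matched] <- as.numeric(sub(pattern, "\\1", df$Content[matched]))
```

Converting only the matched rows avoids the coercion warning that as.numeric() would raise on strings that are not numbers.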