R - Frequency of each word in a set of strings



I have a column in a data frame where each row is a string. I want to get the frequency of each word in this column.

I tried:

prov <- df$column_x %>%
    na.omit() %>%
    tolower() %>%
    gsub("[,;?']", " ",.)
sort(table(prov), decreasing = TRUE)

This gives me the number of times each string is repeated.

How can I get the number of times each word is repeated?

A pipe can do the job: split each string into words, flatten the result into a single vector, and tabulate it.

df <- data.frame(column_x = c("hello world", "hello morning hello", 
                              "bye bye world"), stringsAsFactors = FALSE)
library(dplyr)  # provides the %>% pipe
df$column_x %>%
  na.omit() %>%
  tolower() %>%
  strsplit(split = " ") %>% # or strsplit(split = "\\W") to split on any non-word character
  unlist() %>%
  table() %>%
  sort(decreasing = TRUE)

It sounds like you want a document-term matrix...

library(tm)
corp <- Corpus(VectorSource(df$x)) # convert column of strings into a corpus
dtm <- DocumentTermMatrix(corp)    # create document term matrix
> as.matrix(dtm)
    Terms
Docs hello world morning bye
   1     1     1       0   0
   2     2     0       1   0
   3     0     1       0   2

If you want to join it to the original data frame, you can do that too:

cbind(df, data.frame(as.matrix(dtm)))
                    x hello world morning bye
1         hello world     1     1       0   0
2 hello morning hello     2     0       1   0
3       bye bye world     0     1       0   2

Sample data used:

df <- data.frame(
  x = c("hello world", 
        "hello morning hello", 
        "bye bye world"),
  stringsAsFactors = FALSE
)
> df
                    x
1         hello world
2 hello morning hello
3       bye bye world
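
If installing tm is not an option, a similar per-row count matrix can be approximated in base R. This is only a minimal sketch, assuming the sample data above. The names `tokens`, `vocab` and `counts` are illustrative and not part of the original answer, and none of tm's preprocessing is applied.

```r
# Sample data, same as in the answer above
df <- data.frame(
  x = c("hello world", "hello morning hello", "bye bye world"),
  stringsAsFactors = FALSE
)

# Split each row into lower-case words on runs of non-word characters
tokens <- strsplit(tolower(df$x), "\\W+")

# Vocabulary: every distinct word, in order of first appearance
vocab <- unique(unlist(tokens))

# One row per document, one column per word
counts <- t(vapply(
  tokens,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(counts) <- vocab

# Per-row counts next to the original strings
cbind(df, as.data.frame(counts))

# Overall frequency of each word across all rows
total <- colSums(counts)
total
```

Summing the columns collapses the per-row matrix into the overall word frequencies the question asks for.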

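You can collapse the column into a single string, split that string into words with the regular expression \W (any non-word character, written "\\W" in an R string), and count the frequency of each word with the table function.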

library(dplyr)
x <- c("First part of some text,", "another part of text,",NA , "last part of text.")
x <- x %>% na.omit() %>% tolower() 
xx <- paste(x, collapse = " ")
xxx <- unlist(strsplit(xx, "\\W"))  # the backslash must be doubled inside an R string
table(xxx)
xxx
        another   first    last      of    part    some    text 
      2       1       1       1       3       3       1       3 
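
The unnamed first column in the output above is the empty string: splitting on a single non-word character leaves an empty token wherever a comma is followed by a space. One possible variation, sketched here with the same sample vector, splits on runs of non-word characters and drops any empty tokens that remain. The names `words` and `freq` are illustrative.

```r
x <- c("First part of some text,", "another part of text,", NA, "last part of text.")

# Drop missing values, lower-case, and split on runs of non-word characters
words <- unlist(strsplit(tolower(x[!is.na(x)]), "\\W+"))

# Remove any empty tokens (for example from leading punctuation)
words <- words[nzchar(words)]

# Count each word and sort from most to least frequent
freq <- sort(table(words), decreasing = TRUE)
freq

# The same counts as a data frame with columns `words` and `Freq`
as.data.frame(freq, stringsAsFactors = FALSE)
```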
