Document-term matrix in R from two data frames



I want to create a document-term matrix in R from two data frames.

For example, the first data frame contains the text.

df1

category     text
person1      "hello word I like turtles"
person2      "re: turtles! I think turtles are stellar!"
person3      "sunflowers are nice."

The second data frame has a column containing all the terms of interest.

df2

col1    term
x       turtles
y       hello
w       sunflowers
f       I

The resulting matrix should show how many times each person used each word in df2$term.

Result

category    turtles     hello     sunflowers     I         
person1       1           1            0         1
person2       2           0            0         1
person3       0           0            1         0

Help!

Here's a hack using regular expressions, sapply and merge:

category = c('person1','person2','person3')
text = c("hello word I like turtles", "re: turtles! I think turtles are stellar!", "sunflowers are nice.")
# stringsAsFactors=FALSE keeps both columns as character vectors (only the default since R 4.0)
df1 = data.frame(category=category, text=text, stringsAsFactors=FALSE)
df2 = data.frame(term=c('turtles','hello','sunflowers','I'), stringsAsFactors=FALSE)
# For one pattern, return its count in every row of df1$text
f = function(pattern){
  patterncount = function(x){ # Counts occurrences of pattern in a string
    if (grepl(x, pattern=pattern)){
      length(gregexpr(x, pattern=pattern)[[1]])
    } else{
      0
    }
  }
  sapply(df1$text, FUN = patterncount)
}
# One column per term; sapply names the rows after the texts
df3 = data.frame(sapply(df2$term, FUN=f))
df3$text = row.names(df3)
result = merge(df1, df3, by='text')
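
The same counting can probably be done without the grepl branch or the merge step. The following is only a sketch in base R, not the answerer's code: count_term is a hypothetical helper name, and fixed = TRUE is an assumption that the terms should be matched literally rather than as regular expressions.

```r
df1 <- data.frame(
  category = c("person1", "person2", "person3"),
  text = c("hello word I like turtles",
           "re: turtles! I think turtles are stellar!",
           "sunflowers are nice."),
  stringsAsFactors = FALSE
)
df2 <- data.frame(term = c("turtles", "hello", "sunflowers", "I"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: count literal occurrences of one term in every text.
# regmatches() returns character(0) where nothing matched, so lengths() gives 0 there.
count_term <- function(term, texts) {
  matches <- gregexpr(term, texts, fixed = TRUE)
  lengths(regmatches(texts, matches))
}

# One column per term, one row per text
counts <- sapply(df2$term, count_term, texts = df1$text)
result <- cbind(df1["category"], counts)
print(result)
```

Because the rows stay in the order of df1, the counts can be bound to the category column directly, so no merge on the text column is needed.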

Using str_count from stringr:

library(stringr)
cbind(df1[1], sapply(df2$term, function(x) str_count(df1$text, x)))
#  category turtles hello sunflowers I
#1  person1       1     1          0 1
#2  person2       2     0          0 1
#3  person3       0     0          1 0
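
One thing to watch with pattern-based counting: a term is counted wherever it appears as a substring, so "I" would also be counted inside a word such as "India". If whole-word counts are what you need, one option is to split each text into tokens first. This is a rough sketch in base R; count_tokens is a hypothetical helper, and splitting on non-alphanumeric characters is an assumption about what counts as a word.

```r
texts <- c("hello word I like turtles",
           "re: turtles! I think turtles are stellar!",
           "sunflowers are nice.")
terms <- c("turtles", "hello", "sunflowers", "I")

# Hypothetical helper: split one text into tokens, then count exact
# (case-sensitive) matches for each term. Tokens that are not in `terms`
# become NA in the factor and are dropped by table().
count_tokens <- function(text, terms) {
  tokens <- strsplit(text, "[^[:alnum:]]+")[[1]]
  as.vector(table(factor(tokens, levels = terms)))
}

# sapply returns terms in rows, so transpose to get one row per text
dtm <- t(sapply(texts, count_tokens, terms = terms, USE.NAMES = FALSE))
colnames(dtm) <- terms
rownames(dtm) <- c("person1", "person2", "person3")
print(dtm)
```

For the three example texts this gives the same counts as the table in the question. The two approaches only differ when a term also occurs inside a longer word.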
