I want to create a document-term matrix in R from two data frames.
For example, the first data frame contains the text.
df1
category text
person1 "hello word I like turtles"
person2 "re: turtles! I think turtles are stellar!"
person3 "sunflowers are nice."
The second data frame has a column containing all the terms of interest.
df2
col1 term
x turtles
y hello
w sunflowers
f I
The resulting matrix would show how many times each person used each word in df2$term.
Result
category turtles hello sunflowers I
person1 1 1 0 1
person2 2 0 0 1
person3 0 0 1 0
Help!
Here is a hack using regular expressions, sapply and merge:
category = c('person1','person2','person3')
text = c("hello word I like turtles", "re: turtles! I think turtles are stellar!", "sunflowers are nice.")
# Keep text as character so sapply() names its result by the text itself
df1 = data.frame(category=category, text=text, stringsAsFactors=FALSE)
df2 = data.frame(term=c('turtles','hello','sunflowers','I'), stringsAsFactors=FALSE)
f = function(pattern){
  patterncount = function(x){ # Counts occurrences of pattern in a string
    if (grepl(x, pattern=pattern)){
      length(gregexpr(x, pattern=pattern)[[1]])
    } else{
      0 # gregexpr returns -1 (length 1) on no match, so handle that case here
    }
  }
  sapply(df1$text, FUN = patterncount)
}
# One column per term, one row per text (row names are the texts)
df3 = data.frame(sapply(df2$term, FUN=f))
df3$text = row.names(df3)
result = merge(df1, df3, by='text')
Using str_count from stringr:
library(stringr)
cbind(df1[1], sapply(df2$term, function(x) str_count(df1$text, x)))
# category turtles hello sunflowers I
#1 person1 1 1 0 1
#2 person2 2 0 0 1
#3 person3 0 0 1 0
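Note that both approaches above treat each term as a regular expression and count substring matches, so a term like I would also be counted inside a word such as "It". Here is a minimal sketch in base R, assuming whole-word matching is what is wanted and that the terms contain no regex metacharacters. count_terms is a hypothetical helper name introduced for illustration, not a function from any package.

```r
df1 <- data.frame(
  category = c("person1", "person2", "person3"),
  text = c("hello word I like turtles",
           "re: turtles! I think turtles are stellar!",
           "sunflowers are nice."),
  stringsAsFactors = FALSE
)
df2 <- data.frame(term = c("turtles", "hello", "sunflowers", "I"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: count whole-word occurrences of each term in each text.
# Assumes the terms are plain words (no regex metacharacters).
count_terms <- function(texts, terms) {
  counts <- vapply(terms, function(term) {
    pattern <- paste0("\\b", term, "\\b")  # \b anchors the match at word boundaries
    # regmatches() returns character(0) for texts with no match, so lengths() gives 0
    lengths(regmatches(texts, gregexpr(pattern, texts, perl = TRUE)))
  }, integer(length(texts)))
  # Force a matrix even when there is only one text
  matrix(counts, nrow = length(texts), dimnames = list(NULL, terms))
}

result <- cbind(df1["category"], count_terms(df1$text, df2$term))
print(result)
```

For the example data this should give the same counts as the str_count version; the difference only shows up for texts where a term occurs inside a longer word.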