I have text like this:
dat<-c("this is my farm this is my land")
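I want to get all possible two-word combinations and their frequencies. I can't use the tm package, so any other solution would be appreciated. The output should look like this: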
two words  freq
this is    2
is my      2
my farm    1
my land    1
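You can generate the combinations by splitting dat into words and then pasting each pair of consecutive words together. gregexpr can then be used to count how many times each combination occurs in the original string.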
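# Split the string into individual words
temp = unlist(strsplit(dat, " "))
# Build every pair of consecutive words, keeping each pair once
temp2 = unique(sapply(2:length(temp), function(i)
    paste(temp[(i-1):i], collapse = " ")))
# Count the matches of each pair in the original string
sapply(temp2, function(x)
    length(unlist(gregexpr(pattern = x, text = dat))))
#   this is     is my   my farm farm this   my land
#         2         2         1         1         1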
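One caveat: gregexpr treats the pattern as a regular expression and matches substrings, so a pair such as "his is" would also be found inside "this is". A minimal sketch of an alternative, assuming the words are separated by whitespace and the string has no leading whitespace, is to count the pairs directly with table(). The names words, bigrams, counts, and res are illustrative only:

```r
dat <- c("this is my farm this is my land")

# Split on runs of whitespace to get the individual words
words <- unlist(strsplit(dat, "\\s+"))

# Pair each word with the word that follows it
bigrams <- paste(head(words, -1), tail(words, -1))

# Count whole pairs rather than substring matches
counts <- table(bigrams)

# Reshape into a two-column data frame, most frequent first
res <- data.frame(two_words = names(counts),
                  freq = as.vector(counts),
                  stringsAsFactors = FALSE)
res <- res[order(-res$freq), ]
print(res, row.names = FALSE)
```

Like the gregexpr version, this also counts "farm this", which is not in the desired output listed in the question. Filter that row out if pairs that cross a sentence boundary are unwanted.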
Or, for three-word combinations:
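# Split the string into individual words
temp = unlist(strsplit(dat, " "))
# Build every run of three consecutive words, keeping each run once
temp2 = unique(sapply(3:length(temp), function(i)
    paste(temp[(i-2):i], collapse = " ")))
# Count the matches of each run in the original string
sapply(temp2, function(x)
    length(unlist(gregexpr(pattern = x, text = dat))))
#   this is my   is my farm my farm this farm this is   is my land
#            2            1            1            1            1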
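Since the two- and three-word versions differ only in the window size, they could be folded into one function. The following is a rough sketch, not part of the original answer: ngram_freq is a hypothetical helper name, and it assumes whitespace-separated words with no leading whitespace in the input.

```r
# Hypothetical helper: frequency table of runs of n consecutive words
ngram_freq <- function(text, n = 2) {
  words <- unlist(strsplit(text, "\\s+"))
  n_grams <- length(words) - n + 1
  # Too few words to form even one n-gram: return an empty table
  if (n_grams < 1) return(table(character(0)))
  # Paste together the n words starting at each position
  grams <- vapply(seq_len(n_grams),
                  function(i) paste(words[i:(i + n - 1)], collapse = " "),
                  character(1))
  table(grams)
}

dat <- c("this is my farm this is my land")
ngram_freq(dat, 2)
ngram_freq(dat, 3)
```

Because it counts the pasted strings with table() instead of searching the text with gregexpr, it only counts whole-word matches.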