Count the number of bigrams in sentences in a data frame



I have a dataset that looks something like this:

sentences <- c("sample text in sentence 1", "sample text in sentence 2")
id <- c(1,2) 
df <- data.frame(sentences, id)

I want to count the occurrences of certain bigrams in it. Say I have:

trigger_bg_1 <- "sample text"

I would expect an output of 2 (since "sample text" occurs once in each of the two sentences). I know how to do a count like this for a single word:

trigger_word <- "sample"   # the single word to look for
trigger_word_count <- 0
for (i in 1:nrow(df)) {
  # split the sentence into individual words
  words <- strsplit(as.character(df$sentences[i]), " ")
  for (w in unlist(words)) {
    if (w == trigger_word) {
      trigger_word_count <- trigger_word_count + 1
    }
  }
}

But I can't get it to work for a bigram. Any ideas on how I should change the code to make it work?

But since I have a long list of trigger words to test for, I need a way to count over all of them.
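The word-count loop above can be adapted to bigrams by pairing each word with the next one before comparing. A sketch along those lines (`trigger_bg_1` is from the question; the other variable names are mine):

```r
# Sketch: count occurrences of one bigram by pairing adjacent words.
sentences <- c("sample text in sentence 1", "sample text in sentence 2")
df <- data.frame(sentences, id = c(1, 2))
trigger_bg_1 <- "sample text"

bigram_count <- 0
for (i in 1:nrow(df)) {
  w <- unlist(strsplit(as.character(df$sentences[i]), " "))
  if (length(w) < 2) next              # a one-word sentence has no bigrams
  bg <- paste(w[-length(w)], w[-1])    # "sample text", "text in", ...
  bigram_count <- bigram_count + sum(bg == trigger_bg_1)
}
bigram_count
#[1] 2
```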

If you want to count the matching sentences, you can use grep:

length(grep(trigger_bg_1, sentences, fixed = TRUE))
#[1] 2

If you want to count how many times trigger_bg_1 is found, you can use gregexpr:

sum(unlist(lapply(gregexpr(trigger_bg_1, sentences, fixed = TRUE),
                  function(x) sum(x > 0))))
#[1] 2

You can take the sum of a grepl:

sum(grepl(trigger_bg_1, df$sentences))
#[1] 2

If you are really interested in bigrams, rather than just in set word combinations, the quanteda package offers a more substantive and systematic way forward:

Data:

sentences <- c("sample text in sentence 1", "sample text in sentence 2")
id <- c(1,2) 
df <- data.frame(sentences, id)

Solution:

library(quanteda)
# strip sentences down to words (removing punctuation):
words <- tokens(sentences, remove_punct = TRUE)
# make bigrams, tabulate them and sort them in decreasing order:
bigrams <- sort(table(unlist(as.character(tokens_ngrams(words, n = 2, concatenator = " ")))), decreasing = T)

Result:

bigrams
in sentence sample text     text in  sentence 1  sentence 2 
          2           2           2           1           1 

If you want to check the frequency count of a specific bigram:

bigrams["in sentence"]
in sentence 
2
