R tm package: dictionary matching gives higher frequencies than the actual number of words in the text



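I have been using the code below to load text as a corpus and clean it with the tm package. As a next step, I load a dictionary and clean it as well. I then match the words in the text against the dictionary to compute a score. However, the matching produces more matches than there are actual words in the text (for example, a competence score of 1500 when the text actually contains only 1000 words).

I think this is related to the stemming of the text and the dictionary, because the number of matches is lower when no stemming is performed.

Do you know why this happens?

Thank you very much.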

R code

Step 1: Store the data as a corpus

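library(tm)      # Corpus, tm_map, DocumentTermMatrix, stemDocument
library(here)    # here()
library(readxl)  # read_excel()

file.path <- file.path(here("Generated Files", "Data Preparation"))
corpus <- Corpus(DirSource(file.path))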

Step 2: Clean the data

#Removing special characters
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "/")
corpus <- tm_map(corpus, toSpace, "@")
corpus <- tm_map(corpus, toSpace, "\\|")
#Convert the text to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
#Remove numbers
corpus <- tm_map(corpus, removeNumbers)
#Remove english common stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))
#Remove your own stop word
# specify your stopwords as a character vector
corpus <- tm_map(corpus, removeWords, c("view", "pdf")) 
#Remove punctuations
corpus <- tm_map(corpus, removePunctuation)
#Eliminate extra white spaces
corpus <- tm_map(corpus, stripWhitespace)
#Text stemming
corpus <- tm_map(corpus, stemDocument)
#Unique words
corpus <- tm_map(corpus, unique)

Step 3: Build the document-term matrix (DTM)

dtm <- DocumentTermMatrix(corpus)

Step 4: Load the dictionary

dic.competence <- read_excel(here("Raw Data", "6. Dictionaries", "Brand.xlsx"))
dic.competence <- tolower(dic.competence$COMPETENCE)
dic.competence <- stemDocument(dic.competence)
dic.competence <- unique(dic.competence)

Step 5: Count frequencies

corpus.terms = colnames(dtm)
competence = match(corpus.terms, dic.competence, nomatch=0)
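
One thing worth checking at this step is what match() actually returns. In base R it gives the position of each term within the dictionary vector (0 where there is no match), not a 0/1 indicator, so summing the result adds up dictionary positions rather than counting matches. The sketch below uses made-up stemmed terms as stand-ins for colnames(dtm) and the dictionary; the values are assumptions for illustration only.

```r
# Toy stand-ins for colnames(dtm) and the stemmed dictionary (made-up values)
corpus.terms   <- c("abil", "brand", "compet", "market", "skill")
dic.competence <- c("expert", "skill", "compet", "abil")

# match() returns the POSITION of each term in the dictionary (0 if absent)
positions <- match(corpus.terms, dic.competence, nomatch = 0)
positions        # 4 0 3 0 2
sum(positions)   # 9, although only 3 terms matched

# %in% returns TRUE/FALSE membership, so its sum is the number of matched terms
is.match <- corpus.terms %in% dic.competence
sum(is.match)    # 3
```

If this is what happens with the real data, the sum depends on where the matched words sit in the dictionary, which would change whenever stemming and unique() reorder or shrink it.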

Step 6: Compute the score

competence.score = sum(competence) / rowSums(as.matrix(dtm))
competence.score.df = data.frame(scores = competence.score)
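
Note that sum(competence) is a single number for the whole corpus, while rowSums(as.matrix(dtm)) is one total per document. A possible alternative, sketched here on a small hand-made matrix standing in for as.matrix(dtm), is to count the dictionary terms per document and divide by that document's total. The matrix, document names, and dictionary entries are assumptions for illustration, not the original data.

```r
# Toy document-term matrix: 2 documents x 4 stemmed terms (made-up counts)
m <- matrix(c(3, 1, 0, 6,
              0, 2, 5, 3),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("doc1.txt", "doc2.txt"),
                            c("abil", "brand", "compet", "market")))
dic.competence <- c("expert", "skill", "compet", "abil")

# Logical mask over the columns: TRUE where the term is in the dictionary
in.dict <- colnames(m) %in% dic.competence

# Dictionary-word occurrences per document, then share of all words in that document
competence.count <- rowSums(m[, in.dict, drop = FALSE])
competence.score <- competence.count / rowSums(m)
competence.score.df <- data.frame(scores = competence.score)
competence.score.df   # doc1.txt 0.3, doc2.txt 0.5
```

Computed this way the score is a proportion of each document's words, so it cannot exceed 1.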

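What does competence return when you run that line? I don't know how your dictionary is put together, so I can't say for certain what is happening there. I used my own random corpus text as the main text and a separate corpus as the dictionary, and your code ran fine. The row names of competence.score.df were the names of the different txt files in my corpus, and the scores were all in the 0-1 range.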

# this is my 'dictionary' of terms:
tdm <- TermDocumentMatrix(Corpus(DirSource("./corpus/corpus3")),
                          control = list(removeNumbers = TRUE,
                                         stopwords = TRUE,
                                         stemming = TRUE,
                                         removePunctuation = TRUE))
# then I used your programming and it worked as I think you were expecting
# notice what I used here for the dictionary    
(competence = match(colnames(dtm),
                    Terms(tdm)[1:10], # I only used the first 10 in my test of your code
                    nomatch = 0))
(competence.score = sum(competence)/rowSums(as.matrix(dtm)))
(competence.score.df = data.frame(scores = competence.score))
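
A caveat on this test: with only 10 dictionary terms, the sum of the positions returned by match() can be at most 1 + 2 + ... + 10 = 55, which may be why the scores stayed between 0 and 1 for documents with more than 55 words. The sketch below uses hypothetical term names to show that bound. A longer dictionary would allow a much larger sum.

```r
# Hypothetical 10-entry dictionary and a corpus vocabulary that contains all of it
dict10 <- paste0("term", 1:10)
terms  <- c(dict10, "other1", "other2")

# Every dictionary entry is found, so the positions 1..10 are all summed
position.sum <- sum(match(terms, dict10, nomatch = 0))
position.sum   # 55
```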
