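In TermDocumentMatrix(), the parameter removeNumbers=TRUE removes Arabic numerals from an English corpus. How can I remove Roman numerals (e.g. "iii", "xiv", and "xiii", in any letter case) as well as Arabic numerals?
What custom function can I supply to the removeNumbers argument to achieve this?
The code I am trying to understand and modify: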
library(gutenbergr)
library(stringr)
library(dplyr)
library(tidyr)
library(tm)
library(topicmodels)
library(tidyverse)
library(tidytext)
library(slam)
titles = c("Wuthering Heights", "A Tale of Two Cities",
           "Alice's Adventures in Wonderland", "The Adventures of Sherlock Holmes")
## read in those books
books = gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title") %>%
  mutate(document = row_number())
## every line containing the word "chapter" starts a new chapter; lines before the first one get 0 and are dropped
create_chapters = books %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("\\bchapter\\b", ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter)
## collapse each chapter into a single string
by_chapter = create_chapters %>%
  group_by(document) %>%
  summarise(text = paste(text, collapse = ' '))
import_corpus = Corpus(VectorSource(by_chapter$text))
no_romans <- function(s) s[!grepl("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$", toupper(s))]
import_mat = DocumentTermMatrix(import_corpus,
  control = list(stemming = TRUE,            # create root words
                 stopwords = TRUE,           # remove stop words
                 wordLengths = c(3, Inf),    # cut out small words
                 removeNumbers = no_romans,  # take out the numbers
                 removePunctuation = TRUE))  # take out punctuation
The following check shows that Roman numerals are still present, e.g. "iii" and "xii".
> st = import_mat$dimnames$Term
> st[grepl("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$", toupper(st))]
[1] "cli" "iii" "mix" "vii" "viii" "xii" "xiii" "xiv"
[9] "xix" "xvi" "xvii" "xviii" "xxi" "xxii" "xxiii" "xxiv"
[17] "xxix" "xxv" "xxvi" "xxvii" "xxviii" "xxx" "xxxi" "xxxii"
[25] "xxxiii" "xxxiv"
Try these options.
library(tm)
dat <- VCorpus(VectorSource(c("iv. Chapter Four", "I really want to discuss the proper mix of 17 ingredients.", "Nothing to remove here.")))
inspect( DocumentTermMatrix(dat) )
# <<DocumentTermMatrix (documents: 3, terms: 13)>>
# Non-/sparse entries: 13/26
# Sparsity : 67%
# Maximal term length: 12
# Weighting : term frequency (tf)
# Sample :
# Terms
# Docs chapter discuss four here. ingredients. iv. mix nothing proper really
# 1 1 0 1 0 0 1 0 0 0 0
# 2 0 1 0 0 1 0 1 0 1 1
# 3 0 0 0 1 0 0 0 1 0 0
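One detail worth checking when adapting this to the code in the question: the example here builds its corpus with VCorpus(), while the question uses Corpus(). As far as I am aware, in recent tm releases Corpus() on a VectorSource returns a SimpleCorpus, and DocumentTermMatrix() on a SimpleCorpus ignores custom functions passed in control (with a warning). That may be why the Roman numerals survive in the question, though this is an assumption rather than something confirmed for that exact setup. A quick sketch for seeing which class you have (the docs vector is made up for illustration):

```r
library(tm)

# Illustrative documents only; any character vector would do.
docs <- c("iv. Chapter Four", "Nothing to remove here.")

# Compare the classes returned by the two constructors.
# Assumption: with a recent tm, the first is expected to report a SimpleCorpus.
class(Corpus(VectorSource(docs)))
class(VCorpus(VectorSource(docs)))
```

If the first call reports a SimpleCorpus, switching the question's Corpus() call to VCorpus() would be the first thing to try.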
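One of Gregor's caveats, the word "I", does not appear to be present, so we won't worry about it for now. Gregor's other caveat is the word "mix", which is both a legitimate word and a Roman numeral. A basic function to remove simple/whole Roman numerals might be: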
no_romans <- function(s) s[!grepl("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$", toupper(s))]
inspect( DocumentTermMatrix(dat, control = list(removeNumbers = no_romans)) )
# <<DocumentTermMatrix (documents: 3, terms: 12)>>
# Non-/sparse entries: 12/24
# Sparsity : 67%
# Maximal term length: 12
# Weighting : term frequency (tf)
# Sample :
# Terms
# Docs chapter discuss four here. ingredients. iv. nothing proper really remove
# 1 1 0 1 0 0 1 0 0 0 0
# 2 0 1 0 0 1 0 0 1 1 0
# 3 0 0 0 1 0 0 1 0 0 1
This removes "mix" but leaves "iv.". If you need to remove that as well, then perhaps:
no_romans2 <- function(s) s[!grepl("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})[.]?$", toupper(s))]
inspect( DocumentTermMatrix(dat, control = list(removeNumbers = no_romans2)) )
# <<DocumentTermMatrix (documents: 3, terms: 11)>>
# Non-/sparse entries: 11/22
# Sparsity : 67%
# Maximal term length: 12
# Weighting : term frequency (tf)
# Sample :
# Terms
# Docs chapter discuss four here. ingredients. nothing proper really remove the
# 1 1 0 1 0 0 0 0 0 0 0
# 2 0 1 0 0 1 0 1 1 0 1
# 3 0 0 0 1 0 1 0 0 1 0
(The only difference is the addition of [.]? near the end of the regular expression.)
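To see what the two filters do without building a matrix, they can be applied directly to a character vector. This is a minimal sketch: the tokens vector is made up for illustration, and it assumes the function receives one term per element, as the outputs above suggest.

```r
# Same two filters as above, repeated so the snippet runs on its own.
no_romans  <- function(s) s[!grepl("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$", toupper(s))]
no_romans2 <- function(s) s[!grepl("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})[.]?$", toupper(s))]

# Illustrative tokens: plain words, Roman numerals in both cases, one with a trailing dot.
tokens <- c("chapter", "iv.", "mix", "xiv", "XIII", "17", "here.")

res1 <- no_romans(tokens)   # keeps: chapter, iv., 17, here.
res2 <- no_romans2(tokens)  # keeps: chapter, 17, here.
print(res1)
print(res2)
```

Note that neither filter touches "17": a function passed as removeNumbers takes the place of the built-in Arabic-numeral removal, so digits have to be handled separately if the length filter does not already drop them.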
(As an aside: you can use grepl(..., ignore.case=TRUE) to get the same effect as the toupper(s) used here. It was slightly slower in small-sample tests, but the result is the same.)
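Since the question asks for Roman and Arabic numerals together, the two ideas can be combined in one filter. The sketch below is only one possible way to do it: no_numerals, roman and arabic are hypothetical names introduced here, it uses the ignore.case=TRUE variant from the aside, and it assumes the function is handed one term per element.

```r
# Patterns: a whole-token Roman numeral or a whole-token run of digits,
# each optionally followed by a single trailing dot.
roman  <- "^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})[.]?$"
arabic <- "^[0-9]+[.]?$"

# Hypothetical helper: drop tokens that consist entirely of a numeral.
no_numerals <- function(s) {
  is_roman  <- grepl(roman, s, ignore.case = TRUE)
  is_arabic <- grepl(arabic, s)
  s[!(is_roman | is_arabic)]
}

kept <- no_numerals(c("chapter", "XIV", "iv.", "mix", "17", "1859.", "b52", "here."))
print(kept)  # keeps: chapter, b52, here.
```

It would be passed the same way as before, e.g. control = list(removeNumbers = no_numerals). Unlike tm's own removeNumbers(), which as far as I know strips digits wherever they occur, this helper only drops tokens that are entirely numeric, so a term such as "b52" is left alone.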