I have a rather stubborn problem… I can't seem to remove the <U+F0B7> and <U+F0A0> strings from a corpus loaded into R from *.txt files:

UPDATE: here is a link to a sample .txt file: https://db.tt/qTRKpJYK
Corpus(DirSource("./SomeDirectory/txt/"), readerControl = list(reader = readPlain))
title
professional staff - contract - permanent position
software c microfocus cobol unix btrieve ibm vm-cms vsam cics jcl
accomplishments
<U+F0B7>
<U+F0A0>
responsible maintaining billing system interfaced cellular switching system <U+F0B7>
<U+F0A0>
developed unix interface ibm mainframe ericsson motorola att cellular switches
I tried adding them to:
badWords <- unique(c(stopwords("en"),
stopwords("SMART")[stopwords("SMART") != "c"],
as.character(1970:2050),
"<U+F0B7>", "<+f0b7>",
"<U+F0A0>", "<+f0a0>",
"january", "jan",
"february", "feb",
"march", "mar",
"april", "apr",
"may", "may",
"june", "jun",
"july", "jul",
"august", "aug",
"september", "sep",
"october", "oct",
"november", "nov",
"december", "dec"))
and then using:

tm_map(candidates.Corpus, removeWords, badWords)

but somehow this doesn't work. I also tried regexing it away with something like gsub("<U+F0A0>", "", tmp, perl = FALSE), which works on strings within R, but somehow the characters still show up when I read in the .txt files.

Is there something unique about these characters? How do I get rid of them?
OK. The problem is that the data has an unusual unicode character in it. In R we would normally escape this character as "\uf0b7". But when inspect() prints the data, it encodes it as "<U+F0B7>". Observe
sample<-c("Crazy uf0b7 Character")
cp<-Corpus(VectorSource(sample))
inspect(DocumentTermMatrix(cp))
# A document-term matrix (1 documents, 3 terms)
#
# Non-/sparse entries: 3/0
# Sparsity : 0%
# Maximal term length: 9
# Weighting : term frequency (tf)
#
# Terms
# Docs <U+F0B7> character crazy
# 1 1 1 1
(Actually I had to create this output on a Windows machine running R 3.0.2; it ran fine on a Mac running R 3.1.0.)
Unfortunately you won't be able to remove this with removeWords, because the regular expression used in that function requires word boundaries on either side of the "word", and this character isn't recognized as being adjacent to any boundary. See
gsub("uf0b7","",sample)
# [1] "Crazy Character"
gsub("\buf0b7\b","",sample)
#[1] "Crazy Character"
So we can write our own function to use with tm_map. Consider
removeCharacters <- function(x, characters) {
  gsub(sprintf("(*UCP)(%s)", paste(characters, collapse = "|")), "", x, perl = TRUE)
}
which is basically the removeWords function, just without the boundary conditions. Then we can run
cp2 <- tm_map(cp, removeCharacters, c("\uf0b7", "\uf0a0"))
inspect(DocumentTermMatrix(cp2))
# A document-term matrix (1 documents, 2 terms)
#
# Non-/sparse entries: 2/0
# Sparsity : 0%
# Maximal term length: 9
# Weighting : term frequency (tf)
#
# Terms
# Docs character crazy
# 1 1 1
and we see that those unicode characters are no longer there.
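One caveat worth noting: in more recent versions of tm (0.6 and later), tm_map expects a custom transformation to be wrapped in content_transformer(), so that the documents keep their PlainTextDocument class. A minimal sketch, assuming a current tm:

```r
library(tm)

# Same helper as above: a plain alternation without boundary anchors.
removeCharacters <- function(x, characters) {
  gsub(sprintf("(*UCP)(%s)", paste(characters, collapse = "|")), "", x, perl = TRUE)
}

cp <- Corpus(VectorSource(c("Crazy \uf0b7 Character")))

# content_transformer() adapts a function that operates on character
# vectors so tm_map can apply it to each document's content; extra
# arguments after the function are passed through to it.
cp2 <- tm_map(cp, content_transformer(removeCharacters), c("\uf0b7", "\uf0a0"))
```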