Removing unicode <U+F0B7> characters from corpus text



I have a rather stubborn problem: I can't seem to remove the <U+F0B7><U+F0A0> character strings from a corpus that was loaded into R from *.txt files:

UPDATE: here is a link to a sample .txt file: https://db.tt/qTRKpJYK

Corpus(DirSource("./SomeDirectory/txt/"), readerControl = list(reader = readPlain))

title
 professional staff - contract - permanent position
software c microfocus cobol unix btrieve ibm vm-cms vsam cics jcl
accomplishments
 <U+F0B7>
<U+F0A0>
responsible maintaining billing system interfaced cellular switching system <U+F0B7>
<U+F0A0>
developed unix interface ibm mainframe ericsson motorola att cellular switches

I have tried adding them to my list of stop words:

badWords <- unique(c(stopwords("en"), 
          stopwords("SMART")[stopwords("SMART") != "c"],
          as.character(1970:2050),
          "<U+F0B7>", "<+f0b7>",
          "<U+F0A0>", "<+f0a0>",
          "january",  "jan",
          "february",   "feb",
          "march",  "mar",
          "april",  "apr",
          "may",    "may",
          "june",   "jun",
          "july",   "jul",
          "august", "aug",
          "september",  "sep",
          "october",    "oct",
          "november",   "nov",
          "december",   "dec"))
and using:

tm_map(candidates.Corpus, removeWords, badWords)

But somehow this does not work. I have also tried regexing them out with something like gsub("<+f0a0>", "", tmp, perl = FALSE), which works on strings within R, but somehow the characters still show up when I read in the .txt files.

Is there something special about these characters? How do I get rid of them?

OK. The problem is that the data contains an unusual unicode character. In R, we would typically escape this character as "\uf0b7", but when inspect() prints it, it encodes it as "<U+F0B7>". Observe:

sample <- c("Crazy \uf0b7 Character")
cp <- Corpus(VectorSource(sample))
inspect(DocumentTermMatrix(cp))
# A document-term matrix (1 documents, 3 terms)
# 
# Non-/sparse entries: 3/0
# Sparsity           : 0%
# Maximal term length: 9 
# Weighting          : term frequency (tf)
# 
#     Terms
# Docs <U+F0B7> character crazy
#    1        1         1     1

(I actually had to create this output on a Windows machine running R 3.0.2; it worked just fine on a Mac running R 3.1.0.)

Unfortunately, you won't be able to remove this with removeWords, because the regular expression used in that function requires a word boundary on either side of the "word", and this does not appear to be a character that R recognizes as forming a boundary. Observe:

gsub("\uf0b7", "", sample)
# [1] "Crazy  Character"
gsub("\\b\uf0b7\\b", "", sample)
# [1] "Crazy \uf0b7 Character"
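As an aside, both \uf0b7 and \uf0a0 fall in Unicode's Private Use Area (U+E000..U+F8FF), so if you'd rather not enumerate each stray glyph, a character-class range can strip them all in one pass. This is a base-R sketch; it assumes every unwanted character really is a PUA code point:

```r
# Strip any Private Use Area code point (U+E000..U+F8FF) in one pass.
# Assumption: all the stray glyphs in the corpus live in this range.
sample <- "Crazy \uf0b7\uf0a0 Character"
cleaned <- gsub("[\ue000-\uf8ff]", "", sample, perl = TRUE)
cleaned
# [1] "Crazy  Character"
```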

So we can write our own function to use with tm_map. Consider:

removeCharacters <- function(x, characters) {
  gsub(sprintf("(*UCP)(%s)", paste(characters, collapse = "|")), "", x, perl = TRUE)
}

This is basically the removeWords function, just without the boundary conditions. Then we can run:

cp2 <- tm_map(cp, removeCharacters, c("\uf0b7", "\uf0a0"))
inspect(DocumentTermMatrix(cp2))
# A document-term matrix (1 documents, 2 terms)
# 
# Non-/sparse entries: 2/0
# Sparsity           : 0%
# Maximal term length: 9 
# Weighting          : term frequency (tf)
# 
#     Terms
# Docs character crazy
#    1         1     1

And we see that those unicode characters are no longer there.
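One caveat: in newer versions of tm (0.6 and later), a custom function passed to tm_map must be wrapped in content_transformer(), otherwise tm_map returns bare character vectors instead of TextDocument objects and later steps fail. A minimal sketch (the tm_map call is shown as a comment since it needs a corpus; the function itself is plain base R):

```r
# Same removeCharacters as above; with tm >= 0.6 you would call it as:
#   cp2 <- tm_map(cp, content_transformer(removeCharacters), c("\uf0b7", "\uf0a0"))
removeCharacters <- function(x, characters) {
  # (*UCP) makes PCRE treat \uf0b7 etc. as Unicode characters
  gsub(sprintf("(*UCP)(%s)", paste(characters, collapse = "|")), "", x, perl = TRUE)
}

removeCharacters("Crazy \uf0b7 Character", c("\uf0b7", "\uf0a0"))
# [1] "Crazy  Character"
```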
