R 中的编码、模式匹配和嘈杂文本问题



我们在使用从网络自动下载的文本进行编码和模式匹配方面遇到了问题。

我们需要一些帮助来了解问题所在以及如何解决它。 就个人而言,我必须承认,在阅读了这么多关于该主题的帖子后,我完全感到困惑:-)

我们的文本有时包括: 1)令人不安的Unicode(我已经读过这个(自动转义Unicode字符),但我不确定它可以通过哪种方式帮助正则表达式)

2)奇怪的引号(例如刻度和双刻度,我们无法自动识别(此页面 https://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html 很有用,但是我们应该如何在代码中应用此代码?

我已经应用了这篇文章中给出的建议:如何在 RStudio 中更改 R 的语言环境?

长话短说,我将提供一个示例(从众多例子中)来展示我们正在遇到的问题类型。

阅读下面的代码(不优雅,但易于阅读):目标是在输入文件"_ansiktstics_corpus.txt"中搜索模式"当前 URL http://[\S]*"并将匹配项保存到输出文件中。

我在底部提供了一个输入文件的代码片段。

我们的文本中使用的语言是瑞典语。我正在Windows 10上工作。

-----启动代码

library(stringr)
rm(list=ls(all=TRUE)) # clear memory
setwd("yourPath”)
seeds_eCare01 <- dir(pattern = "_ansiktstics_corpus.txt") # we have a file list, but for the sake of this example I specify only a file that I attach to allow the reproducibility of the experiment
cat(file="yourPath/urls.csv", append =FALSE)
urlPattern<-"CURRENT URL http://[\S]*" # all the lines containing the following pattern (we are interested in extracting the URLs)
for (f in seeds_eCare01) {
tempData = scan( f, what="character", encoding = "UTF-8", sep="",quote = NULL) 
urlList<-grep("urlPattern", tempData, value = TRUE, perl = TRUE) 
# we tried also with “gsub”, we get the same error
cat(urlList,file="yourPath/urls.csv", sep="n",append = TRUE)
}

-----端代码

控制台输出如下:

---启动控制台输出

Read 13354 items
Warning messages:
1: In grep("urlPattern", tempData, value = TRUE, perl = TRUE) :
input string 18 is invalid UTF-8
2: In grep("urlPattern", tempData, value = TRUE, perl = TRUE) :
input string 19 is invalid UTF-8
3: In grep("urlPattern", tempData, value = TRUE, perl = TRUE) :
input string 4590 is invalid UTF-8
4: In grep("urlPattern", tempData, value = TRUE, perl = TRUE) :
input string 4591 is invalid UTF-8
5: In grep("urlPattern", tempData, value = TRUE, perl = TRUE) :
input string 4593 is invalid UTF-8

---EDN 控制台输出

在文件中找不到模式,尽管存在"当前URL"。

我的默认区域设置是:

> Sys.getlocale()
[1] "LC_COLLATE=Swedish_Sweden.1252;LC_CTYPE=Swedish_Sweden.1252;LC_MONETARY=Swedish_Sweden.1252;LC_NUMERIC=C;LC_TIME=Swedish_Sweden.1252"

现在,让我们进入问题:

我要读取和搜索的文件是使用 encoding=UTF-8 上传

scan( f, what="character", encoding = "UTF-8", sep="",quote = NULL)

但是当我运行以下检查(检查临时数据是否为 UTF-8)时,我得到 FALSE

all(stri_enc_isutf8(tempData))
[1] FALSE
> stri_enc_mark(tempData)
[1] "ASCII" "ASCII" "ASCII" "ASCII" "ASCII" "ASCII" "ASCII" "ASCII" "ASCII" "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "ASCII" "UTF-8" "UTF-8"
[18] "UTF-8" "UTF-8" "ASCII" "ASCII" "ASCII" "ASCII" "UTF-8" "UTF-8" "ASCII" "ASCII" "ASCII" "ASCII" "ASCII" "ASCII" "ASCII" "ASCII" "ASCII"

我也尝试了使用"readlines"和连接(如 Gries 2017 所建议的那样),但我遇到了类似的问题:

(tempData<-readLines(con<-file(f, encoding="UTF-8"), warn=FALSE)); close(con)

当我运行以下内容时,我得到一个 TRUE,但 grep 失败与"扫描"完全一样

all(stri_enc_isutf8(tempData))
[1] TRUE

但是,当我运行以下命令(检查编码)时,我得到了 ascii 和 uft-8 的混合物

stri_enc_mark(tempData)
[1] "ASCII" "ASCII" "ASCII" "ASCII" "ASCII" "ASCII" "UTF-8" "UTF-8" "UTF-8" "ASCII" "ASCII" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8"

问题 1)模式匹配不起作用:为什么? 2)编码不是一致的UTF-8,尽管我们强制它:为什么? 3)我们如何摆脱或转换令人不安的字符? 任何建议,提示或见解将不胜感激。

提前致谢

干杯,玛丽娜

input file: “_ansiktstics_corpus.txt"
<>
<header>
<usergroup>lay/specialized</usergroup>
<annotator's comments>[...]</annotator's comments>
</header>
</>CURRENT URL http://www.aftonbladet.se/halsa/article11873667.ab
”Men det är inte som på film”
 Miriam Jaakola, 30, var 26 år när hon fick sin diagnos. Foto: tomas bergman
Hon varken svär eller säger fula ord — men hon drabbas ofta av tics och har svårt att tygla sina
impulser.
Miriam Jaakola, 30, har Tourettes syndrom.
”Jag har haft tics så länge jag kan minnas. När jag var barn hade jag mycket ansiktstics. Jag blinkade, gapade och gjorde grimaser. Jag brukade kalla det för mina ovanor och tänkte egentligen inte så mycket på det. Det var så mycket annat i mitt liv som var rörigt. Jag har en ganska mild form av Tourettes syndrom, så mina tics är inte så tydliga. Det är både på gott och på ont. Folk som inte vet om det märker oftast inte att jag har tourette. Själv förstod jag det inte heller förrän jag var 26 år och fick min diagnos.
Svär inte
Den vanligaste fördomen är att alla som har Tourettes syndrom svär eller säger fula ord. Men ticsen kan se väldigt olika ut. De ändrar sig också över tid. Det är bara en tredjedel av alla med tourette som säger fula ord under någon period i sitt liv. Jag har aldrig gjort det.
Tourette är en sorts känsla – som ett tryck eller en frustration – som byggs upp inom en och till slut bryter ut i tics. Ungefär som när man nyser eller känner att en nysning är på väg. Jag kan hålla tillbaka mina tics om jag är ute bland folk, men då blir de oftast
värre efteråt.Det fi nns en stark energi i tourette. I dag när jag vet hur jag ska hantera det, kan jag vända den energin till något
positivt, till exempel i mitt jobb. Men tourette kan också ställa till problem. Jag har svårt att koncentrera mig och kontrollera impulser. När jag var yngre kunde jag få blixtsnabba utbrott. Jag minns fortfarande första gången jag reflekterade över det. Jag var runt tio år och stod och pillade med något. Plötsligt kastade jag iväg alla grejer. Hade jag haft en allvarligare form av tourette hade jag säkert skrikit eller fått en spasmurladdning. Jag minns att jag tänkte: Undrar om alla har det så här, eller om det bara är jag?
Skoltiden jobbig
Skoltiden var jättejobbig.

这里的"readlines"还可以,不确定编码会发生什么,但我没有得到任何错误或警告:-)

library(stringr)
rm(list=ls(all=TRUE)) # clear memory
setwd("path")
seeds_eCare01 <- dir(pattern = "_ansiktstics_corpus.txt")# see snippet above
cat("seed;NumOfWebDoc;CumulativeSum",file="outputFile",  sep="n", append =FALSE)

urlPattern<-"<\/>CURRENT URL" 
totURLs<-0
for (f in seeds_eCare01) {
(tempData<-readLines(con<-file(f, encoding="UTF-8"), warn=FALSE)); close(con)
urlList<-grep(urlPattern, tempData, value = TRUE, perl = TRUE) # 
countURLsPerSeed<-length(urlList)
totURLs<-totURLs + countURLsPerSeed
out1<-c(f, countURLsPerSeed,totURLs)
out2<-paste(out1,collapse=";")
cat(out2,file="outputFile", sep="n",append = TRUE)
} 

Wiktor帮助我编写了这段代码。该代码将嘈杂的文本语料库转换为干净的字符串数据集(.csv)

rm(list=ls(all=TRUE))
library(NLP)
library(tm)
# Settings
kNonAnnotatedTextsPath <- "path"  # The folder path for the text content.
kAnnotatedTextsPath <- "path"  # The folder path for the categories. 
kOutputPath <- "path"  # The destination for the output file.
kOutputFileName <- "output.csv"  # The name and format of the output file
kOverwriteOutputFile <- TRUE  # Overwrite or append the content to the output file? TRUE = Overwrite, FALSE = Append.
kWarn <- TRUE  # Warn the user of uncommon categories. 
# Patterns
kStartPattern <- "CURRENT URL"  # The text the occur on the line before the text content.
kBreakPattern <- "<>"  # The text that occur on the line after the text content.
kCategoryPattern <- ".*<usergroup>(.*)</usergroup>.*"  # The text that surrounds the category: (.*)

ExtractCategories <- function(file.name){
# Extracts the categories for a given file. Returns in form of a list vector.
con <- file(paste(kAnnotatedTextsPath, "/", file.name, sep = ""), encoding="UTF-8")
document.sections <- readLines(con, warn=FALSE)
close(con)
document.categories <- vector(mode = "list")
document.names <- c()
for(section in document.sections){
if (grepl(kCategoryPattern, section)){
document.categories <- c(document.categories, gsub(kCategoryPattern, "\1", section))
}
if (grepl(kStartPattern, section)){
document.names <- c(document.names, section)
}
}
names(document.categories) <- document.names
return(document.categories)
}
ExtractDocuments <- function(file, provided.categories){
# Extracts the text content from a given file, and appends the correct category.
# Returns a list of two, one with a list the text content and one list with the corresponding categories.
collect <- FALSE
con <- file(paste(kNonAnnotatedTextsPath, "/", file, sep = ""), encoding="UTF-8")
document.sections <- readLines(con, warn=FALSE)
close(con)
document.string <- ""
document.list <- c()
document.categories <- c()
document.name <- ""
for(section in document.sections){
if(grepl(kStartPattern, section)){
document.name <- section
collect <- TRUE
} else if(collect == TRUE && grepl(kBreakPattern, section)){
document.categories <- c(document.categories, get(document.name, provided.categories))
document.list <- c(document.list, document.string)
document.name <- ""
document.string <- ""
collect <- FALSE
} else if(collect){
document.string <- paste(document.string, section, sep = " ")
}
}
if(nchar(document.string) != 0){
document.categories <- c(document.categories, get(document.name, provided.categories))
document.list <- c(document.list, document.string)
}
return(cbind(c(document.list), c(document.categories)))
}
RemoveMisc <- function(string){
# Removes the following characters: ”, —, –, '
gsub(pattern = "[\x{201d}\x{2014}\x{2013}\x{02B9}]", replacement = "", string, perl = TRUE)
}
RemoveStartSpace <- function(string){
# Removes space at the start of a paragraph.
gsub(pattern = "^[ ]", replacement = "", string, perl = TRUE)
} 
RemoveMultiSpace <- function(string){
# Removes multiple occurances of space in a row,
gsub(pattern = "[ ]{2,}", replacement = " ", string, perl = TRUE)
} 
RemoveWebsites <- function(string){
# Removes the common webpage formates from the text.
gsub(pattern = "(?:(?:(?:http[s]*://)|(?:www\.))+[\S]*)", replacement = "", string, perl = TRUE)
} 
CleanDocuments <- function(documents){
# Cleans the documents of unwanted (combinations of) signs, and replaces uppcarse letters with lowercase.
# Returns the documents as a corpus object.
corpus <- Corpus(VectorSource(documents[, 1]))
meta(corpus, type="indexed", tag="Category") <- documents[, 2]
corpus <- tm_map(corpus, RemoveWebsites)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, RemoveMisc)
corpus <- tm_map(corpus, RemoveStartSpace)
corpus <- tm_map(corpus, RemoveMultiSpace)
corpus <- tm_map(corpus, tolower)
return(corpus)
}
SaveDocuments <- function(corpus, output.file, warn = FALSE){
# Saves the documents to a csv file in the format: '<text>',<category>
counter = 1
while (counter <= length(corpus)){
text <- as.character(corpus[[counter]])
category <- as.character(meta(corpus)[[1]][counter])
if(warn && !(category %in% c("lay", "specialized"))){
print("Warning!")
print(paste("Unusual classification '", category, "'", ", in the following text:", sep = ""))
print(text)
}
padded_text <- paste("'", text, "',", category, sep = "")
write.table(x = padded_text, file = output.file, append = TRUE, sep = "", row.names = FALSE, col.names = FALSE, quote = FALSE)
counter <- counter + 1
}
}
CreateCorpus <- function(overwrite = FALSE){
# Iterates the files and creates the corpus, which is saved as a csv file.
output.file <- paste(kOutputPath, "/", kOutputFileName, sep = "")
seeds <- dir(path = kAnnotatedTextsPath, pattern = "*.txt")
if (overwrite) {
close(file(output.file, open="w"))
}
for (seed in seeds){
document.categories <- ExtractCategories(seed)
document.texts <- ExtractDocuments(seed, document.categories)
corpus <- CleanDocuments(document.texts)
SaveDocuments(corpus, output.file, kWarn)
}
}
CreateCorpus(kOverwriteOutputFile)

最新更新