How to escape special characters in read.table in R



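I am trying to load the Carnegie Mellon University Pronouncing Dictionary into a data frame in R. The following works and produces a data frame: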

url <- "http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict-0.7b"
library(RCurl)
answer <- RCurl::getURL(url)
dictionary <- as.vector(unlist(strsplit(answer, "\n")))
dictionary <- gsub("  ", "\t", dictionary)
dictionary.df <- read.table(text = dictionary, header=FALSE, skip =150, sep = "\t")

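However, the dictionary content starts after line 54, so the value of the skip argument should be 54. Lines 54 to 150 seem to contain special characters that cause the errors below.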

For example:

> dictionary.df <- read.table(text = dictionary, header=FALSE, skip =54, sep = "\t")
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
line 1 did not have 2 elements

> dictionary.df <- read.table(text = dictionary, header=FALSE, skip =120, sep = "\t")
Warning messages:
1: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
EOF within quoted string
2: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
number of items read is not a multiple of the number of columns

Is there a quick way to escape these characters and avoid the errors?

Thank you very much for your help!

Ludovic

fread from the data.table package seems well suited here.

library(data.table)
dt_dic <- fread(url, skip=56, sep=NULL, header = FALSE, col.names="Item")
dt_dic[, c("Item", "Pronunciation") := tstrsplit(Item, "  ")]
dt_dic
Item                            Pronunciation
1: !EXCLAMATION-POINT EH2 K S K L AH0 M EY1 SH AH0 N P OY2 N T
2:       "CLOSE-QUOTE                      K L OW1 Z K W OW1 T
3:      "DOUBLE-QUOTE                  D AH1 B AH0 L K W OW1 T
4:      "END-OF-QUOTE                  EH1 N D AH0 V K W OW1 T
5:         "END-QUOTE                        EH1 N D K W OW1 T
---                                                            
133850:             {BRACE                                B R EY1 S
133851:        {LEFT-BRACE                      L EH1 F T B R EY1 S
133852:        {OPEN-BRACE                    OW1 P EH0 N B R EY1 S
133853:       }CLOSE-BRACE                      K L OW1 Z B R EY1 S
133854:       }RIGHT-BRACE                        R AY1 T B R EY1 S

I think this is an XY problem.

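You already have the data in memory in the character vector dictionary, and you want to turn it into a data frame. You are trying to do this with read.table, but it fails because read.table struggles with some of the special characters in the vector. Rather than trying to find a way to force read.table to do the job, why not split the strings at the double space and bind the pieces into a data frame?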
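As background on why read.table struggles: by default it treats " and ' as quote characters and # as a comment character, which fits the "EOF within quoted string" warning in the question. If you did want to stay with read.table, a minimal sketch would be to switch both off. The sample lines below are hand-written in the dictionary's style rather than copied from the file, and sample_lines / sample_df are names made up for this illustration.

```r
# Hand-written stand-ins for dictionary lines, already tab-separated
# (as they would be after the gsub step in the question).
sample_lines <- c(
  "\"CLOSE-QUOTE\tK L OW1 Z K W OW1 T",
  "#HASH-MARK\tHH AE1 SH M AA2 R K",
  "'APOSTROPHE\tAH0 P AA1 S T R AH0 F IY0",
  "A\tAH0"
)

# quote = "" and comment.char = "" make read.table treat
# ", ' and # as ordinary text instead of quotes/comments.
sample_df <- read.table(
  text = sample_lines,
  header = FALSE,
  sep = "\t",
  quote = "",
  comment.char = "",
  stringsAsFactors = FALSE,
  col.names = c("Item", "Pronunciation")
)
print(sample_df)
```

I have not run this against the full dictionary, so treat it as a starting point rather than a verified fix.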

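When I downloaded the file, the header took up 56 lines rather than 54, so we drop those lines and then call strsplit on the double space in the remaining lines, without first converting it to a \t character. We then use as.data.frame(do.call("rbind", ...)) on the resulting list to get our data frame.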

Here is a reprex:

url <- "http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict-0.7b"
answer <- RCurl::getURL(url)
dictionary <- as.vector(unlist(strsplit(answer, "\n")))
dictionary.df <- strsplit(dictionary[-seq(56)], "  ")
dictionary.df <- as.data.frame(do.call("rbind", dictionary.df), stringsAsFactors = FALSE)
names(dictionary.df) <- c("Item", "Pronunciation")
head(dictionary.df)
#>                 Item                            Pronunciation
#> 1 !EXCLAMATION-POINT EH2 K S K L AH0 M EY1 SH AH0 N P OY2 N T
#> 2       "CLOSE-QUOTE                      K L OW1 Z K W OW1 T
#> 3      "DOUBLE-QUOTE                  D AH1 B AH0 L K W OW1 T
#> 4      "END-OF-QUOTE                  EH1 N D AH0 V K W OW1 T
#> 5         "END-QUOTE                        EH1 N D K W OW1 T
#> 6         "IN-QUOTES                        IH1 N K W OW1 T S

Created on 2020-03-09 by the reprex package (v0.3.0)
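Since the header length differed between downloads (54 vs. 56 lines), a slightly more defensive variant might drop header lines by their prefix instead of by count, and check that every entry really splits into two pieces before binding. This is only a sketch: it assumes the header lines start with ";;;", which you should verify against your copy of the file, and it runs on three entries taken from the output above rather than on the download.

```r
# A few lines in the file's layout; the ";;;" header prefix is an assumption.
sample_lines <- c(
  ";;; header comment",
  "!EXCLAMATION-POINT  EH2 K S K L AH0 M EY1 SH AH0 N P OY2 N T",
  "\"CLOSE-QUOTE  K L OW1 Z K W OW1 T",
  "{BRACE  B R EY1 S"
)

# Drop header lines by prefix rather than by a hard-coded line count.
entries <- sample_lines[!startsWith(sample_lines, ";;;")]

# fixed = TRUE splits on the literal double space (no regex involved).
parts <- strsplit(entries, "  ", fixed = TRUE)

# rbind would silently recycle short rows, so fail early instead.
stopifnot(all(lengths(parts) == 2))

sample_df <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(sample_df) <- c("Item", "Pronunciation")
print(sample_df)
```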

I wanted to compare the proposed solutions by timing them:

dic.url <- "http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict-0.7b"
# Solution 1: fread + tstrsplit (data.table)
function1 <- function(dic.url){
  start_time <- Sys.time()
  library(data.table)
  dic <- fread(dic.url, skip=56, sep=NULL, header = FALSE, col.names="Item")
  dic[, c("Item", "Pronunciation") := tstrsplit(Item, "  ")]
  end_time <- Sys.time()
  time <- end_time - start_time
  print(time)
  return(dic)
}
# Solution 2: RCurl download + strsplit + rbind (base R)
function2 <- function(dic.url){
  start_time <- Sys.time()
  answer <- RCurl::getURL(dic.url)
  dic <- as.vector(unlist(strsplit(answer, "\n")))
  dic <- strsplit(dic[-seq(56)], "  ")
  dic <- as.data.frame(do.call("rbind", dic), stringsAsFactors = FALSE)
  names(dic) <- c("Item", "Pronunciation")
  end_time <- Sys.time()
  time <- end_time - start_time
  print(time)
  return(dic)
}
dic <- function1(dic.url)
dic <- function2(dic.url)

Some indicative results:

> dic <- function1(dic.url)
Time difference of 2.627239 secs
> dic <- function2(dic.url)
Time difference of 3.394491 secs
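
Note that both timings above include the download, so they are from a single run and only indicative. If you want to time just the parsing step, base R's system.time() is one option. Below is a rough sketch on in-memory stand-in data; elapsed_seconds and parse_sample are hypothetical helpers written for this illustration, not part of either answer.

```r
# Hypothetical helper: elapsed wall-clock seconds for calling
# a zero-argument function once.
elapsed_seconds <- function(f) {
  unname(system.time(f())["elapsed"])
}

# Stand-in workload: parse a few thousand in-memory lines
# instead of downloading the dictionary.
sample_lines <- rep("{BRACE  B R EY1 S", 5000)
parse_sample <- function() {
  parts <- strsplit(sample_lines, "  ", fixed = TRUE)
  as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
}

elapsed <- elapsed_seconds(parse_sample)
print(elapsed)
```

Repeating the call several times and comparing the spread would give a fairer picture than one measurement.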
