I am trying to read the Carnegie Mellon University Pronouncing Dictionary into a data frame in R. This works and gives me a data frame:
url <- "http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict-0.7b"
library(RCurl)
answer <- RCurl::getURL(url)
dictionary <- as.vector(unlist(strsplit(answer, "\n")))
dictionary <- gsub("  ", "\t", dictionary)
dictionary.df <- read.table(text = dictionary, header=FALSE, skip =150, sep = "\t")
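However, the dictionary entries start after line 54, so the skip argument should be 54. Special characters in lines 54 to 150 seem to cause the errors below.
For example:
> dictionary.df <- read.table(text = dictionary, header=FALSE, skip =54, sep = "\t")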
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 2 elements
> dictionary.df <- read.table(text = dictionary, header=FALSE, skip =120, sep = "\t")
Warning messages:
1: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
2: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
number of items read is not a multiple of the number of columns
Is there a quick way to avoid this kind of error, for example by escaping the special characters?
Thank you very much for your help!
Ludovic
fread from the data.table package seems well suited here.
library(data.table)
dt_dic <- fread(url, skip=56, sep=NULL, header = FALSE, col.names="Item")
dt_dic[, c("Item", "Pronunciation") := tstrsplit(Item, "  ")]
dt_dic
        Item               Pronunciation
     1: !EXCLAMATION-POINT EH2 K S K L AH0 M EY1 SH AH0 N P OY2 N T
     2: "CLOSE-QUOTE       K L OW1 Z K W OW1 T
     3: "DOUBLE-QUOTE      D AH1 B AH0 L K W OW1 T
     4: "END-OF-QUOTE      EH1 N D AH0 V K W OW1 T
     5: "END-QUOTE         EH1 N D K W OW1 T
    ---
133850: {BRACE             B R EY1 S
133851: {LEFT-BRACE        L EH1 F T B R EY1 S
133852: {OPEN-BRACE        OW1 P EH0 N B R EY1 S
133853: }CLOSE-BRACE       K L OW1 Z B R EY1 S
133854: }RIGHT-BRACE       R AY1 T B R EY1 S
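I think this is an XY problem.
You already have the data in memory in the character vector dictionary, and you want to convert it to a data frame. You are trying to do this with read.table, but read.table is struggling with some of the special characters in the vector. Rather than looking for a way to force read.table to do the job, why not split the strings at the double space and bind the pieces into a data frame?
When I downloaded the file, the header took up 56 lines rather than 54, so we drop those lines and then call strsplit on the double space in the remaining lines, without first converting it to a \t character. We then use as.data.frame(do.call("rbind", ...)) on the resulting list to get our data frame.
Here's a reprex: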
url <- "http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict-0.7b"
answer <- RCurl::getURL(url)
dictionary <- as.vector(unlist(strsplit(answer, "\n")))
dictionary.df <- strsplit(dictionary[-seq(56)], "  ")
dictionary.df <- as.data.frame(do.call("rbind", dictionary.df), stringsAsFactors = FALSE)
names(dictionary.df) <- c("Item", "Pronunciation")
head(dictionary.df)
#>   Item               Pronunciation
#> 1 !EXCLAMATION-POINT EH2 K S K L AH0 M EY1 SH AH0 N P OY2 N T
#> 2 "CLOSE-QUOTE       K L OW1 Z K W OW1 T
#> 3 "DOUBLE-QUOTE      D AH1 B AH0 L K W OW1 T
#> 4 "END-OF-QUOTE      EH1 N D AH0 V K W OW1 T
#> 5 "END-QUOTE         EH1 N D K W OW1 T
#> 6 "IN-QUOTES         IH1 N K W OW1 T S
Created on 2020-03-09 by the reprex package (v0.3.0)
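As a side note on why read.table struggles: by default it treats " and ' as quote characters and # as a comment character, which is a plausible (though not confirmed here) cause of the errors in the question. The sketch below switches both off on a tiny in-memory sample. The first two sample lines are taken from the output above; the #SAMPLE line is made up for illustration.

```r
# Small sample in the dictionary's "WORD<two spaces>PHONEMES" layout.
# The "#SAMPLE" entry is invented for illustration only.
sample_lines <- c(
  "\"CLOSE-QUOTE  K L OW1 Z K W OW1 T",
  "{BRACE  B R EY1 S",
  "#SAMPLE  S AE1 M P AH0 L"
)

# Replace only the first double space with a tab so the line splits in two.
tabbed <- sub("  ", "\t", sample_lines, fixed = TRUE)

# quote = "" and comment.char = "" tell read.table to treat quotes and '#'
# as ordinary characters instead of string delimiters or comment starts.
sample_df <- read.table(
  text = tabbed, sep = "\t", header = FALSE,
  quote = "", comment.char = "",
  col.names = c("Item", "Pronunciation"),
  stringsAsFactors = FALSE
)
print(sample_df)
```

Whether this is enough for the full file has not been checked here; the strsplit approach above avoids the question entirely.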
I wanted to compare the proposed solutions by timing them:
dic.url <- "http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict-0.7b"
function1 <- function(dic.url){
  start_time <- Sys.time()
  library(data.table)
  # Read each line as a single column, skipping the 56-line header
  dic <- fread(dic.url, skip=56, sep=NULL, header = FALSE, col.names="Item")
  # Split on the double space that separates the word from its pronunciation
  dic[, c("Item", "Pronunciation") := tstrsplit(Item, "  ")]
  end_time <- Sys.time()
  time <- end_time - start_time
  print(time)
  return(dic)
}
function2 <- function(dic.url){
  start_time <- Sys.time()
  answer <- RCurl::getURL(dic.url)
  # Break the downloaded text into lines
  dic <- as.vector(unlist(strsplit(answer, "\n")))
  # Drop the 56-line header, then split each line on the double space
  dic <- strsplit(dic[-seq(56)], "  ")
  dic <- as.data.frame(do.call("rbind", dic), stringsAsFactors = FALSE)
  names(dic) <- c("Item", "Pronunciation")
  end_time <- Sys.time()
  time <- end_time - start_time
  print(time)
  return(dic)
}
dic <- function1(dic.url)
dic <- function2(dic.url)
Some results:
> dic <- function1(dic.url)
Time difference of 2.627239 secs
> dic <- function2(dic.url)
Time difference of 3.394491 secs
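Both timings above include the download, so network variability is part of the numbers. A rough sketch of timing only the parsing step with system.time() follows. The parse_lines helper and the repeated sample lines are made up for illustration and only mirror the split-and-rbind step of function2, so the result says nothing about the real file.

```r
# Made-up workload: two dictionary-style lines repeated many times.
sample_lines <- rep(
  c("\"CLOSE-QUOTE  K L OW1 Z K W OW1 T", "{BRACE  B R EY1 S"),
  times = 5000
)

# Hypothetical helper: the split-and-rbind step of function2, without the download.
parse_lines <- function(lines) {
  parts <- strsplit(lines, "  ", fixed = TRUE)
  out <- as.data.frame(do.call("rbind", parts), stringsAsFactors = FALSE)
  names(out) <- c("Item", "Pronunciation")
  out
}

# system.time() reports user, system and elapsed time for the expression.
timing <- system.time(parsed <- parse_lines(sample_lines))
print(timing[["elapsed"]])
```

For a fairer comparison one would run each approach several times on the same already-downloaded data and compare the elapsed times.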