R - How to read a ".TAB" file



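I'm trying to find a way to retrieve data from the Harvard Dataverse website through R. I'm using the "dataverse" and "dvn" packages, among others. Many of the data files end in ".tab", although they are not formatted as ordinary tab-delimited text.

Here is what I have done: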

library(dataverse)   
## 01. Using the dataverse server and making a search
Sys.setenv("DATAVERSE_SERVER" ="dataverse.harvard.edu")
## 02. Loading the dataset that I chose, by url
doi_url <- "https://doi.org/10.7910/DVN/ZTCWYQ"
my_dataset <- get_dataset(doi_url)
## 03. Grabbing the first file of the dataset
## which is named "001_AppendixC.tab"
my_files <- my_dataset$files$label
my_file <- get_file(my_files[1], doi_url)
AppendixC <- tempfile()
writeBin(my_file, AppendixC)
read.table(AppendixC)
> Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
> line 1 did not have 2 elements
> In addition: Warning message:
> In read.table(AppendixC) :
> line 1 appears to contain embedded nulls

Any hints?

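The problem is that dataverse::get_file() returns the file as a raw binary vector. The easiest way to load it into memory is to write it to a temporary file with writeBin() and then read that file with the appropriate import/read function.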
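Before choosing a reader, it can help to look at the first few bytes of the raw vector, since an extension such as ".tab" does not guarantee the contents. The sketch below is only an illustration in base R: guess_raw_type() is a hypothetical helper (not part of dataverse or rio), it checks just two well-known signatures (zip containers, which .xlsx files use, and the legacy OLE2 container used by .xls), and the input bytes are stand-ins rather than a real download.

```r
# Hypothetical helper: guess a container format from the leading "magic" bytes.
guess_raw_type <- function(raw_vec) {
  stopifnot(is.raw(raw_vec))
  head_bytes <- raw_vec[seq_len(min(4L, length(raw_vec)))]
  zip_magic <- as.raw(c(0x50, 0x4b, 0x03, 0x04))  # "PK\003\004": zip, hence xlsx
  ole_magic <- as.raw(c(0xd0, 0xcf, 0x11, 0xe0))  # OLE2 container, e.g. old xls
  if (identical(head_bytes, zip_magic)) {
    "zip (possibly xlsx)"
  } else if (identical(head_bytes, ole_magic)) {
    "ole2 (possibly xls)"
  } else {
    "unknown (possibly plain text)"
  }
}

# Stand-in inputs (assumptions for illustration, not real Dataverse files)
fake_xlsx <- as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x14, 0x00))
fake_text <- charToRaw("a\tb\n1\t2\n")

guess_raw_type(fake_xlsx)
guess_raw_type(fake_text)
```

A zip signature would also be consistent with the "embedded nulls" warning from read.table(), because zip archives are binary and not line-oriented text.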

Here is a function that should read it into memory automatically:

# Uses rio, which automatically chooses the appropriate import/read
# function based on file type.
library(rio)
install_formats()  # only needs to run once after pkg installation

load_raw_file <- function(raw, type) {
  # Stop early if `type` is not one of the supported extensions
  match.arg(
    arg = type,
    choices = c(
      "csv", "tab", "psc", "tsv", "sas7bdat",
      "sav", "dta", "xpt", "por", "xls", "xlsx",
      "R", "RData", "rda", "rds", "rec", "mtb",
      "feather", "csv.gz", "fwf"
    )
  )
  # rio picks the reader from the file extension, so give the temp file one
  tmp <- tempfile(fileext = paste0(".", type))
  writeBin(as.vector(raw), tmp)
  out <- import(tmp)
  unlink(tmp)  # remove the temp file once the data is in memory
  out
}
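The write-then-read round trip that this function relies on can also be sketched with base R alone, which may make the mechanism easier to see. In the example below the raw vector is built by hand from a small tab-delimited string (a stand-in, not the real Dataverse file), and read.delim() is used because this made-up content really is tab-delimited.

```r
# Stand-in for the raw vector that get_file() would return (assumption:
# here the bytes really are tab-delimited text).
raw_bytes <- charToRaw("Country\tUN_9193\nAfghanistan\t37.4\nAlbania\t7.7\n")

tmp <- tempfile(fileext = ".tab")
writeBin(raw_bytes, tmp)                        # dump the bytes to disk unchanged
demo <- read.delim(tmp, stringsAsFactors = FALSE)
unlink(tmp)                                     # clean up the temp file

str(demo)
```

For a file whose contents are not tab-delimited text, the read.delim() call would have to be swapped for a reader that matches the actual format.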

Let's try it on your file, which is actually an Excel file.

library(dataverse)
raw <- get_file(
  "001_AppendixC.tab",
  "https://doi.org/10.7910/DVN/ZTCWYQ"
)
data <- load_raw_file(raw, "xlsx")

And look at the data:

str(data)
> 'data.frame': 132 obs. of  17 variables:
>  $ Country  : chr  "Afghanistan" "Albania" "Algeria" "Angola" ...
>  $ UN_9193  : chr  "37.4" "7.7" "9.1" "65.400000000000006" ...
>  $ UN_9901  : chr  "46.1" "7.2" "10.7" "50" ...
>  $ UN_0709  : chr  "24.6" "9.6999999999999993" "7.5" "23.7" ...
>  $ UN_1416  : chr  "23" "4.9000000000000004" "4.5999999999999996" "14" ...
>  $ stu90_94 : chr  "51.3" "37.200000000000003" "22.9" "52.9" ...
>  $ stu98_02 : chr  "54.7" "39.200000000000003" "23.6" "47.1" ...
>  $ stu06_10 : chr  "51.3" "23.1" "13.2" "29.2" ...
>  $ stu12_16 : chr  "40.9" "17.899999999999999" "11.7" "37.6" ...
>  $ wast90_94: chr  "11.5" "9.4" "7.1" "7.9" ...
>  $ wast98_02: chr  "13.4" "12.2" "3.1" "8.6999999999999993" ...
>  $ wast06_10: chr  "8.9" "9.4" "4.0999999999999996" "8.1999999999999993" ...
>  $ wast12_16: chr  "9.5" "6.2" "4.0999999999999996" "4.9000000000000004" ...
>  $ UM1992   : chr  "16.8" "3.7" "4.5" "22.6" ...
>  $ UM2000   : chr  "13.7" "2.6" "4" "21.7" ...
>  $ UM2008   : chr  "11" "1.8" "2.9" "19.2" ...
>  $ UM2015   : chr  "9.1" "1.4" "2.6" "15.7" ...
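In the output above every column, including the measurements, was imported as character. If numeric columns are needed, one possible follow-up is to convert everything except Country with as.numeric(). The sketch below uses a small hand-made data frame (sample_data, a stand-in with a few values copied from the output above) so that it runs without downloading anything. It assumes the columns contain only number-like strings; any other entry would become NA with a warning.

```r
# Stand-in for the imported data: all columns are character, as in str(data)
sample_data <- data.frame(
  Country = c("Afghanistan", "Albania"),
  UN_9193 = c("37.4", "7.7"),
  UN_9901 = c("46.1", "7.2"),
  stringsAsFactors = FALSE
)

# Convert every column except the country name to numeric
num_cols <- setdiff(names(sample_data), "Country")
sample_data[num_cols] <- lapply(sample_data[num_cols], as.numeric)

str(sample_data)
```

Strings such as "65.400000000000006" are floating-point representations of the stored values, so rounding after conversion may be worth considering.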
