r - Error reading a zip file with the fread function from the data.table package



Error reading a zipped ".txt" file from an https website with the fread() function

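Hi all,

I am trying to read a zipped ".txt" file from an https website with the fread() function, but I am getting an error.

I also tried reading the zip file after downloading it, but I get the same error. Any ideas on how to solve this?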

fileUrl <- "https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip"
dt <- fread(fileUrl)
Error in fread(fileUrl) : 
Internal error: invalid head position. jump=1, headPos=0000020B75510005, thisJumpStart=0000020B7560C040, sof=0000020B75510000
### tried read locally after download too:
dt <- fread("Dataset.zip")

But I get the same error message.

### unzipped, the file is read without error:
dt <- fread("household_power_consumption.txt")
str(dt)
Classes ‘data.table’ and 'data.frame':  2075259 obs. of  9 variables:
$ Date                 : chr  "16/12/2006" "16/12/2006" "16/12/2006" "16/12/2006" ...
$ Time                 : chr  "17:24:00" "17:25:00" "17:26:00" "17:27:00" ...
$ Global_active_power  : chr  "4.216" "5.360" "5.374" "5.388" ...
$ Global_reactive_power: chr  "0.418" "0.436" "0.498" "0.502" ...
$ Voltage              : chr  "234.840" "233.630" "233.290" "233.740" ...
$ Global_intensity     : chr  "18.400" "23.000" "23.000" "23.000" ...
$ Sub_metering_1       : chr  "0.000" "0.000" "0.000" "0.000" ...
$ Sub_metering_2       : chr  "1.000" "1.000" "2.000" "1.000" ...
$ Sub_metering_3       : num  17 16 17 17 17 17 17 17 17 16 ...
- attr(*, ".internal.selfref")=<externalptr>

fread does not read .zip files automatically, but you can unzip them from within R in a cross-platform way:

tmp_dir = tempdir()
tmp = tempfile(tmpdir = tmp_dir)
download.file(fileUrl, tmp, mode = "wb")  # binary mode, so the zip is not corrupted on Windows
outf = unzip(tmp, list = TRUE)$Name
unzip(tmp, outf, exdir = tmp_dir)
fread(file.path(tmp_dir, outf))[1:10]
          Date     Time Global_active_power Global_reactive_power Voltage
 1: 16/12/2006 17:24:00               4.216                 0.418 234.840
 2: 16/12/2006 17:25:00               5.360                 0.436 233.630
 3: 16/12/2006 17:26:00               5.374                 0.498 233.290
 4: 16/12/2006 17:27:00               5.388                 0.502 233.740
 5: 16/12/2006 17:28:00               3.666                 0.528 235.680
 6: 16/12/2006 17:29:00               3.520                 0.522 235.020
 7: 16/12/2006 17:30:00               3.702                 0.520 235.090
 8: 16/12/2006 17:31:00               3.700                 0.520 235.220
 9: 16/12/2006 17:32:00               3.668                 0.510 233.990
10: 16/12/2006 17:33:00               3.662                 0.510 233.860
    Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
 1:           18.400          0.000          1.000             17
 2:           23.000          0.000          1.000             16
 3:           23.000          0.000          2.000             17
 4:           23.000          0.000          1.000             17
 5:           15.800          0.000          1.000             17
 6:           15.000          0.000          2.000             17
 7:           15.800          0.000          1.000             17
 8:           15.800          0.000          1.000             17
 9:           15.800          0.000          1.000             17
10:           15.800          0.000          2.000             16
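
The download-free part of the steps above could be wrapped into a small helper that also cleans up the extracted file afterwards. This is only a sketch under a few assumptions: the name fread_zip and its member argument are hypothetical (they are not part of data.table), and when no member is given it simply picks the first non-directory entry in the archive.

```r
library(data.table)

# Hypothetical helper (not part of data.table): extract one member of a
# zip archive into a temporary directory and read it with fread().
fread_zip <- function(zip_path, member = NULL, ...) {
  # List the archive contents without extracting anything
  contents <- utils::unzip(zip_path, list = TRUE)$Name
  contents <- contents[!grepl("/$", contents)]  # drop directory entries
  if (length(contents) == 0L) {
    stop("no files found in ", zip_path)
  }
  if (is.null(member)) {
    member <- contents[1]  # assumption: default to the first file
  } else if (!member %in% contents) {
    stop("'", member, "' not found in ", zip_path)
  }

  # Extract into a fresh temporary directory that is removed on exit
  ex_dir <- tempfile("fread_zip_")
  dir.create(ex_dir)
  on.exit(unlink(ex_dir, recursive = TRUE), add = TRUE)

  utils::unzip(zip_path, files = member, exdir = ex_dir)
  fread(file.path(ex_dir, member), ...)
}
```

With the archive from the question already downloaded, the call would look like dt <- fread_zip("Dataset.zip"); any extra arguments are passed through to fread.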

Just a brief update: you can use a shell command within fread to extract the file, like this:

url = "https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip"
download.file(url, destfile = "./household_power_c.zip", mode = "wb")
dt <- data.table::fread(cmd = "unzip -cq ./household_power_c.zip")

Output:


> str(dt)
Classes ‘data.table’ and 'data.frame':  2075259 obs. of  9 variables:
$ Date                 : chr  "16/12/2006" "16/12/2006" "16/12/2006" "16/12/2006" ...
$ Time                 : chr  "17:24:00" "17:25:00" "17:26:00" "17:27:00" ...
$ Global_active_power  : chr  "4.216" "5.360" "5.374" "5.388" ...
$ Global_reactive_power: chr  "0.418" "0.436" "0.498" "0.502" ...
$ Voltage              : chr  "234.840" "233.630" "233.290" "233.740" ...
$ Global_intensity     : chr  "18.400" "23.000" "23.000" "23.000" ...
$ Sub_metering_1       : chr  "0.000" "0.000" "0.000" "0.000" ...
$ Sub_metering_2       : chr  "1.000" "1.000" "2.000" "1.000" ...
$ Sub_metering_3       : num  17 16 17 17 17 17 17 17 17 16 ...
- attr(*, ".internal.selfref")=<externalptr> 

Using shell commands is very convenient, and you can explore all the options of the unzip command (see man unzip), for example to extract just one file:

url <- "http://www.bls.gov/cex/pumd/data/comma/diary14.zip"
download.file(url, destfile = "dataset.zip", mode = "wb")
shc = 'unzip -cq dataset.zip diary14/expd141.csv' # shell command to extract one file out of the many files in the zip archive
zd <- data.table::fread(cmd = shc)

For details on using command-line tools in fread, see this link:

https://github.com/Rdatatable/data.table/wiki/Convenience-features-of-fread#1-using-command-line-tools-directly
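
When the exact member name is not known in advance, the shell command can be built from the archive listing instead of being typed by hand. The following is a minimal sketch, not a definitive implementation: the function name fread_zip_cmd and its pattern argument are hypothetical, and it assumes an external unzip program is on the PATH (often not the case on Windows), which it checks with Sys.which().

```r
library(data.table)

# Hypothetical helper: read the single archive member whose name matches
# the regular expression `pattern`, streaming it through external `unzip`.
fread_zip_cmd <- function(zip_path, pattern, ...) {
  if (!nzchar(Sys.which("unzip"))) {
    stop("no 'unzip' executable found on the PATH")
  }

  # List member names with R's built-in unzip(); nothing is extracted here
  members <- utils::unzip(zip_path, list = TRUE)$Name
  hit <- grep(pattern, members, value = TRUE)
  if (length(hit) != 1L) {
    stop("expected exactly one match for '", pattern,
         "', found ", length(hit))
  }

  # -c writes the member to stdout, -q suppresses unzip's own messages;
  # shQuote() protects paths that contain spaces or shell metacharacters
  cmd <- paste("unzip -cq", shQuote(zip_path), shQuote(hit))
  fread(cmd = cmd, ...)
}
```

For the archive used above, a call such as zd <- fread_zip_cmd("dataset.zip", "expd141\\.csv$") should select the same file as the hand-written command. Requiring exactly one match is a deliberate choice here, so that an ambiguous pattern fails loudly rather than concatenating several files on stdout.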
