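Error when reading a zipped ".txt" file from an https website with the fread() function
Hi all,
I am trying to read a zipped ".txt" file from an https website with the fread() function, but I get an error.
I also tried reading the zip file after downloading it, but I got the same error. Any ideas on how to solve this?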
fileUrl <- "https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip"
dt <- fread(fileUrl)
Error in fread(fileUrl) :
Internal error: invalid head position. jump=1, headPos=0000020B75510005, thisJumpStart=0000020B7560C040, sof=0000020B75510000
### tried read locally after download too:
dt <- fread("Dataset.zip")
But I received the same error message.
### unzipped, the file is read without error:
dt <- fread("household_power_consumption.txt")
str(dt)
Classes ‘data.table’ and 'data.frame': 2075259 obs. of 9 variables:
$ Date : chr "16/12/2006" "16/12/2006" "16/12/2006" "16/12/2006" ...
$ Time : chr "17:24:00" "17:25:00" "17:26:00" "17:27:00" ...
$ Global_active_power : chr "4.216" "5.360" "5.374" "5.388" ...
$ Global_reactive_power: chr "0.418" "0.436" "0.498" "0.502" ...
$ Voltage : chr "234.840" "233.630" "233.290" "233.740" ...
$ Global_intensity : chr "18.400" "23.000" "23.000" "23.000" ...
$ Sub_metering_1 : chr "0.000" "0.000" "0.000" "0.000" ...
$ Sub_metering_2 : chr "1.000" "1.000" "2.000" "1.000" ...
$ Sub_metering_3 : num 17 16 17 17 17 17 17 17 17 16 ...
- attr(*, ".internal.selfref")=<externalptr>
fread does not read .zip files automatically, but you can unzip them from within R in a cross-platform way:
tmp_dir = tempdir()
tmp = tempfile(tmpdir = tmp_dir)
download.file(fileUrl, tmp)
outf = unzip(tmp, list = TRUE)$Name
unzip(tmp, outf, exdir = tmp_dir)
fread(file.path(tmp_dir, outf))[1:10]
Date Time Global_active_power Global_reactive_power Voltage
1: 16/12/2006 17:24:00 4.216 0.418 234.840
2: 16/12/2006 17:25:00 5.360 0.436 233.630
3: 16/12/2006 17:26:00 5.374 0.498 233.290
4: 16/12/2006 17:27:00 5.388 0.502 233.740
5: 16/12/2006 17:28:00 3.666 0.528 235.680
6: 16/12/2006 17:29:00 3.520 0.522 235.020
7: 16/12/2006 17:30:00 3.702 0.520 235.090
8: 16/12/2006 17:31:00 3.700 0.520 235.220
9: 16/12/2006 17:32:00 3.668 0.510 233.990
10: 16/12/2006 17:33:00 3.662 0.510 233.860
Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
1: 18.400 0.000 1.000 17
2: 23.000 0.000 1.000 16
3: 23.000 0.000 2.000 17
4: 23.000 0.000 1.000 17
5: 15.800 0.000 1.000 17
6: 15.000 0.000 2.000 17
7: 15.800 0.000 1.000 17
8: 15.800 0.000 1.000 17
9: 15.800 0.000 1.000 17
10: 15.800 0.000 2.000 16
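The list, extract, and read steps above can be wrapped into a small reusable function. The following is only a minimal sketch: fread_zip is a hypothetical helper name (it is not part of data.table), it assumes the archive is already on disk, and it reads the first member of the archive unless another member name is given.

```r
library(data.table)

# Hypothetical helper, not part of data.table: read one member of a local
# zip archive by extracting it to a temporary directory first.
fread_zip <- function(zip_path, member = NULL, ...) {
  # List the archive contents without extracting anything
  members <- unzip(zip_path, list = TRUE)$Name
  # Assumption: default to the first member when none is requested
  if (is.null(member)) member <- members[1]
  if (!member %in% members) {
    stop("'", member, "' was not found in ", zip_path)
  }
  # Extract into a fresh temporary directory that is removed on exit
  out_dir <- tempfile("unzipped_")
  dir.create(out_dir)
  on.exit(unlink(out_dir, recursive = TRUE), add = TRUE)
  extracted <- unzip(zip_path, files = member, exdir = out_dir)
  # Extra arguments (sep, na.strings, ...) are passed through to fread
  fread(extracted, ...)
}

# Possible usage with the URL from the question (needs network access):
# tmp <- tempfile(fileext = ".zip")
# download.file(fileUrl, tmp, mode = "wb")
# dt <- fread_zip(tmp)
```

Extracting into its own temporary directory keeps the unzipped copy from accumulating on disk, since fread has already loaded the data into memory by the time the function returns.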
Just a quick update: you can use a shell command inside fread to extract the file, like this:
url = "https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip"
download.file(url, dest = "./household_power_c.zip", mode = "wb")
dt <- data.table::fread(cmd = "unzip -cq ./household_power_c.zip")
Output:
> str(dt)
Classes ‘data.table’ and 'data.frame': 2075259 obs. of 9 variables:
$ Date : chr "16/12/2006" "16/12/2006" "16/12/2006" "16/12/2006" ...
$ Time : chr "17:24:00" "17:25:00" "17:26:00" "17:27:00" ...
$ Global_active_power : chr "4.216" "5.360" "5.374" "5.388" ...
$ Global_reactive_power: chr "0.418" "0.436" "0.498" "0.502" ...
$ Voltage : chr "234.840" "233.630" "233.290" "233.740" ...
$ Global_intensity : chr "18.400" "23.000" "23.000" "23.000" ...
$ Sub_metering_1 : chr "0.000" "0.000" "0.000" "0.000" ...
$ Sub_metering_2 : chr "1.000" "1.000" "2.000" "1.000" ...
$ Sub_metering_3 : num 17 16 17 17 17 17 17 17 17 16 ...
- attr(*, ".internal.selfref")=<externalptr>
>
Using a shell command is very convenient, and you can explore all the options of the unzip command (see $ man unzip), for example, to extract only one file:
url <- "http://www.bls.gov/cex/pumd/data/comma/diary14.zip"
download.file(url, dest = "dataset.zip", mode="wb")
shc = 'unzip -cq dataset.zip diary14/expd141.csv' # shell command to extract one file of many files within the zip directory
zd <- data.table::fread(cmd = shc)
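When the exact path of a member inside the archive is not known in advance, it may help to list the contents from R and build the shell command from that. This is a rough sketch under some assumptions: zip_member_cmd is a hypothetical helper name, and it assumes the unzip command-line tool is available on the PATH, as in the example above.

```r
library(data.table)

# Hypothetical helper: build the command string for fread(cmd = ...)
# after checking that the requested member exists in the archive.
zip_member_cmd <- function(zip_path, member) {
  # unzip(list = TRUE) returns a data.frame with Name, Length and Date
  members <- unzip(zip_path, list = TRUE)$Name
  if (!member %in% members) {
    stop("'", member, "' was not found in ", zip_path)
  }
  # shQuote() protects paths that contain spaces or special characters
  paste("unzip -cq", shQuote(zip_path), shQuote(member))
}

# Possible usage with the archive from the example above:
# unzip("dataset.zip", list = TRUE)$Name   # inspect the member names
# zd <- fread(cmd = zip_member_cmd("dataset.zip", "diary14/expd141.csv"))
```

Checking the member name first gives a clear R error for a misspelled path instead of leaving fread to parse whatever the shell command prints.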
For details on using command-line tools with fread, see this link:
https://github.com/Rdatatable/data.table/wiki/Convenience-features-of-fread#1-using-command-line-tools-directly