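I am trying to import a CSV file from a URL in R. The file contains lines that begin with specific strings ('<<<<<<< HEAD', '=======' or '>>>>>>> master'). The lines containing these strings occur at random positions. I would like to skip these lines and import the rest of the document. Is there a way to do this? I would prefer to use fread to import the data. Thanks for your input.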
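By default the data does not load: the import stops at the first occurrence of one of these strings (line 347 of the CSV). The URL I am trying to download the data from is "https://raw.githubusercontent.com/RamiKrispin/coronavirus/master/csv/coronavirus.csv", and the message it raises is as follows: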
[0%] Downloaded 0 bytes...
Warning message:
In data.table::fread("https://raw.githubusercontent.com/RamiKrispin/coronavirus/master/csv/coronavirus.csv", :
Stopped early on line 347. Expected 7 fields but found 1. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<<<<<<<< HEAD>>
The code I used to download the data is:
covid_ds <- data.table::fread('https://raw.githubusercontent.com/RamiKrispin/coronavirus/master/csv/coronavirus.csv')
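You can read the data with read.csv and fill = TRUE, keep only the rows whose date column holds date-formatted values (which drops values such as '<<<<<<< HEAD' or '======='), and then use type_convert to convert the columns to their proper types.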
data <- read.csv('https://raw.githubusercontent.com/RamiKrispin/coronavirus/master/csv/coronavirus.csv', fill = TRUE)
data <- data[grepl('\\d+-\\d+-\\d+', data$date), ]
data <- readr::type_convert(data)
data
# date province country lat long type cases
# <date> <chr> <chr> <dbl> <dbl> <chr> <int>
# 1 2020-01-22 NA Afghanistan 33.9 67.7 confirmed 0
# 2 2020-01-23 NA Afghanistan 33.9 67.7 confirmed 0
# 3 2020-01-24 NA Afghanistan 33.9 67.7 confirmed 0
# 4 2020-01-25 NA Afghanistan 33.9 67.7 confirmed 0
# 5 2020-01-26 NA Afghanistan 33.9 67.7 confirmed 0
# 6 2020-01-27 NA Afghanistan 33.9 67.7 confirmed 0
# 7 2020-01-28 NA Afghanistan 33.9 67.7 confirmed 0
# 8 2020-01-29 NA Afghanistan 33.9 67.7 confirmed 0
# 9 2020-01-30 NA Afghanistan 33.9 67.7 confirmed 0
#10 2020-01-31 NA Afghanistan 33.9 67.7 confirmed 0
# … with 287,772 more rows
With data.table::fread, you can use blank.lines.skip=TRUE.
data <- data.table::fread('https://raw.githubusercontent.com/RamiKrispin/coronavirus/master/csv/coronavirus.csv', blank.lines.skip=TRUE)
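If the marker lines still stop fread (blank.lines.skip is documented as ignoring blank lines only), one alternative is to remove the markers before parsing. The following is a minimal sketch, not a tested fix for this particular file: it assumes every unwanted line starts with '<<<<<<<', '=======' or '>>>>>>>', and it uses a small made-up sample (raw_lines) in place of the download so that it runs offline.

```r
library(data.table)

# Hypothetical sample standing in for the real file. For the actual data,
# the lines could be obtained with readLines() on the URL instead.
raw_lines <- c(
  "date,province,country,lat,long,type,cases",
  "2020-01-22,,Afghanistan,33.9,67.7,confirmed,0",
  "<<<<<<< HEAD",
  "2020-01-23,,Afghanistan,33.9,67.7,confirmed,0",
  "=======",
  "2020-01-24,,Afghanistan,33.9,67.7,confirmed,0",
  ">>>>>>> master"
)

# Flag lines that start with a Git conflict marker
# (assumption: markers always sit at the very start of a line).
is_marker <- grepl("^(<<<<<<<|=======|>>>>>>>)", raw_lines)

# Keep everything else and let fread parse the remaining lines.
clean_lines <- raw_lines[!is_marker]
covid_ds <- fread(text = clean_lines)
```

Filtering before parsing keeps fread as the reader, as the question prefers, and lets it detect column types on its own, so no separate type conversion step should be needed. The trade-off is that the whole file is first held in memory as a character vector.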