请查看文件末尾的片段。我正在用箭头和R做我的第一个小步,以处理太大而无法加载到内存中的文件。
我正试图在这里重现的步骤
https://francoismichonneau.net/2022/10/import-big-csv/
它说明了如何在不将大型tsv文件加载到内存的情况下将其转换为镶木地板文件。我知道使用指向您的数据的链接是一种糟糕的做法,但我别无选择,只能提供一个最小的例子,但它仍然失败了。
你可以在这里下载test.tsv文件(一个大得多的文件的小摘录(
https://mega.nz/file/V9FnhazJ#YXBvpuRVQ9s3nVkWCQEaWn0g9Ul6pmtXlap6X7Zkz1E
在reprex中,我有一些错误,但我不明白它们来自哪里,而且它们似乎与数据不兼容。非常感谢您的帮助!
library(tidyverse)
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
data1<- read_tsv("test.tsv") ## data in memory
#> Rows: 10000 Columns: 29
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "t"
#> chr (27): AID_MEASURE_ID, DATE_CREATED, DATE_GRANTED, AA_PUBLISHED_DATE, SER...
#> dbl (2): GRANTED_AMOUNT_FROM_EUR, NOMINAL_AMOUNT_EUR_FROM
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data1
#> # A tibble: 10,000 × 29
#> AID_MEASURE_ID DATE_…¹ DATE_…² AA_PU…³ SERVE…⁴ AM_TI…⁵ AM_TI…⁶ STATUS AM_PR…⁷
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 SA.41416 24/11/… 18/08/… 09/12/… TM-100… NGA Sc… NGA Sc… Publi… N
#> 2 SA.41416 24/11/… 18/08/… 09/12/… TM-100… NGA Sc… NGA Sc… Publi… N
#> 3 SA.41416 24/11/… 18/08/… 09/12/… TM-100… NGA Sc… NGA Sc… Publi… N
#> 4 SA.41416 24/11/… 18/08/… 09/12/… TM-100… NGA Sc… NGA Sc… Publi… N
#> 5 SA.41416 25/11/… 25/08/… 09/12/… TM-100… NGA Sc… NGA Sc… Publi… N
#> 6 SA.41416 25/11/… 29/08/… 09/12/… TM-100… NGA Sc… NGA Sc… Publi… N
#> 7 SA.45235 02/11/… 27/10/… 02/11/… TM-100… 2014–2… Measur… Publi… X
#> 8 SA.45235 02/11/… 27/10/… 02/11/… TM-100… 2014–2… Measur… Publi… X
#> 9 SA.33193 08/11/… 27/10/… 11/01/… TM-100… R&D&I-… R & D … Publi… N
#> 10 SA.42321 16/11/… 01/11/… 01/06/… TM-100… POTENC… POTENT… Publi… X
#> # … with 9,990 more rows, 20 more variables: COFINANCE <chr>, OBJECTIVE <chr>,
#> # OTHER_OBJECTIVE_EN <chr>, AID_INSTRUMENT <chr>,
#> # OTHER_AID_INSTRUMENT_EN <chr>, BENEFICIARY_NAME <chr>,
#> # BENEFICIARY_NAME_ENGLISH <chr>, BENEFICIARY_NATIONAL_ID <chr>,
#> # BENEFICIARY_NAT_ID_TYPE_SD <chr>, BENEFICIARY_TYPE_SD <chr>,
#> # COUNTRY_SD <chr>, REGION_SD <chr>, SECTOR_SD <chr>,
#> # GRANTED_AMOUNT_FROM_EUR <dbl>, NOMINAL_AMOUNT_EUR_FROM <dbl>, …
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
data2 <- open_dataset("test.tsv", format="tsv")
data2
#> FileSystemDataset with 1 csv file
#> AID_MEASURE_ID: string
#> DATE_CREATED: string
#> DATE_GRANTED: string
#> AA_PUBLISHED_DATE: string
#> SERVER_REF: string
#> AM_TITLE: string
#> AM_TITLE_EN: string
#> STATUS: string
#> AM_PROC_TYPE_CD: string
#> COFINANCE: string
#> OBJECTIVE: string
#> OTHER_OBJECTIVE_EN: null
#> AID_INSTRUMENT: string
#> OTHER_AID_INSTRUMENT_EN: string
#> BENEFICIARY_NAME: string
#> BENEFICIARY_NAME_ENGLISH: string
#> BENEFICIARY_NATIONAL_ID: string
#> BENEFICIARY_NAT_ID_TYPE_SD: string
#> BENEFICIARY_TYPE_SD: string
#> COUNTRY_SD: string
#> REGION_SD: string
#> SECTOR_SD: string
#> GRANTED_AMOUNT_FROM_EUR: double
#> NOMINAL_AMOUNT_EUR_FROM: double
#> GRANT_RANGE: string
#> GRANTING_AUTHORITY_NAME: string
#> GRANTING_AUTHORITY_NAME_EN: string
#> NUTS_CD: string
#> GRANTING_AUTHORITY_COUNTRY: string
data3 <- read_tsv_arrow("test.tsv",as_data_frame=F)
##This works
write_dataset(
data3,
format = "parquet",
path = ".",
max_rows_per_file = 1e7
)
print("Done with the first parquet file")
#> [1] "Done with the first parquet file"
## but this does not...why?
write_dataset(
data2,
format = "parquet",
path = ".",
max_rows_per_file = 1e7
)
#> Error: Invalid: In CSV column #11: Row #9107: CSV conversion error to null: invalid value 'SECURITY ARRANGEMENTS'
###Apparently there is an error with the string "Security Arrangements" on line
### 9107. Let us see
data_error <- data1[9107, ] |>
glimpse()
#> Rows: 1
#> Columns: 29
#> $ AID_MEASURE_ID <chr> "SA.33193"
#> $ DATE_CREATED <chr> "27/09/17"
#> $ DATE_GRANTED <chr> "14/09/17"
#> $ AA_PUBLISHED_DATE <chr> "27/09/17"
#> $ SERVER_REF <chr> "TM-10080819"
#> $ AM_TITLE <chr> "R&D&I-scheme Flanders. Prolongation of an …
#> $ AM_TITLE_EN <chr> "R & D & I-scheme Flanders. Extension of an…
#> $ STATUS <chr> "Published"
#> $ AM_PROC_TYPE_CD <chr> "N"
#> $ COFINANCE <chr> "No"
#> $ OBJECTIVE <chr> "Research and development"
#> $ OTHER_OBJECTIVE_EN <chr> NA
#> $ AID_INSTRUMENT <chr> "Direct grant"
#> $ OTHER_AID_INSTRUMENT_EN <chr> NA
#> $ BENEFICIARY_NAME <chr> "INDAVER"
#> $ BENEFICIARY_NAME_ENGLISH <chr> "INDAVER"
#> $ BENEFICIARY_NATIONAL_ID <chr> "0427.973.304"
#> $ BENEFICIARY_NAT_ID_TYPE_SD <chr> "KBO/BCE"
#> $ BENEFICIARY_TYPE_SD <chr> "Only large enterprises"
#> $ COUNTRY_SD <chr> "Belgium"
#> $ REGION_SD <chr> NA
#> $ SECTOR_SD <chr> "E.38.21-Treatment and disposal of non-haza…
#> $ GRANTED_AMOUNT_FROM_EUR <dbl> 644531
#> $ NOMINAL_AMOUNT_EUR_FROM <dbl> NA
#> $ GRANT_RANGE <chr> "Not Range"
#> $ GRANTING_AUTHORITY_NAME <chr> "VLAIO"
#> $ GRANTING_AUTHORITY_NAME_EN <chr> "VLAIO"
#> $ NUTS_CD <chr> NA
#> $ GRANTING_AUTHORITY_COUNTRY <chr> "Belgium"
### but I see nothing about that text on that line
sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 11 (bullseye)
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#>
#> locale:
#> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
#> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
#> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] arrow_9.0.0.2 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.9
#> [5] purrr_0.3.4 readr_2.1.2 tidyr_1.2.0 tibble_3.1.8
#> [9] ggplot2_3.3.6 tidyverse_1.3.1
#>
#> loaded via a namespace (and not attached):
#> [1] lubridate_1.8.0 assertthat_0.2.1 digest_0.6.29 utf8_1.2.2
#> [5] R6_2.5.1 cellranger_1.1.0 backports_1.4.1 reprex_2.0.1
#> [9] evaluate_0.15 httr_1.4.3 highr_0.9 pillar_1.8.0
#> [13] rlang_1.0.4 readxl_1.4.0 R.utils_2.12.0 R.oo_1.25.0
#> [17] rmarkdown_2.14 styler_1.7.0 bit_4.0.4 munsell_0.5.0
#> [21] broom_1.0.0 compiler_4.2.1 modelr_0.1.8 xfun_0.31
#> [25] pkgconfig_2.0.3 htmltools_0.5.2 tidyselect_1.1.2 fansi_1.0.3
#> [29] crayon_1.5.1 tzdb_0.3.0 dbplyr_2.2.0 withr_2.5.0
#> [33] R.methodsS3_1.8.2 grid_4.2.1 jsonlite_1.8.0 gtable_0.3.0
#> [37] lifecycle_1.0.1 DBI_1.1.3 magrittr_2.0.3 scales_1.2.0
#> [41] vroom_1.5.7 cli_3.3.0 stringi_1.7.8 fs_1.5.2
#> [45] xml2_1.3.3 ellipsis_0.3.2 generics_0.1.3 vctrs_0.4.1
#> [49] tools_4.2.1 bit64_4.0.5 R.cache_0.16.0 glue_1.6.2
#> [53] hms_1.1.1 parallel_4.2.1 fastmap_1.1.0 yaml_2.3.5
#> [57] colorspace_2.0-3 rvest_1.0.2 knitr_1.39 haven_2.5.0
由reprex软件包(v2.0.1(于2022-10-21创建
当您在不指定数据类型(模式(的情况下导入数据时,它们是从您的文件中推断出来的。read_tsv_arrow()
和open_dataset()
使用不同的策略来做到这一点:
read_tsv_arrow()
通过查看完整的数据集来实现open_dataset()
通过扫描数据集的前几行来实现
您得到的错误消息告诉您解析器无法将类似文本的内容转换为null
。
open_dataset()
报告的模式告诉您,唯一具有null
类型的列是OTHER_OBJECTIVE_EN
。如果查看read_tsv_arrow()
报告的模式,就会发现OTHER_OBJECTIVE_EN
是string
。这一差异将向您指出错误消息的含义。该列中实际上有文本数据,但它与解析器推断数据类型的内容相去甚远。在您的案例中,您不能依赖open_dataset()
的默认行为,您需要手动指定模式:
data <- open_dataset("test.tsv",
format = "tsv",
skip_rows = 1,
schema = schema(
AID_MEASURE_ID = string(),
DATE_CREATED = string(),
DATE_GRANTED = string(),
AA_PUBLISHED_DATE = string(),
SERVER_REF = string(),
AM_TITLE = string(),
AM_TITLE_EN = string(),
STATUS = string(),
AM_PROC_TYPE_CD = string(),
COFINANCE = string(),
OBJECTIVE = string(),
OTHER_OBJECTIVE_EN = string(),
AID_INSTRUMENT = string(),
OTHER_AID_INSTRUMENT_EN = string(),
BENEFICIARY_NAME = string(),
BENEFICIARY_NAME_ENGLISH = string(),
BENEFICIARY_NATIONAL_ID = string(),
BENEFICIARY_NAT_ID_TYPE_SD = string(),
BENEFICIARY_TYPE_SD = string(),
COUNTRY_SD = string(),
REGION_SD = string(),
SECTOR_SD = string(),
GRANTED_AMOUNT_FROM_EUR = double(),
NOMINAL_AMOUNT_EUR_FROM = double(),
GRANT_RANGE = string(),
GRANTING_AUTHORITY_NAME = string(),
GRANTING_AUTHORITY_NAME_EN = string(),
NUTS_CD = string(),
GRANTING_AUTHORITY_COUNTRY = string()
)
)
write_dataset(
data,
format = "parquet",
path = ".",
max_rows_per_file = 1e7
)
请注意,当使用open_dataset()
指定模式时,需要跳过包含列名的第一行。
在错误消息中,数据类型不正确的行号将被关闭一,因为表头包含在计数中。