r - Working with zip files in a targets workflow



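I'm trying to set up a workflow that downloads a zip file, extracts its contents, and applies a function to each of the extracted files.

I'm running into a couple of problems:

  1. How do I set up an empty file system reproducibly? That is, I'd like to create an empty directory structure that files will be downloaded into later. Ideally I'd do something like tar_target(my_dir, fs::dir_create("data"), format = "file"), but I know from the documentation that empty directories cannot be used with format = "file". I know I could call dir_create at every point where the directory is needed, but that seems clumsy.

  2. In the reprex below, I'd like to operate on each file separately using pattern = map(x). As the error says, I would need to specify a pattern for the parent target, because it uses format = "file". But if I did specify a pattern for the parent target, I would then need to specify one for its parent in turn. As far as I know, a pattern cannot be set for a target that has no parents (but I have been wrong many times before).

I have a feeling I'm going about this the wrong way. Thanks for your time.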

library(targets)
tar_script({
  tar_option_set(packages = c("tidyverse", "fs"))
  download_file <- function(url, dest) {
    download.file(url, dest)
    dest
  }
  do_stuff <- function(file_path) {
    fs::file_copy(file_path, file_path, overwrite = TRUE)
  }
  list(
    tar_target(
      downloaded_zip,
      download_file(
        "https://file-examples-com.github.io/uploads/2017/02/zip_2MB.zip",
        path(dir_create("data"), "file", ext = "zip")
      ),
      format = "file"
    ),
    tar_target(
      extracted_files,
      unzip(downloaded_zip, exdir = dir_create("data")),
      format = "file"
    ),
    tar_target(
      stuff_done,
      do_stuff(extracted_files),
      pattern = map(extracted_files),
      format = "file",
      iteration = "list"
    )
  )
})
tar_make()
#> * start target downloaded_zip
#> trying URL 'https://file-examples-com.github.io/uploads/2017/02/zip_2MB.zip'
#> Content type 'application/zip' length 2036861 bytes (1.9 MB)
#> ==================================================
#> downloaded 1.9 MB
#> 
#> * built target downloaded_zip
#> * start target extracted_files
#> * built target extracted_files
#> * end pipeline
#> Error : Target stuff_done tried to branch over extracted_files, which is illegal. Patterns must only branch over explicitly declared targets in the pipeline. Stems and patterns are fine, but you cannot branch over branches or global objects. Also, if you branch over a target with format = "file", then that target must also be a pattern.
#> Error: callr subprocess failed: Target stuff_done tried to branch over extracted_files, which is illegal. Patterns must only branch over explicitly declared targets in the pipeline. Stems and patterns are fine, but you cannot branch over branches or global objects. Also, if you branch over a target with format = "file", then that target must also be a pattern.
#> Visit https://books.ropensci.org/targets/debugging.html for debugging advice.

Created on 2021-12-08 by the reprex package (v2.0.1)

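Original answer

Here's an idea: you could track the URL with format = "url" and then make that URL a dependency of all the file branches. With that setup, every branch of files reruns whenever the upstream online data changes. That is fine, because all those branches do is re-hash the files. The branches of stuff_done, on the other hand, should not all rerun if only some of the files actually changed.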
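A minimal sketch of that idea might look like the following. It reuses the URL and the `do_stuff()` name from the question; `download_and_unzip()` is a hypothetical helper, not part of targets, and the `format = "url"` target assumes the curl package and an internet connection are available.

```r
# _targets.R (sketch, not a definitive implementation)
library(targets)

# Hypothetical helper: download the archive to a temporary file and
# return the paths of the extracted files as a plain character vector.
download_and_unzip <- function(url, exdir = "data") {
  zip_path <- tempfile(fileext = ".zip")
  download.file(url, zip_path, mode = "wb")
  dir.create(exdir, showWarnings = FALSE, recursive = TRUE)
  unzip(zip_path, exdir = exdir)
}

# Placeholder for the real per-file work.
do_stuff <- function(file_path) {
  file.info(file_path)
}

list(
  # Tracked as a URL, so downstream targets are invalidated when the
  # remote resource changes.
  tar_target(
    url,
    "https://file-examples-com.github.io/uploads/2017/02/zip_2MB.zip",
    format = "url"
  ),
  # Plain (non-file) target holding the extracted paths, so it can be
  # branched over.
  tar_target(files_bulk, download_and_unzip(url)),
  # One file branch per path. Mentioning `url` makes every branch
  # depend on the URL, so each file is re-hashed when the URL changes.
  tar_target(
    files, {
      url
      files_bulk
    },
    pattern = map(files_bulk),
    format = "file"
  ),
  tar_target(stuff_done, do_stuff(files), pattern = map(files))
)
```

The intermediate `files` pattern is there because the error in the question says a `format = "file"` target can only be branched over if it is itself a pattern.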

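Edit

On second thought, we probably need to hash all the local files in bulk. It is not the most efficient approach, but it gets the job done. targets wants you to use its own built-in storage system rather than external files, so if you can read the data in and return it in a non-file format, dynamic branching will be easier.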

# _targets.R file
library(targets)
tar_option_set(packages = c("tidyverse", "fs"))
download_file <- function(url, dest) {
  download.file(url, dest)
  dest
}
do_stuff <- function(file_path) {
  file.info(file_path)
}
download_and_unzip <- function(url) {
  downloaded_zip <- tempfile()
  download_file(url, downloaded_zip)
  unzip(downloaded_zip, exdir = dir_create("data"))
}
list(
  tar_target(
    url,
    "https://file-examples-com.github.io/uploads/2017/02/zip_2MB.zip",
    format = "url"
  ),
  tar_target(
    files_bulk,
    download_and_unzip(url),
    format = "file"
  ),
  tar_target(file_names, files_bulk), # not a format = "file" target
  tar_target(
    files, {
      files_bulk # Re-hash all the files separately if any file changes.
      file_names
    },
    pattern = map(file_names),
    format = "file"
  ),
  tar_target(stuff_done, do_stuff(files), pattern = map(files))
)
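If the file contents can be read into R, the non-file route mentioned above could be sketched roughly as follows. `read_files_raw()`, `read_zip_contents()`, and `summarize_content()` are hypothetical helpers invented for this illustration, and reading every file as raw bytes is only a stand-in for whatever reader suits the real data.

```r
# _targets.R (sketch under the assumptions stated above)
library(targets)

# Hypothetical helper: read each file into memory as a raw vector and
# return a list named by file name.
read_files_raw <- function(paths) {
  paths <- paths[!dir.exists(paths)] # skip any directory entries
  contents <- lapply(paths, function(p) {
    readBin(p, what = "raw", n = file.size(p))
  })
  names(contents) <- basename(paths)
  contents
}

# Hypothetical helper: download, extract to a temporary directory, and
# return the contents rather than the file paths.
read_zip_contents <- function(url) {
  zip_path <- tempfile(fileext = ".zip")
  download.file(url, zip_path, mode = "wb")
  exdir <- tempfile("unzipped_")
  read_files_raw(unzip(zip_path, exdir = exdir))
}

# Placeholder for the real per-file work: here, just the byte count.
summarize_content <- function(x) {
  length(x)
}

list(
  tar_target(
    url,
    "https://file-examples-com.github.io/uploads/2017/02/zip_2MB.zip",
    format = "url"
  ),
  # Stored in the targets data store; iteration = "list" lets
  # downstream patterns branch over the list elements.
  tar_target(contents, read_zip_contents(url), iteration = "list"),
  tar_target(
    stuff_done,
    summarize_content(contents),
    pattern = map(contents)
  )
)
```

Because `contents` is an ordinary target rather than a `format = "file"` target, it can be branched over directly, with no extra pattern target in between.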
