Is it possible to download just part of a ZIP archive (e.g. one file)?



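Is there a way to download only part of a .rar or .zip file without downloading the whole file?

There is a ZIP file containing files A, B, C and D. I only need A. Can I somehow tweak the download so that only A is fetched, or, if possible, extract the file on the server itself and get only A?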

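The trick is to do what Sergio suggests without doing it by hand. This is easy if you mount the ZIP file through an HTTP-backed virtual file system and then run the standard unzip command on it. That way, the unzip utility's I/O calls are translated into HTTP range GETs, which means only the chunks of the ZIP file that you actually want are transferred over the network.

Here is an example for Linux using HTTPFS, a very lightweight virtual file system (it uses FUSE). Similar tools exist for Windows.

Get and build httpfs: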

$ wget http://sourceforge.net/projects/httpfs/files/httpfs/1.06.07.02
$ mv 1.06.07.02 httpfs_1.06.07.10.tar.bz2
$ tar -xjf httpfs_1.06.07.10.tar.bz2
$ rm httpfs
$ ./make_httpfs

Mount a remote ZIP file and extract one file from it:

$ mkdir mount_pt
$ sudo ./httpfs http://server.com/zipfile.zip mount_pt
$ sudo ls mount_pt
zipfile.zip
$ sudo unzip -p mount_pt/zipfile.zip the_file_I_want.txt > the_file_I_want.txt
$ sudo umount mount_pt

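Of course, you can also use any other tool besides the command-line one. (I needed sudo because that seems to be how FUSE is set up on my machine; you shouldn't need it.)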
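
The same idea can be tried without a FUSE mount. Below is a minimal Python sketch, not a drop-in tool: a seekable file object turns every read into a byte-range fetch, and the standard zipfile module is pointed at it. The names RangeReader, http_fetcher and read_member are my own, and the sketch assumes the server answers HEAD with a Content-Length and honours Range requests.

```python
import io
import urllib.request
import zipfile


class RangeReader(io.RawIOBase):
    """Read-only, seekable file object that fetches bytes on demand."""

    def __init__(self, size, fetch):
        self._size = size    # total length of the remote file
        self._fetch = fetch  # callable(start, end_inclusive) -> bytes
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = self._size + offset
        else:
            raise ValueError("invalid whence")
        return self._pos

    def readinto(self, buf):
        # Clamp the request to what is left, then fetch exactly that range.
        n = min(len(buf), self._size - self._pos)
        if n <= 0:
            return 0
        data = self._fetch(self._pos, self._pos + n - 1)
        if len(data) != n:
            raise OSError("server did not return the requested range")
        buf[:n] = data
        self._pos += n
        return n


def http_fetcher(url):
    """Return (size, fetch) for a URL; assumes the server honours Range."""
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head) as resp:
        size = int(resp.headers["Content-Length"])

    def fetch(start, end):
        req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    return size, fetch


def read_member(size, fetch, member):
    """Open the remote archive lazily and return one member's bytes."""
    raw = io.BufferedReader(RangeReader(size, fetch), buffer_size=64 * 1024)
    with zipfile.ZipFile(raw) as archive:
        return archive.read(member)
```

Usage would look like `size, fetch = http_fetcher("http://server.com/zipfile.zip")` followed by `read_member(size, fetch, "the_file_I_want.txt")`. The 64 KiB buffer is an arbitrary choice that trades a few extra bytes for fewer round trips.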

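In a way, yes, you can.

The ZIP file format has a "central directory". Basically, this is a table that records which files are in the archive and at what offsets they are stored.

So, using an HTTP Range request (the server replies with a Content-Range header), you could download part of the file from the end (the central directory is the last thing in a ZIP file) and try to identify the central directory in it. If you succeed, you know the file list and the offsets, so you can go on to fetch those chunks separately and decompress them yourself.

This approach is quite error-prone and is not guaranteed to work. But so is hacking in general :-)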
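
To make the "identify the central directory" step concrete, here is a rough Python sketch that parses a downloaded tail. parse_tail is a name I made up; the field layout follows the ZIP format's end-of-central-directory record and central directory file header. It is only a sketch: ZIP64 archives, multi-disk archives and a signature that happens to appear inside the archive comment are not handled.

```python
import struct

EOCD_SIG = b"PK\x05\x06"  # end of central directory record
CDH_SIG = b"PK\x01\x02"   # central directory file header


def parse_tail(tail, total_size):
    """List (name, local_header_offset, compressed_size, method) tuples.

    `tail` holds the last len(tail) bytes of an archive of `total_size` bytes.
    """
    pos = tail.rfind(EOCD_SIG)
    if pos < 0:
        raise ValueError("end-of-central-directory record not found in tail")
    # After the signature: two disk numbers, two entry counts,
    # central directory size, its absolute offset, comment length.
    _, _, _, count, _cd_size, cd_offset, _ = struct.unpack_from("<HHHHIIH", tail, pos + 4)

    tail_start = total_size - len(tail)
    if cd_offset < tail_start:
        raise ValueError(f"tail too short: need {tail_start - cd_offset} more bytes")

    p = cd_offset - tail_start
    entries = []
    for _ in range(count):
        fields = struct.unpack_from("<4s6H3I5H2I", tail, p)
        if fields[0] != CDH_SIG:
            raise ValueError("bad central directory header")
        flags, method = fields[3], fields[4]
        comp_size = fields[8]
        name_len, extra_len, comment_len = fields[10], fields[11], fields[12]
        header_offset = fields[16]
        raw_name = tail[p + 46:p + 46 + name_len]
        # Bit 11 of the flags marks UTF-8 names; otherwise CP437 is the default.
        name = raw_name.decode("utf-8" if flags & 0x800 else "cp437")
        entries.append((name, header_offset, comp_size, method))
        p += 46 + name_len + extra_len + comment_len
    return entries
```

If the function complains that the tail is too short, the error message says how many more bytes to request in front of it.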

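Another possible way is to build a custom server for this (see pst's answer for more details).

There are several ways for an ordinary user to download a single file from a compressed ZIP file; unfortunately, they are not common knowledge. There are some open-source tools and online web services, including:

  • Windows: Iczelion's HTTP Zip Downloader (open source) (I've been using it for over 10 years!)
  • Linux: partial-zip (open source)
  • Online: wobzip.org (closed source)

You can arrange for your file to appear at the end of the ZIP file.

Download the last 100 kB: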

$ curl -r -100000 https://www.keepassx.org/releases/2.0.2/KeePassX-2.0.2.zip -o tail.zip
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                             Dload  Upload   Total   Spent    Left  Speed
100   97k  100   97k    0     0  84739      0  0:00:01  0:00:01 --:--:-- 84817

Check which files we got:

$ unzip -t tail.zip
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)
error [tail.zip]:  attempt to seek before beginning of zipfile
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)
error [tail.zip]:  attempt to seek before beginning of zipfile
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)
error [tail.zip]:  attempt to seek before beginning of zipfile
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)
error [tail.zip]:  attempt to seek before beginning of zipfile
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)
    testing: KeePassX-2.0.2/share/translations/keepassx_uk.qm   OK
    testing: KeePassX-2.0.2/share/translations/keepassx_zh_CN.qm   OK
    testing: KeePassX-2.0.2/share/translations/keepassx_zh_TW.qm   OK
    testing: KeePassX-2.0.2/zlib1.dll   OK
At least one error was detected in tail.zip.

Then extract the last file:

$ unzip tail.zip KeePassX-2.0.2/zlib1.dll
Archive:  tail.zip
error [tail.zip]:  missing 7751495 bytes in zipfile
  (attempting to process anyway)
  inflating: KeePassX-2.0.2/zlib1.dll

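I think Sergio Tulentsev's idea is brilliant.

However, if there is control over the server (for example, custom code can be deployed), then mapping/handling a request, extracting the relevant portion of the ZIP archive, and sending the data back in the HTTP stream is a fairly trivial operation (in the scheme of things).

The request might look like: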

http://foo.bar/myfile.zip_a.jpeg

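This would mean: extract, and return, "a.jpeg" from "myfile.zip".

(I intentionally chose this silly format so that browsers would likely pick "myfile.zip_a.jpeg" as the name shown in the download dialog.)

Of course, how this is implemented depends on the server/language/framework, and there may already be existing solutions that support a similar operation (but I don't know of any).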
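
Just to illustrate, here is roughly what such a handler could look like with nothing but Python's standard library. Everything here is an assumption for the sake of the example: the ARCHIVE_DIR location, the helper names, and the rule that the first ".zip_" in the path separates the archive name from the member name. It reads the whole member into memory, so treat it as a sketch rather than something production-ready.

```python
import zipfile
from http.server import BaseHTTPRequestHandler
from pathlib import Path
from urllib.parse import unquote

ARCHIVE_DIR = Path("archives")  # hypothetical directory holding the ZIP files


def split_request_path(path):
    """Split '/myfile.zip_a.jpeg' into ('myfile.zip', 'a.jpeg')."""
    name = unquote(path).lstrip("/")
    archive, sep, member = name.partition(".zip_")
    if not sep or not archive or not member:
        raise ValueError("expected <archive>.zip_<member>")
    return archive + ".zip", member


def extract_member(archive_dir, archive_name, member):
    """Return the bytes of one member; only bare archive names are accepted."""
    if Path(archive_name).name != archive_name:
        raise ValueError("archive name must not contain a directory")
    with zipfile.ZipFile(Path(archive_dir) / archive_name) as archive:
        return archive.read(member)  # raises KeyError if the member is missing


class ZipMemberHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            archive_name, member = split_request_path(self.path)
            body = extract_member(ARCHIVE_DIR, archive_name, member)
        except (ValueError, KeyError, OSError, zipfile.BadZipFile):
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, something like `HTTPServer(("127.0.0.1", 8000), ZipMemberHandler).serve_forever()` (with HTTPServer imported from http.server) should be enough.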

Based on the good input here, I wrote a code snippet in PowerShell to show how it can work:

# demo code downloading a single DLL file from an online ZIP archive
# and extracting the DLL into memory to mount it finally to the main process.
cls
Remove-Variable * -ea 0
# definition for the ZIP archive, the file to be extracted and the checksum:
$url = 'https://github.com/sshnet/SSH.NET/releases/download/2020.0.1/SSH.NET-2020.0.1-bin.zip'
$sub = 'net40/Renci.SshNet.dll'
$md5 = '5B1AF51340F333CD8A49376B13AFCF9C'
# prepare HTTP client:
Add-Type -AssemblyName System.Net.Http
$handler = [System.Net.Http.HttpClientHandler]::new()
$client  = [System.Net.Http.HttpClient]::new($handler)
# get the length of the ZIP archive:
$req = [System.Net.HttpWebRequest]::Create($url)
$req.Method = 'HEAD'
$length = $req.GetResponse().ContentLength
$zip = [byte[]]::new($length)
# get the last 10k:
# how to get the correct length of the central ZIP directory here?
$start = $length-10kb
$end   = $length-1
$client.DefaultRequestHeaders.Add('Range', "bytes=$start-$end")
$result = $client.GetAsync($url).Result
$last10kb = $result.content.ReadAsByteArrayAsync().Result
$last10kb.CopyTo($zip, $start)
# get the block containing the DLL file:
# how to get the exact file-offset from the ZIP directory?
$start = $length-3537kb
$end   = $length-3201kb
$client.DefaultRequestHeaders.Clear()
$client.DefaultRequestHeaders.Add('Range', "bytes=$start-$end")
$result = $client.GetAsync($url).Result
$block = $result.content.ReadAsByteArrayAsync().Result
$block.CopyTo($zip, $start)
# extract the DLL file from archive:
Add-Type -AssemblyName System.IO.Compression
$stream = [System.IO.Memorystream]::new()
$stream.Write($zip,0,$zip.Length)
$archive = [System.IO.Compression.ZipArchive]::new($stream)
$entry = $archive.GetEntry($sub)
# copy the whole entry; a single Read() call may return fewer bytes than requested
$out = [System.IO.MemoryStream]::new()
$entry.Open().CopyTo($out)
$bytes = $out.ToArray()
# check MD5:
$prov = [Security.Cryptography.MD5CryptoServiceProvider]::new().ComputeHash($bytes)
$hash = [string]::Concat($prov.foreach{$_.ToString("x2")})
if ($hash -ne $md5) {write-host 'dll has wrong checksum.' -f y ;break}
# load the DLL:
[void][System.Reflection.Assembly]::Load($bytes)
# use the single demo-call from the DLL:
$test = [Renci.SshNet.NoneAuthenticationMethod]::new('test')
'done.'
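
Regarding the two open questions in the comments of that snippet: the exact offsets do not have to be guessed. The central directory gives each member's local header offset, compressed size and compression method, and the local file header at that offset says where the data starts. A hedged Python sketch of that last step follows; extract_entry and the fetch callback are names of my own, and only the "stored" and "deflate" methods are covered.

```python
import struct
import zlib

LFH_SIG = b"PK\x03\x04"  # local file header signature


def extract_entry(fetch, header_offset, comp_size, method):
    """Fetch and decompress one member using only ranged reads.

    `fetch(start, end_inclusive)` must return that byte range of the archive.
    The other arguments come from the member's central directory entry.
    """
    # The fixed part of a local file header is 30 bytes; the lengths of the
    # file name and the extra field are its last two 16-bit fields.
    fixed = fetch(header_offset, header_offset + 29)
    if fixed[:4] != LFH_SIG:
        raise ValueError("no local file header at this offset")
    name_len, extra_len = struct.unpack_from("<HH", fixed, 26)
    data_start = header_offset + 30 + name_len + extra_len

    if comp_size == 0:
        return b""
    raw = fetch(data_start, data_start + comp_size - 1)
    if method == 0:  # stored, no compression
        return raw
    if method == 8:  # deflate: a raw stream without a zlib header
        return zlib.decompress(raw, -15)
    raise NotImplementedError(f"compression method {method}")
```

With this, two or three range requests are enough: one for the tail with the central directory, one for the 30-byte local header, and one for the compressed data.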
