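I have a .tar file that contains many .gz files inside a folder. Each .gz file contains a single .txt file. The other Stack Overflow questions related to this problem are aimed at extracting the files.
I am trying to read the contents of each .txt file iteratively without extracting them, because the .tar is large.
First, I read the contents of the .tar file: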
import tarfile
tar = tarfile.open("FILE.tar")
tar.getmembers()
Or in Unix:
tar xvf file.tar -O
Then I tried the tarfile extractfile method, but I got an error: "module 'tarfile' has no attribute 'extractfile'". Besides, I am not even sure this is the right approach.
import gzip
for member in tar.getmembers():
    # This is the call that raises the AttributeError
    m = tarfile.extractfile(member)
    file_contents = gzip.GzipFile(fileobj=m).read()
If you want to create a sample file to mimic the original:
$ mkdir directory
$ touch directory/file1.txt.gz directory/file2.txt.gz directory/file3.txt.gz
$ tar -c -f file.tar directory
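Note that touch creates empty files, so the archive built this way has no text to read back. A minimal Python sketch of a sample archive holding real gzip-compressed text follows; the helper name build_sample_tar and the member layout under directory/ are assumptions for illustration, not part of the original question.

```python
import gzip
import io
import tarfile


def build_sample_tar(path, texts):
    """Write an uncompressed .tar at `path` with one gzip member per entry in `texts`.

    `texts` maps a base name such as "file1.txt" to the text it should contain.
    """
    with tarfile.open(path, "w") as tar:
        for name, text in texts.items():
            # Compress the text in memory so nothing but the .tar touches the disk
            payload = gzip.compress(text.encode("utf-8"))
            info = tarfile.TarInfo(name=f"directory/{name}.gz")
            info.size = len(payload)  # addfile() reads exactly this many bytes
            tar.addfile(info, io.BytesIO(payload))
```

For example, build_sample_tar("file.tar", {"file1.txt": "hello\n"}) would produce an archive with a single member named directory/file1.txt.gz.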
This is the final version that worked for me after following Mark Adler's suggestion:
import gzip
import tarfile
tar = tarfile.open("file.tar")
members = tar.getmembers()
# Here I append the results in a list, because I wasn't able to
# parse the tarfile type returned by .getmembers():
tar_name = []
for elem in members:
    tar_name.append(elem.name)
# Then I changed tarfile.extractfile to tar.extractfile as suggested:
for member in tar_name:
    # I'm using this because I have other non-gzs in the directory
    if member.endswith(".gz"):
        m = tar.extractfile(member)
        file_contents = gzip.GzipFile(fileobj=m).read()
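You need to use tar.extractfile(member) instead of tarfile.extractfile(member). tarfile is the module, which knows nothing about the tar file you opened. tar is the TarFile object that refers to the .tar file you opened.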
To do it properly, use next() instead of getmembers() or getnames(), so that you don't have to read through the entire tar file twice:
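import gzip
import sys
import tarfile
with tarfile.open(sys.argv[1]) as tar:
    # tar.next() returns None after the last entry, which ends the loop
    while ent := tar.next():
        if ent.name.endswith(".gz"):
            print(gzip.GzipFile(fileobj=tar.extractfile(ent)).read())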
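If the individual .txt files are large as well, .read() pulls each one into memory in a single piece. The sketch below is one possible variation on the same idea: it walks the archive entry by entry and decodes each member line by line. The function name iter_gz_lines and the UTF-8 default are assumptions, not something stated in the answer.

```python
import gzip
import tarfile


def iter_gz_lines(tar_path, encoding="utf-8"):
    """Yield (member_name, line) pairs for every .gz member of the archive."""
    with tarfile.open(tar_path) as tar:
        # Iterating the TarFile visits entries one at a time, much like tar.next()
        for ent in tar:
            if not (ent.isfile() and ent.name.endswith(".gz")):
                continue  # skip directories and non-gzip members
            raw = tar.extractfile(ent)
            # gzip.open accepts an existing file object; "rt" decodes to text
            with gzip.open(raw, mode="rt", encoding=encoding) as text:
                for line in text:
                    yield ent.name, line
```

A caller could then write for name, line in iter_gz_lines("file.tar"): ... and process one line at a time.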
Here is the Unix command-line (bash) approach:
Prepare the files:
$ git clone https://github.com/githubtraining/hellogitworld.git
$ cd hellogitworld
$ gzip *
$ ls
build.gradle.gz fix.txt.gz pom.xml.gz README.txt.gz resources runme.sh.gz src
$ cd ..
$ tar -cf hellogitworld.tar hellogitworld/
Here is how to view its README:
$ tar -Oxf hellogitworld.tar hellogitworld/README.txt.gz | zcat
Result:
This is a sample project students can use during Matthew's Git class.
Here is an addition by me
We can have a bit of fun with this repo, knowing that we can always reset it to a known good state. We can apply labels, and branch, then add new code and merge it in to the master branch.
As a quick reminder, this came from one of three locations in either SSH, Git, or HTTPS format:
* git@github.com:matthewmccullough/hellogitworld.git
* git://github.com/matthewmccullough/hellogitworld.git
* https://matthewmccullough@github.com/matthewmccullough/hellogitworld.git
We can, as an example effort, even modify this README and change it as if it were source code for the purposes of the class.
This demo also includes an image with changes on a branch for examination of image diff on GitHub.
Note that I am not affiliated with those git repositories.
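Explanation of the tar flags:
- -x = extract
- -O = do not write the files to the file system; write them to STDOUT instead
- -f = specify the archive file
The rest is simply piping the result to zcat to see the uncompressed plain text on STDOUT.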
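For comparison, a rough Python counterpart of the tar -Oxf ... | zcat pipeline might look like the sketch below, which reads one named member into memory without writing it to disk. The helper name read_gz_member is hypothetical and is not part of tarfile or gzip.

```python
import gzip
import tarfile


def read_gz_member(tar_path, member_name):
    """Return the decompressed bytes of one gzip member of the archive."""
    with tarfile.open(tar_path) as tar:
        # extractfile() raises KeyError if no member has this name
        raw = tar.extractfile(member_name)
        if raw is None:
            # extractfile() returns None for entries such as directories
            raise ValueError(f"{member_name!r} is not a regular file")
        return gzip.decompress(raw.read())
```

Assuming the archive was built as shown in the preparation steps, read_gz_member("hellogitworld.tar", "hellogitworld/README.txt.gz") would return the README contents as bytes.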
import gzip
import tarfile
with tarfile.TarFile("data.tar", 'r') as tar_fd:
    for files in tar_fd.getnames():
        if files.endswith(".gz"):
            file = tar_fd.extractfile(files)
            # readline() returns only the first line of the decompressed file
            file_content = gzip.GzipFile(fileobj=file).readline()
            print(file_content)