Why does file_object.tell() give the same byte position for different locations in a file?



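I'm just getting started with Python and I can't get my head around the basic file navigation methods.

When I read the tell() tutorial, it states that it returns my current position in the file, in bytes.

My reasoning is that every character in the file adds to the byte coordinate, right? That means that after a newline, which is just a string of characters split on the \n character, my byte coordinate should change... but this doesn't seem to be the case.

I generate a quick toy text file in bash: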

$ for i in {1..10}; do echo "@ this is the "$i"th line" ; done > toy.txt
$ for i in {11..20}; do echo " this is the "$i"th line" ; done >> toy.txt

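Now I iterate over this file and print the line number, and on every cycle I print the result of the tell() call. The @ is there to mark some lines that delimit blocks of the file, which I'd like to return to (see below).

My guess is that the for loop first walks through the whole file object and reaches its end, which is why the value always stays the same.

This is a toy example. In my real problem the files are gigabytes long, and applying the same method I get tell() results that change in blocks, which I imagine reflects how the for loop iterates over the file object. Is this right? Could you shed some light on the concepts I'm missing?

My ultimate goal is to be able to find specific coordinates in the file and then process these huge files in parallel from distributed starting points, which I can't monitor the way I'm screening them now.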

import os

os.path.getsize("toy.txt")  # 451
fa = open("toy.txt")
fa.seek(0)  # let's double check
fa.tell()
count = 0
for line in fa:
    if line.startswith("@"):
        print line,
        print "tell {} count {}".format(fa.tell(), count)
    else:
        if count < 32775:
            print line,
            print "tell {} count {}".format(fa.tell(), count)
    count += 1

Output:

@ this is the 1th line
tell 451 count 0
@ this is the 2th line
tell 451 count 1
@ this is the 3th line
tell 451 count 2
@ this is the 4th line
tell 451 count 3
@ this is the 5th line
tell 451 count 4
@ this is the 6th line
tell 451 count 5
@ this is the 7th line
tell 451 count 6
@ this is the 8th line
tell 451 count 7
@ this is the 9th line
tell 451 count 8
@ this is the 10th line
tell 451 count 9
this is the 11th line
tell 451 count 10
this is the 12th line
tell 451 count 11
this is the 13th line
tell 451 count 12
this is the 14th line
tell 451 count 13
this is the 15th line
tell 451 count 14
this is the 16th line
tell 451 count 15
this is the 17th line
tell 451 count 16
this is the 18th line
tell 451 count 17
this is the 19th line
tell 451 count 18
this is the 20th line
tell 451 count 19

You are using a for loop to read the file line by line:

for line in fa:

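Files don't normally work that way; you read blobs of data, usually in blocks. For Python to give you lines, it needs to read up to the next newline. Only, reading byte by byte to find the newline is not efficient.

So a buffer is used; a large block is read, the newlines in that block are located, and a line is yielded for each one found. Once the buffer is exhausted, a new block is read.

Your file is not large enough to need more than one block; it is only 451 bytes, while buffers are typically measured in kilobytes. If you were to create a larger file, you'd see the file position jump in larger steps as you iterate.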
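The read-ahead can be made visible by comparing the operating-system-level offset of the file descriptor with the number of bytes the loop has actually been handed. What follows is only a rough Python 3 sketch, not the Python 2 code from the question; the helper name show_read_ahead and the temporary file are made up for illustration, and the exact block size depends on the platform.

```python
import os
import tempfile


def show_read_ahead(path):
    """Return (bytes consumed by the loop, OS-level offset) after one line."""
    with open(path, "rb") as fa:
        first_line = next(fa)  # iteration fills the internal buffer
        # Offset of the underlying descriptor, ignoring Python's buffer.
        os_offset = os.lseek(fa.fileno(), 0, os.SEEK_CUR)
        return len(first_line), os_offset


# Build a small file similar to toy.txt (hypothetical content).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    for i in range(1, 21):
        tmp.write("@ this is the {}th line\n".format(i).encode("ascii"))
    toy_path = tmp.name

consumed, os_offset = show_read_ahead(toy_path)
file_size = os.path.getsize(toy_path)
print("consumed {} bytes, OS offset {}, file size {}".format(
    consumed, os_offset, file_size))
os.remove(toy_path)
```

For a file this small, the whole content should already have been pulled into the buffer after the first line, so the descriptor sits at the end of the file even though the loop has only seen one line.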

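See the file.next documentation (next is the method responsible for producing the next line while iterating, which is what the for loop does):

In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer.

If you need to keep track of the absolute file position while looping over lines, you'd have to use binary mode on Windows (to prevent newline translation from taking place), and track the line lengths yourself: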

position = 0
for line in fa:
    position += len(line)
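Fleshed out, that bookkeeping might look like the sketch below. It is written for Python 3, opens the file in binary mode, and records the byte offset at which each line starting with @ begins. The function name index_marked_lines and the sample data are illustrative only, not something from the question.

```python
import os
import tempfile


def index_marked_lines(path, marker=b"@"):
    """Return the byte offset of the start of every line beginning with marker."""
    offsets = []
    position = 0  # absolute offset of the line about to be read
    with open(path, "rb") as fa:  # binary mode: len(line) is a byte count
        for line in fa:
            if line.startswith(marker):
                offsets.append(position)
            position += len(line)
    return offsets


# Hypothetical sample: two marked lines and one unmarked line.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"@ first block\n  payload\n@ second block\n")
    sample_path = tmp.name

marked = index_marked_lines(sample_path)

# Jump straight to the second marked line using the recorded offset.
with open(sample_path, "rb") as fa:
    fa.seek(marked[1])
    second = fa.readline()

os.remove(sample_path)
```

The recorded offsets can be handed to seek() later, so a marked line can be revisited without rereading everything before it.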

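An alternative is to use the io library; this is the framework used in Python 3 for handling files. Its file.tell() method takes the buffers into account and produces accurate file positions even when iterating.

Note that when you open a file in text mode with io.open(), you get unicode strings. In Python 2, if you must have str bytestrings, you can simply use binary mode (open with 'rb'). In fact, only in binary mode will you have access to IOBase.tell(); in text mode an exception is raised: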

>>> import io
>>> fa = io.open("toy.txt")
>>> next(fa)
u'@ this is the 1th line\n'
>>> fa.tell()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: telling position disabled by next() call
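As the message says, the restriction is tied to next(). To my knowledge, reading with readline() instead of a for loop leaves tell() usable in text mode under Python 3. A minimal sketch, assuming a plain ASCII file with \n line endings (the sample file is hypothetical, and the value tell() returns in text mode is documented as an opaque number, so it is best treated as something to pass back to seek()):

```python
import io
import os
import tempfile

# Hypothetical sample file with the same first lines as toy.txt.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"@ this is the 1th line\n@ this is the 2th line\n")
    text_path = tmp.name

positions = []
with io.open(text_path, "r", encoding="utf-8", newline="\n") as fa:
    # iter(callable, sentinel) calls readline() until it returns "".
    for line in iter(fa.readline, ""):
        positions.append(fa.tell())  # allowed: next() was never called

os.remove(text_path)
print(positions)
```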

In binary mode, you do get accurate output from file.tell():

>>> import os.path
>>> os.path.getsize("toy.txt")
461
>>> fa = io.open("toy.txt", 'rb')
>>> count = 0
>>> for line in fa:
...     if line.startswith("@"):
...         print line ,
...         print "tell {} count {}".format(fa.tell(), count)
...     else:
...         if count < 32775:
...             print line,
...             print "tell {} count {}".format(fa.tell(), count)
...     count += 1
...
@ this is the 1th line
tell 23 count 0
@ this is the 2th line
tell 46 count 1
@ this is the 3th line
tell 69 count 2
@ this is the 4th line
tell 92 count 3
@ this is the 5th line
tell 115 count 4
@ this is the 6th line
tell 138 count 5
@ this is the 7th line
tell 161 count 6
@ this is the 8th line
tell 184 count 7
@ this is the 9th line
tell 207 count 8
@ this is the 10th line
tell 231 count 9
this is the 11th line
tell 254 count 10
this is the 12th line
tell 277 count 11
this is the 13th line
tell 300 count 12
this is the 14th line
tell 323 count 13
this is the 15th line
tell 346 count 14
this is the 16th line
tell 369 count 15
this is the 17th line
tell 392 count 16
this is the 18th line
tell 415 count 17
this is the 19th line
tell 438 count 18
this is the 20th line
tell 461 count 19

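When you iterate over a file, an internal buffer is used to minimize expensive IO operations, so the file position is not necessarily at the last character the loop has seen.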
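With that in mind, one possible way to get reliable starting points for parallel workers is to seek to rough byte offsets in binary mode and then realign to the next line boundary with readline(), after which tell() can be trusted. This is only a sketch in Python 3 under those assumptions; line_aligned_offsets and the sample file are hypothetical names, and realigning to @ block boundaries rather than plain lines would need an extra scan.

```python
import os
import tempfile


def line_aligned_offsets(path, parts):
    """Return up to `parts` start offsets, each at the beginning of a line."""
    size = os.path.getsize(path)
    starts = [0]
    with open(path, "rb") as fa:
        for k in range(1, parts):
            guess = size * k // parts
            if guess == 0:
                continue
            fa.seek(guess - 1)
            fa.readline()      # finish the line the guess landed in
            start = fa.tell()  # accurate here: binary mode, no next() call
            if starts[-1] < start < size:
                starts.append(start)
    return starts


# Hypothetical sample: four lines of four bytes each.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"aaa\nbbb\nccc\nddd\n")
    chunk_path = tmp.name

starts = line_aligned_offsets(chunk_path, 3)

# Each worker would seek to its own start and read up to the next one.
with open(chunk_path, "rb") as fa:
    fa.seek(starts[1])
    first_line_of_second_chunk = fa.readline()

os.remove(chunk_path)
```

Seeking to guess - 1 rather than guess means an offset that already falls on a line start is kept as it is instead of being pushed to the following line.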