Determining the type of the 'file.read()' result from the "file" object in Python

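I have some code that operates on file objects in Python.

Since the str/bytes split in Python 3, file.read() returns bytes if the file was opened in binary mode, and str if it was opened in text mode.

In my code, file.read() is called many times, so checking the type of the result on every call is impractical, e.g.: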

def foo(file_obj):
    while True:
        data = file_obj.read(1)
        if not data:
            break
        if isinstance(data, bytes):
            # do something for bytes
            ...
        else:  # isinstance(data, str)
            # do something for str
            ...

What I would like instead is a reliable way to determine, up front, what file.read() will return, e.g.:

def foo(file_obj):
    if is_binary_file(file_obj):
        # do something for bytes
        while True:
            data = file_obj.read(1)
            if not data:
                break
            ...
    else:
        # do something for str
        while True:
            data = file_obj.read(1)
            if not data:
                break
            ...

One possible approach is to check file_obj.mode, e.g.:

import io

def is_binary_file(file_obj):
    return 'b' in file_obj.mode

print(is_binary_file(open('test_file', 'w')))
# False
print(is_binary_file(open('test_file', 'wb')))
# True
print(is_binary_file(io.StringIO('ciao')))
# AttributeError: '_io.StringIO' object has no attribute 'mode'
print(is_binary_file(io.BytesIO(b'ciao')))
# AttributeError: '_io.BytesIO' object has no attribute 'mode'

This fails for objects from io such as io.StringIO() and io.BytesIO().

Another approach, which also works for io objects, is to check for the encoding attribute, e.g.:

import io

def is_binary_file(file_obj):
    return not hasattr(file_obj, 'encoding')

print(is_binary_file(open('test_file', 'w')))
# False
print(is_binary_file(open('test_file', 'wb')))
# True
print(is_binary_file(io.StringIO('ciao')))
# False 
print(is_binary_file(io.BytesIO(b'ciao')))
# True

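Is there a cleaner way to perform this check?

I have a version of this in astropy (for Python 3, though a Python 2 version can be found in older versions of Astropy if you need it for some reason).

It is not pretty, but it works reliably enough in most cases (I removed the part that checks the .binary attribute, since it only applies to classes in Astropy):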

import io

def fileobj_is_binary(f):
    """
    Returns True if the given file or file-like object has a file open in binary
    mode.  When in doubt, returns True by default.
    """
    # Text streams (including io.StringIO) are never binary
    if isinstance(f, io.TextIOBase):
        return False
    mode = fileobj_mode(f)
    if mode:
        return 'b' in mode
    else:
        return True

where fileobj_mode is:

import gzip

def fileobj_mode(f):
    """
    Returns the 'mode' string of a file-like object if such a thing exists.
    Otherwise returns None.
    """
    # Go from most to least specific--for example gzip objects have a 'mode'
    # attribute, but it's not analogous to the file.mode attribute

    # gzip.GzipFile -like
    if hasattr(f, 'fileobj') and hasattr(f.fileobj, 'mode'):
        fileobj = f.fileobj
    # astropy.io.fits._File -like, doesn't need additional checks because it's
    # already validated
    elif hasattr(f, 'fileobj_mode'):
        return f.fileobj_mode
    # PIL-Image -like investigate the fp (filebuffer)
    elif hasattr(f, 'fp') and hasattr(f.fp, 'mode'):
        fileobj = f.fp
    # FILEIO -like (normal open(...)), keep as is.
    elif hasattr(f, 'mode'):
        fileobj = f
    # Doesn't look like a file-like object, for example strings, urls or paths.
    else:
        return None
    return _fileobj_normalize_mode(fileobj)

def _fileobj_normalize_mode(f):
    """Takes care of some corner cases in Python where the mode string
    is either oddly formatted or does not truly represent the file mode.
    """
    mode = f.mode
    # Special case: Gzip modes:
    if isinstance(f, gzip.GzipFile):
        # GzipFiles can be either readonly or writeonly
        if mode == gzip.READ:
            return 'rb'
        elif mode == gzip.WRITE:
            return 'wb'
        else:
            return None  # This shouldn't happen?
    # Sometimes Python can produce modes like 'r+b' which will be normalized
    # here to 'rb+'
    if '+' in mode:
        mode = mode.replace('+', '')
        mode += '+'
    return mode

You may also want to add a special case for io.BytesIO. Again, it is ugly, but it works in most cases. It would be great if there were a simpler way.
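
As a rough illustration of the io.BytesIO special case mentioned above, here is a simplified, self-contained sketch. It is not the Astropy code: the name fileobj_is_binary_sketch is hypothetical, and it reads the mode attribute directly instead of going through fileobj_mode, so it does not handle the gzip or PIL wrappers.

```python
import io
import types

def fileobj_is_binary_sketch(f):
    """Simplified, hypothetical variant with an explicit io.BytesIO case."""
    # Text streams (io.StringIO, files opened in text mode) are never binary
    if isinstance(f, io.TextIOBase):
        return False
    # Explicit special case: in-memory byte buffers have no 'mode' attribute
    if isinstance(f, io.BytesIO):
        return True
    # Fall back to the mode string when one is available
    mode = getattr(f, 'mode', None)
    if isinstance(mode, str):
        return 'b' in mode
    # When in doubt, assume binary (same default as the function above)
    return True

bytes_io_result = fileobj_is_binary_sketch(io.BytesIO(b'ciao'))
string_io_result = fileobj_is_binary_sketch(io.StringIO('ciao'))
# Stand-in for a third-party object that only exposes a text-style mode string
fake_text_result = fileobj_is_binary_sketch(types.SimpleNamespace(mode='r'))
print(bytes_io_result, string_io_result, fake_text_result)
```

The explicit branch does not change the result for io.BytesIO (the default would already be True); it only makes the intent visible instead of relying on the fallback.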

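After doing a bit more homework, I can probably answer my own question.

First, a general comment: checking for the presence of an attribute or method as a marker for a whole API is not a good idea, because it leads to more complex code that is still relatively unsafe.

Following the EAFP/duck-typing mindset, it may be fine to check for a specific method, but it should be the method that is subsequently used in the code.

The problem with file.read() (and even more so with file.write()) is that it has side effects, which makes it impractical to just try it and see what happens.

For this specific case, while still following the duck-typing mindset, one can exploit the fact that the first argument of read() can be set to 0. This does not actually read anything from the buffer (and does not change the result of file.tell()), but it does return an empty str or bytes. So one could write something like: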

def is_reading_bytes(file_obj):
    return isinstance(file_obj.read(0), bytes)

print(is_reading_bytes(open('test_file', 'r')))
# False
print(is_reading_bytes(open('test_file', 'rb')))
# True
print(is_reading_bytes(io.StringIO('ciao')))
# False 
print(is_reading_bytes(io.BytesIO(b'ciao')))
# True
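
To check the claim that read(0) leaves the stream position untouched, here is a small sketch using in-memory streams. The helper name probe_read_type is made up for this illustration.

```python
import io

def probe_read_type(file_obj):
    """Return (type of the read(0) result, position before, position after)."""
    before = file_obj.tell()
    empty = file_obj.read(0)  # requests zero characters/bytes
    after = file_obj.tell()
    return type(empty), before, after

text_stream = io.StringIO('ciao')
text_stream.read(2)  # advance the position first
binary_stream = io.BytesIO(b'ciao')
binary_stream.read(2)

text_result = probe_read_type(text_stream)
binary_result = probe_read_type(binary_stream)
print(text_result)
print(binary_result)
```

In both cases the position reported by tell() is the same before and after the probe, so the check can be done at any point while consuming the stream.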

Similarly, one can try writing an empty bytes string b'' with the write() method:

def is_writing_bytes(file_obj):
    try:
        # Writing zero bytes has no effect on a binary stream
        file_obj.write(b'')
    except TypeError:
        # Text streams reject bytes arguments
        return False
    else:
        return True

print(is_writing_bytes(open('test_file', 'w')))
# False
print(is_writing_bytes(open('test_file', 'wb')))
# True
print(is_writing_bytes(io.StringIO('ciao')))
# False 
print(is_writing_bytes(io.BytesIO(b'ciao')))
# True

Note that these methods do not check for readability/writability.
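
If readability matters, one option is to ask the stream first. The sketch below assumes the object implements readable() from the io.IOBase interface, which a purely duck-typed object may not; the name read_type is hypothetical.

```python
import io
import os
import tempfile

def read_type(file_obj):
    """Return bytes or str for a readable stream, or None if it is not readable.

    Assumes file_obj provides the io.IOBase readable() method.
    """
    if not file_obj.readable():
        return None
    return type(file_obj.read(0))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'test_file')
    # A write-only binary file: read(0) would raise, so check readable() first
    with open(path, 'wb') as write_only:
        write_only_result = read_type(write_only)
    with open(path, 'rb') as read_only:
        read_only_result = read_type(read_only)

string_io_result = read_type(io.StringIO('ciao'))
print(write_only_result, read_only_result, string_io_result)
```

The same pattern would apply to the write probe with writable() in place of readable().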

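Finally, a proper type-checking approach can be implemented by inspecting the file-like object API. File-like objects in Python are expected to support the API described in the io module. The documentation states that TextIOBase is used for files opened in text mode, while BufferedIOBase (or RawIOBase for unbuffered streams) is used for files opened in binary mode. The class hierarchy summary shows that both are subclasses of IOBase. Therefore, the following does the trick (remember that isinstance() also checks subclasses):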

def is_binary_file(file_obj):
    return isinstance(file_obj, io.IOBase) and not isinstance(file_obj, io.TextIOBase)

print(is_binary_file(open('test_file', 'w')))
# False
print(is_binary_file(open('test_file', 'wb')))
# True
print(is_binary_file(open('test_file', 'r')))
# False
print(is_binary_file(open('test_file', 'rb')))
# True
print(is_binary_file(io.StringIO('ciao')))
# False 
print(is_binary_file(io.BytesIO(b'ciao')))
# True
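
The same isinstance() check can be tried on other standard-library streams. The sketch below uses gzip.open() and also shows a limitation: an object that merely quacks like a file without inheriting from io.IOBase is reported as non-binary. DuckReader is a hypothetical class written only for this example.

```python
import gzip
import io
import os
import tempfile

def is_binary_file(file_obj):
    return isinstance(file_obj, io.IOBase) and not isinstance(file_obj, io.TextIOBase)

class DuckReader:
    """Hypothetical file-like class that does not inherit from io.IOBase."""
    def read(self, size=-1):
        return b''

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'test_file.gz')
    # Binary mode gives a gzip.GzipFile, which derives from io.BufferedIOBase
    with gzip.open(path, 'wb') as gz_binary:
        gz_binary.write(b'ciao')
        gz_binary_result = is_binary_file(gz_binary)
    # Text mode wraps the GzipFile in an io.TextIOWrapper
    with gzip.open(path, 'rt') as gz_text:
        gz_text_result = is_binary_file(gz_text)

# Returns bytes from read(), yet fails the io.IOBase test
duck_result = is_binary_file(DuckReader())
print(gz_binary_result, gz_text_result, duck_result)
```

For objects like DuckReader, the read(0) probe is the more dependable of the two approaches, since it looks at what read() actually returns.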

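Note that the documentation explicitly states that TextIOBase has an encoding attribute, which is not required for (i.e. is absent from) binary file objects. So, with the current API, and assuming the object under test is file-like, checking for the encoding attribute may be a handy hack for telling whether a file object from the standard classes is binary. Checking the mode attribute only works for FileIO-based objects (such as those returned by open()), and mode is not part of the IOBase/RawIOBase interface, which is why it does not work on io.StringIO()/io.BytesIO() objects.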
