pandas.read_sas() for reading SAS .xpt files does not work for files stored in Google Cloud Storage (GCS)



I am trying to read a .XPT file into a pandas DataFrame. This works when the file is local, but not when the file is stored in GCS.

I uploaded the sample data to GCS using:

!curl -L https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.XPT | gsutil cp - gs://my-bucket/sas_sample/Nchs/Nhanes/2017-2018/DEMO_J.XPT

I also downloaded the file locally using:

!mkdir data
!curl https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.XPT -o data/DEMO_J.XPT

I tried the following two approaches for GCS, but neither works:

# Attempt 1: open the object with gcsfs and pass the file handle
import pandas as pd
import gcsfs

fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('my-bucket/sas_sample/Nchs/Nhanes/2017-2018/DEMO_J.XPT') as f:
    df = pd.read_sas(f, format='xport')

# Attempt 2: pass the gs:// URL directly to pandas
import pandas as pd

filepath = 'gs://my-bucket/sas_sample/Nchs/Nhanes/2017-2018/DEMO_J.XPT'
df = pd.read_sas(filepath, format='xport', encoding='utf-8')
df.head(10)

Both return the following error:

/opt/conda/anaconda/lib/python3.7/site-packages/pandas/io/sas/sas_xport.py in __init__(self, filepath_or_buffer, index, encoding, chunksize)
278             contents = filepath_or_buffer.read()
279             try:
--> 280                 contents = contents.encode(self._encoding)
281             except UnicodeEncodeError:
282                 pass
AttributeError: 'bytes' object has no attribute 'encode'

I have now also tried TensorFlow, but it does not work either:

from tensorflow.python.lib.io import file_io
import pandas as pd

filepath = 'gs://my-bucket/sas_sample/Nchs/Nhanes/2017-2018/DEMO_J.XPT'
with file_io.FileIO(filepath, 'r') as f:
    # ISO-8859-1
    # utf-8
    # utf-16
    # latin-1
    df = pd.read_sas(f, format='xport', encoding='utf-8')

df.head(5)

It returns this error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-60-fb02f0706587> in <module>
10     # utf-16
11     # latin-1
---> 12     df = pd.read_sas(f, format='xport', encoding='utf-8')
13 
14 df.head(5)
/opt/conda/anaconda/lib/python3.7/site-packages/pandas/io/sas/sasreader.py in read_sas(filepath_or_buffer, format, index, encoding, chunksize, iterator)
68 
69         reader = XportReader(
---> 70             filepath_or_buffer, index=index, encoding=encoding, chunksize=chunksize
71         )
72     elif format.lower() == "sas7bdat":
/opt/conda/anaconda/lib/python3.7/site-packages/pandas/io/sas/sas_xport.py in __init__(self, filepath_or_buffer, index, encoding, chunksize)
276         else:
277             # Copy to BytesIO, and ensure no encoding
--> 278             contents = filepath_or_buffer.read()
279             try:
280                 contents = contents.encode(self._encoding)
/opt/conda/anaconda/lib/python3.7/site-packages/tensorflow_core/python/lib/io/file_io.py in read(self, n)
126       length = n
127     return self._prepare_value(
--> 128         pywrap_tensorflow.ReadFromStream(self._read_buf, length))
129 
130   @deprecation.deprecated_args(
/opt/conda/anaconda/lib/python3.7/site-packages/tensorflow_core/python/lib/io/file_io.py in _prepare_value(self, val)
96       return compat.as_bytes(val)
97     else:
---> 98       return compat.as_str_any(val)
99 
100   def size(self):
/opt/conda/anaconda/lib/python3.7/site-packages/tensorflow_core/python/util/compat.py in as_str_any(value)
137   """
138   if isinstance(value, bytes):
--> 139     return as_str(value)
140   else:
141     return str(value)
/opt/conda/anaconda/lib/python3.7/site-packages/tensorflow_core/python/util/compat.py in as_text(bytes_or_text, encoding)
107     return bytes_or_text
108   elif isinstance(bytes_or_text, bytes):
--> 109     return bytes_or_text.decode(encoding)
110   else:
111     raise TypeError('Expected binary or unicode string, got %r' % bytes_or_text)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 2967: invalid start byte

However, the following works fine when the file is local:

import pandas as pd
filepath = 'data/DEMO_J.XPT'
df = pd.read_sas(filepath, format='xport', encoding='utf-8')
df.head(10)

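It looks like the code was never really updated for Python 3. You could try fixing the library by removing the .encode('utf-8') call, since you don't need it in Python 3. See: https://docs.python.org/3/library/stdtypes.html#binary-sequence-types-bytes-bytearray-memoryview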

Alternatively, you can use tensorflow instead of gcs-fuse:

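import pandas as pd
from tensorflow.python.lib.io import file_io

# 'rb' keeps the data as bytes; with 'r', FileIO tries to decode the file as UTF-8 text.
# Reading a bytes buffer needs a pandas version that does not call .encode() on it.
with file_io.FileIO('gs://my-bucket/.../DEMO_J.XPT', 'rb') as f:
    df = pd.read_sas(f, format='xport')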

This is pandas bug 33069. The SAS IO connector incorrectly assumes that all file buffers are opened in text mode.
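To make the text-mode assumption concrete, here is a small sketch that uses only in-memory buffers. The helper name to_bytes is hypothetical (it is not part of pandas); it only mirrors the idea of encoding the contents when they are text and leaving them alone when they are already bytes.

```python
import io


def to_bytes(contents, encoding="ISO-8859-1"):
    """Hypothetical helper: return contents as bytes, encoding only if it is text."""
    if isinstance(contents, bytes):
        # Already binary (buffer opened in binary mode): nothing to encode.
        return contents
    # Text (buffer opened in text mode): encode it back to bytes.
    return contents.encode(encoding)


text_contents = io.StringIO("HEADER RECORD").read()      # str, as from a text-mode buffer
binary_contents = io.BytesIO(b"HEADER \x80").read()      # bytes, as from a binary-mode buffer

print(hasattr(text_contents, "encode"))    # True: str has .encode()
print(hasattr(binary_contents, "encode"))  # False: bytes has no .encode()
print(to_bytes(text_contents), to_bytes(binary_contents))
```

Calling .encode() unconditionally on the second value is what raises the AttributeError shown in the question.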

I patched my local site-packages/pandas/io/sas/sas_xport.py with the following change and was able to read the DataFrame:

class XportReader(BaseIterator):
    __doc__ = _xport_reader_doc

    def __init__(
        self, filepath_or_buffer, index=None, encoding="ISO-8859-1", chunksize=None
    ):
        self._encoding = encoding
        self._lines_read = 0
        self._index = index
        self._chunksize = chunksize

        if isinstance(filepath_or_buffer, str):
            (
                filepath_or_buffer,
                encoding,
                compression,
                should_close,
            ) = get_filepath_or_buffer(filepath_or_buffer, encoding=encoding)

        if isinstance(filepath_or_buffer, (str, bytes)):
            self.filepath_or_buffer = open(filepath_or_buffer, "rb")
        else:
            # Copy to BytesIO, and ensure no encoding
            contents = filepath_or_buffer.read()
            try:
                # NEW LINE HERE: Don't convert to binary if it's already bytes.
                if hasattr(contents, "encode"):
                    contents = contents.encode(self._encoding)
            except UnicodeEncodeError:
                pass
            self.filepath_or_buffer = BytesIO(contents)

        self._read_header()

PR 33070 is pending and fixes this issue. Once it is merged and pandas 1.1.0 is released, the manual patch will no longer be needed.
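If you would rather not patch pandas in the meantime, one possible workaround builds on the fact that reading from a local path already works in the question: copy the GCS object to a temporary local file and pass that path to pd.read_sas. The sketch below is only an illustration; copy_to_local is a hypothetical helper name, and the gcsfs usage is left commented out because it needs credentials and network access.

```python
import shutil
import tempfile


def copy_to_local(src, suffix=".XPT"):
    """Hypothetical helper: copy a binary file-like object to a temp file, return its path."""
    # delete=False keeps the file on disk after the handle is closed,
    # so pandas can reopen it by path; the caller should remove it afterwards.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        shutil.copyfileobj(src, tmp)
        return tmp.name


# Usage sketch (assumes gcsfs and pandas are installed and the bucket is readable):
# import gcsfs
# import pandas as pd
# fs = gcsfs.GCSFileSystem(project='my-project')
# with fs.open('my-bucket/sas_sample/Nchs/Nhanes/2017-2018/DEMO_J.XPT', 'rb') as f:
#     local_path = copy_to_local(f)
# df = pd.read_sas(local_path, format='xport')
```

Because pandas then opens the file itself by path, this sidesteps the code branch that calls .encode() on buffer contents.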
