Trying to read ORC into a Dask dataframe



I have ORC files in S3 that I want to read into a Dask dataframe. I am using conda to get a Python 3.7 virtual environment, and Dask is installed. My environment looks like this:

# Name                    Version                   Build  Channel
appnope                   0.1.0                    py37_0  
backcall                  0.1.0                    py37_0  
blas                      1.0                         mkl  
bokeh                     2.0.2                    py37_0  
ca-certificates           2020.1.1                      0  
certifi                   2020.4.5.1               py37_0  
click                     7.1.2                      py_0  
cloudpickle               1.4.1                      py_0  
cytoolz                   0.10.1           py37h1de35cc_0  
dask                      2.17.0                     py_0  
dask-core                 2.17.0                     py_0  
decorator                 4.4.2                      py_0  
distributed               2.17.0                   py37_0  
entrypoints               0.3                      py37_0  
freetype                  2.9.1                hb4e5f40_0  
fsspec                    0.7.1                      py_0  
heapdict                  1.0.1                      py_0  
intel-openmp              2019.4                      233  
ipykernel                 5.1.4            py37h39e3cac_0  
ipython                   7.13.0           py37h5ca1d4c_0  
ipython_genutils          0.2.0                    py37_0  
jedi                      0.17.0                   py37_0  
jinja2                    2.11.2                     py_0  
jpeg                      9b                   he5867d9_2  
jupyter_client            6.1.3                      py_0  
jupyter_core              4.6.3                    py37_0  
libcxx                    10.0.0                        1  
libedit                   3.1.20181209         hb402a30_0  
libffi                    3.3                  h0a44026_1  
libgfortran               3.0.1                h93005f0_2  
libpng                    1.6.37               ha441bb4_0  
libsodium                 1.0.16               h3efe00b_0  
libtiff                   4.1.0                hcb84e12_0  
locket                    0.2.0                    py37_1  
markupsafe                1.1.1            py37h1de35cc_0  
mkl                       2019.4                      233  
mkl-service               2.3.0            py37hfbe908c_0  
mkl_fft                   1.0.15           py37h5e564d8_0  
mkl_random                1.1.1            py37h959d312_0  
msgpack-python            1.0.0            py37h04f5b5a_1  
ncurses                   6.2                  h0a44026_1  
numpy                     1.18.1           py37h7241aed_0  
numpy-base                1.18.1           py37h3304bdc_1  
olefile                   0.46                       py_0  
openssl                   1.1.1g               h1de35cc_0  
packaging                 20.3                       py_0  
pandas                    1.0.3            py37h6c726b0_0  
parso                     0.7.0                      py_0  
partd                     1.1.0                      py_0  
pexpect                   4.8.0                    py37_0  
pickleshare               0.7.5                    py37_0  
pillow                    7.1.2            py37h4655f20_0  
pip                       20.0.2                   py37_3  
prompt-toolkit            3.0.4                      py_0  
prompt_toolkit            3.0.4                         0  
psutil                    5.7.0            py37h1de35cc_0  
ptyprocess                0.6.0                    py37_0  
pyarrow                   0.17.1                   pypi_0    pypi
pygments                  2.6.1                      py_0  
pyparsing                 2.4.7                      py_0  
python                    3.7.7                hf48f09d_4  
python-dateutil           2.8.1                      py_0  
pytz                      2020.1                     py_0  
pyyaml                    5.3.1            py37h1de35cc_0  
pyzmq                     18.1.1           py37h0a44026_0  
readline                  8.0                  h1de35cc_0  
setuptools                46.4.0                   py37_0  
six                       1.14.0                   py37_0  
sortedcontainers          2.1.0                    py37_0  
sqlite                    3.31.1               h5c1f38d_1  
tblib                     1.6.0                      py_0  
tk                        8.6.8                ha441bb4_0  
toolz                     0.10.0                     py_0  
tornado                   6.0.4            py37h1de35cc_1  
traitlets                 4.3.3                    py37_0  
typing_extensions         3.7.4.1                  py37_0  
wcwidth                   0.1.9                      py_0  
wheel                     0.34.2                   py37_0  
xz                        5.2.5                h1de35cc_0  
yaml                      0.1.7                hc338f04_2  
zeromq                    4.3.1                h0a44026_3  
zict                      2.0.0                      py_0  
zlib                      1.2.11               h1de35cc_3  
zstd                      1.3.7                h5bba6e5_0 

This is what I tried:

import dask.dataframe as dd
orders_path = "s3://bucketname/folder/ord_files_dir/"
orders = dd.read_orc(orders_path)

But I get this error:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
/anaconda3/envs/dask_env/lib/python3.7/site-packages/dask/utils.py in import_required(mod_name, error_msg)
96     try:
---> 97         return import_module(mod_name)
98     except ImportError:
/anaconda3/envs/dask_env/lib/python3.7/importlib/__init__.py in import_module(name, package)
126             level += 1
--> 127     return _bootstrap._gcd_import(name[level:], package, level)
128 
/anaconda3/envs/dask_env/lib/python3.7/importlib/_bootstrap.py in _gcd_import(name, package, level)
/anaconda3/envs/dask_env/lib/python3.7/importlib/_bootstrap.py in _find_and_load(name, import_)
/anaconda3/envs/dask_env/lib/python3.7/importlib/_bootstrap.py in _find_and_load_unlocked(name, import_)
/anaconda3/envs/dask_env/lib/python3.7/importlib/_bootstrap.py in _load_unlocked(spec)
/anaconda3/envs/dask_env/lib/python3.7/importlib/_bootstrap_external.py in exec_module(self, module)
/anaconda3/envs/dask_env/lib/python3.7/importlib/_bootstrap.py in _call_with_frames_removed(f, *args, **kwds)
/anaconda3/envs/dask_env/lib/python3.7/site-packages/pyarrow/orc.py in <module>
23 from pyarrow.lib import Schema
---> 24 import pyarrow._orc as _orc
25 
ModuleNotFoundError: No module named 'pyarrow._orc'
During handling of the above exception, another exception occurred:
RuntimeError                              Traceback (most recent call last)
<ipython-input-3-67de491f90db> in <module>
----> 1 orders = dd.read_orc(orders_path)
/anaconda3/envs/dask_env/lib/python3.7/site-packages/dask/dataframe/io/orc.py in read_orc(path, columns, storage_options)
46     ...                  'master/examples/demo-11-zlib.orc')  # doctest: +SKIP
47     """
---> 48     orc = import_required("pyarrow.orc", "Please install pyarrow >= 0.9.0")
49     import pyarrow as pa
50 
/anaconda3/envs/dask_env/lib/python3.7/site-packages/dask/utils.py in import_required(mod_name, error_msg)
97         return import_module(mod_name)
98     except ImportError:
---> 99         raise RuntimeError(error_msg)
100 
101 
RuntimeError: Please install pyarrow >= 0.9.0

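As far as I can tell, I am using supported versions of everything involved: python=3.7 and pyarrow>=0.9.0 (0.17.1, installed from PyPI).

Any suggestions on what to try next would be great!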

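Pasting the conversation from the dask/dev Gitter channel (thanks, @uwe-l-korn):

For pyarrow installed with pip,

the ORC build is disabled in the wheels because of linking issues:

https://github.com/apache/arrow/blob/f79a38169bd2e29b0dc2f27cf0006b9fec613774/python/manylinux201x/build_arrow.sh#L46-L48

This could be fixed by updating the ORC and protobuf versions in those scripts, but it needs a volunteer to look into it.

So the simplest solution to this problem is to install pyarrow with conda.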
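
To confirm whether a given environment is affected, it may help to test the import that dd.read_orc performs, since the "Please install pyarrow >= 0.9.0" message hides the real cause (the missing pyarrow._orc extension). The snippet below is only a minimal sketch: module_available is a hypothetical helper name, not part of Dask or pyarrow. If the check fails, reinstalling along the lines of "pip uninstall pyarrow" followed by "conda install -c conda-forge pyarrow" is one option; the conda-forge channel is my assumption, as the conversation above only says to use conda.

```python
import importlib


def module_available(module_name):
    """Return True if module_name imports cleanly, False on ImportError."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError (seen in the traceback above) subclasses ImportError.
        return False
    return True


if __name__ == "__main__":
    # pyarrow.orc is the module dd.read_orc imports in the traceback above.
    if module_available("pyarrow.orc"):
        print("pyarrow.orc is importable; dd.read_orc should get past the import step")
    else:
        print("pyarrow.orc is missing; try a conda build of pyarrow instead of the pip wheel")
```

Running this in the same environment as the notebook kernel matters, because the pip-installed pyarrow shown in the package list is the one that lacks ORC support.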
