I have an ORC file in S3 that I want to read into a Dask dataframe. I am using conda to get a Python 3.7 virtual environment, and Dask is installed. My environment looks like this:
# Name Version Build Channel
appnope 0.1.0 py37_0
backcall 0.1.0 py37_0
blas 1.0 mkl
bokeh 2.0.2 py37_0
ca-certificates 2020.1.1 0
certifi 2020.4.5.1 py37_0
click 7.1.2 py_0
cloudpickle 1.4.1 py_0
cytoolz 0.10.1 py37h1de35cc_0
dask 2.17.0 py_0
dask-core 2.17.0 py_0
decorator 4.4.2 py_0
distributed 2.17.0 py37_0
entrypoints 0.3 py37_0
freetype 2.9.1 hb4e5f40_0
fsspec 0.7.1 py_0
heapdict 1.0.1 py_0
intel-openmp 2019.4 233
ipykernel 5.1.4 py37h39e3cac_0
ipython 7.13.0 py37h5ca1d4c_0
ipython_genutils 0.2.0 py37_0
jedi 0.17.0 py37_0
jinja2 2.11.2 py_0
jpeg 9b he5867d9_2
jupyter_client 6.1.3 py_0
jupyter_core 4.6.3 py37_0
libcxx 10.0.0 1
libedit 3.1.20181209 hb402a30_0
libffi 3.3 h0a44026_1
libgfortran 3.0.1 h93005f0_2
libpng 1.6.37 ha441bb4_0
libsodium 1.0.16 h3efe00b_0
libtiff 4.1.0 hcb84e12_0
locket 0.2.0 py37_1
markupsafe 1.1.1 py37h1de35cc_0
mkl 2019.4 233
mkl-service 2.3.0 py37hfbe908c_0
mkl_fft 1.0.15 py37h5e564d8_0
mkl_random 1.1.1 py37h959d312_0
msgpack-python 1.0.0 py37h04f5b5a_1
ncurses 6.2 h0a44026_1
numpy 1.18.1 py37h7241aed_0
numpy-base 1.18.1 py37h3304bdc_1
olefile 0.46 py_0
openssl 1.1.1g h1de35cc_0
packaging 20.3 py_0
pandas 1.0.3 py37h6c726b0_0
parso 0.7.0 py_0
partd 1.1.0 py_0
pexpect 4.8.0 py37_0
pickleshare 0.7.5 py37_0
pillow 7.1.2 py37h4655f20_0
pip 20.0.2 py37_3
prompt-toolkit 3.0.4 py_0
prompt_toolkit 3.0.4 0
psutil 5.7.0 py37h1de35cc_0
ptyprocess 0.6.0 py37_0
pyarrow 0.17.1 pypi_0 pypi
pygments 2.6.1 py_0
pyparsing 2.4.7 py_0
python 3.7.7 hf48f09d_4
python-dateutil 2.8.1 py_0
pytz 2020.1 py_0
pyyaml 5.3.1 py37h1de35cc_0
pyzmq 18.1.1 py37h0a44026_0
readline 8.0 h1de35cc_0
setuptools 46.4.0 py37_0
six 1.14.0 py37_0
sortedcontainers 2.1.0 py37_0
sqlite 3.31.1 h5c1f38d_1
tblib 1.6.0 py_0
tk 8.6.8 ha441bb4_0
toolz 0.10.0 py_0
tornado 6.0.4 py37h1de35cc_1
traitlets 4.3.3 py37_0
typing_extensions 3.7.4.1 py37_0
wcwidth 0.1.9 py_0
wheel 0.34.2 py37_0
xz 5.2.5 h1de35cc_0
yaml 0.1.7 hc338f04_2
zeromq 4.3.1 h0a44026_3
zict 2.0.0 py_0
zlib 1.2.11 h1de35cc_3
zstd 1.3.7 h5bba6e5_0
I tried this:
import dask.dataframe as dd
orders_path = "s3://bucketname/folder/ord_files_dir/"
orders = dd.read_orc(orders_path)
But I get this error:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
/anaconda3/envs/dask_env/lib/python3.7/site-packages/dask/utils.py in import_required(mod_name, error_msg)
96 try:
---> 97 return import_module(mod_name)
98 except ImportError:
/anaconda3/envs/dask_env/lib/python3.7/importlib/__init__.py in import_module(name, package)
126 level += 1
--> 127 return _bootstrap._gcd_import(name[level:], package, level)
128
/anaconda3/envs/dask_env/lib/python3.7/importlib/_bootstrap.py in _gcd_import(name, package, level)
/anaconda3/envs/dask_env/lib/python3.7/importlib/_bootstrap.py in _find_and_load(name, import_)
/anaconda3/envs/dask_env/lib/python3.7/importlib/_bootstrap.py in _find_and_load_unlocked(name, import_)
/anaconda3/envs/dask_env/lib/python3.7/importlib/_bootstrap.py in _load_unlocked(spec)
/anaconda3/envs/dask_env/lib/python3.7/importlib/_bootstrap_external.py in exec_module(self, module)
/anaconda3/envs/dask_env/lib/python3.7/importlib/_bootstrap.py in _call_with_frames_removed(f, *args, **kwds)
/anaconda3/envs/dask_env/lib/python3.7/site-packages/pyarrow/orc.py in <module>
23 from pyarrow.lib import Schema
---> 24 import pyarrow._orc as _orc
25
ModuleNotFoundError: No module named 'pyarrow._orc'
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
<ipython-input-3-67de491f90db> in <module>
----> 1 orders = dd.read_orc(orders_path)
/anaconda3/envs/dask_env/lib/python3.7/site-packages/dask/dataframe/io/orc.py in read_orc(path, columns, storage_options)
46 ... 'master/examples/demo-11-zlib.orc') # doctest: +SKIP
47 """
---> 48 orc = import_required("pyarrow.orc", "Please install pyarrow >= 0.9.0")
49 import pyarrow as pa
50
/anaconda3/envs/dask_env/lib/python3.7/site-packages/dask/utils.py in import_required(mod_name, error_msg)
97 return import_module(mod_name)
98 except ImportError:
---> 99 raise RuntimeError(error_msg)
100
101
RuntimeError: Please install pyarrow >= 0.9.0
As far as I can tell, I am using supported versions of everything involved: python=3.7 and pyarrow>=0.9.0.
Any suggestions on what to try next would be great!
Pasting the conversation from the dask/dev gitter channel (thanks, @uwe-l-korn!):
For pyarrow installed with pip, the ORC build is disabled in the wheels due to linking problems:
https://github.com/apache/arrow/blob/f79a38169bd2e29b0dc2f27cf0006b9fec613774/python/manylinux201x/build_arrow.sh#L46-L48
This could be fixed by updating the ORC and protobuf versions in these scripts, but it needs a volunteer to look into it.
So the simplest solution to this problem is to install pyarrow with conda.
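Because the "Please install pyarrow >= 0.9.0" message appears even when a recent pyarrow is present, it can help to check directly whether the installed build ships the ORC module. The following is only a minimal sketch: the helper names `can_import` and `describe_orc_support` are made up for illustration and are not part of Dask or pyarrow.

```python
import importlib


def can_import(module_name):
    """Return True if `module_name` imports cleanly, False otherwise.

    Hypothetical helper for illustration; not a Dask or pyarrow API.
    """
    try:
        importlib.import_module(module_name)
    except ImportError:  # also covers ModuleNotFoundError
        return False
    return True


def describe_orc_support():
    """Summarise whether the installed pyarrow build can read ORC."""
    if not can_import("pyarrow"):
        return "pyarrow is not installed"
    import pyarrow

    # pyarrow.orc fails to import when the build lacks pyarrow._orc,
    # which is the ModuleNotFoundError shown in the traceback above.
    if can_import("pyarrow.orc"):
        return f"pyarrow {pyarrow.__version__}: ORC support available"
    return f"pyarrow {pyarrow.__version__}: built without ORC support"


print(describe_orc_support())
```

In the environment listed in the question, pyarrow 0.17.1 comes from the pypi channel, so this check would be expected to report a build without ORC support. After removing that copy and installing pyarrow with conda instead (for example from the conda-forge channel, assuming that build includes ORC on your platform), running the check again should confirm whether dd.read_orc can work.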