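I am currently migrating from the old Arrow filesystem interface:
http://arrow.apache.org/docs/python/filesystems_deprecated.html
to the new filesystem interface:
http://arrow.apache.org/docs/python/filesystems.html
I am trying to connect to HDFS using fs.HadoopFileSystem, as follows: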
from pyarrow import fs
import os
os.environ['HADOOP_HOME'] = '/usr/hdp/current/hadoop-client'
os.environ['JAVA_HOME'] = '/opt/jdk8'
os.environ['ARROW_LIBHDFS_DIR'] = '/usr/lib/ams-hbase/lib/hadoop-native'
fs.HadoopFileSystem("hdfs://namenode:8020?user=hdfsuser")
I tried different URI combinations, and also replaced the URI with fs.HdfsOptions:
connection_tuple = ("namenode", 8020)
fs.HadoopFileSystem(fs.HdfsOptions(connection_tuple, user="hdfsuser"))
All of the above give me the same error:
Environment variable CLASSPATH not set!
getJNIEnv: getGlobalJNIEnv failed
Environment variable CLASSPATH not set!
getJNIEnv: getGlobalJNIEnv failed
/arrow/cpp/src/arrow/filesystem/hdfs.cc:56: Failed to disconnect hdfs client: IOError: HDFS hdfsFS::Disconnect failed, errno: 255 (Unknown error 255) Please check that you are connecting to the correct HDFS RPC port
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/_hdfs.pyx", line 180, in pyarrow._hdfs.HadoopFileSystem.__init__
File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: HDFS connection failed
There is not much documentation yet because this feature is quite new, so I hope I can get some answers here.
Cheers
Set the CLASSPATH environment variable to the HDFS classpath:
export CLASSPATH=`$HADOOP_HOME/bin/hdfs classpath --glob`
Locate the bin directory that contains the hdfs executable in order to set this variable.
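If exporting the variable in the shell is inconvenient, the same thing can probably be done from inside Python before the connection is created. The following is a minimal sketch, not a confirmed recipe: the helper names `build_hadoop_classpath` and `set_classpath` are hypothetical, and it assumes the `hdfs` executable lives under `$HADOOP_HOME/bin`, as in the export command above.

```python
import os
import subprocess


def build_hadoop_classpath(hadoop_home, run=subprocess.check_output):
    """Return the output of `hdfs classpath --glob` as a string.

    `run` is injectable so the helper can be exercised without a
    Hadoop installation; by default it shells out to the hdfs binary.
    """
    # Assumption: the hdfs executable is at <hadoop_home>/bin/hdfs.
    hdfs_bin = os.path.join(hadoop_home, "bin", "hdfs")
    output = run([hdfs_bin, "classpath", "--glob"])
    return output.decode("utf-8").strip()


def set_classpath(hadoop_home, env=os.environ, run=subprocess.check_output):
    """Store the Hadoop classpath in env['CLASSPATH'] and return it."""
    env["CLASSPATH"] = build_hadoop_classpath(hadoop_home, run=run)
    return env["CLASSPATH"]


# Intended usage (requires a working Hadoop client and pyarrow):
#
#   from pyarrow import fs
#   set_classpath(os.environ["HADOOP_HOME"])
#   hdfs = fs.HadoopFileSystem("namenode", 8020, user="hdfsuser")
```

The classpath has to be set before `fs.HadoopFileSystem` is constructed, since the JVM reads `CLASSPATH` when it is first started by libhdfs.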