Accessing HDInsight Hive from Python



We have an HDInsight cluster with some tables in Hive. I want to query those tables with Python 3.6 from a client machine (outside Azure).

I have tried PyHive, pyhs2, and impyla, but I ran into various problems with each of them.

Does anyone have a working example of accessing HDInsight Hive from Python?

I have very little experience with this and don't know how to configure PyHive (which seems the most promising), especially when it comes to authorization.

impyla

from impala.dbapi import connect
conn = connect(host='redacted.azurehdinsight.net',port=443)
cursor = conn.cursor()
cursor.execute('SELECT * FROM cs_test LIMIT 100')
print(cursor.description)  # prints the result set's schema
results = cursor.fetchall()

This gives:

Traceback (most recent call last):
  File "C:/git/ml-notebooks/impyla.py", line 3, in <module>
    cursor = conn.cursor()
  File "C:\Users\chris\Anaconda3\lib\site-packages\impala\hiveserver2.py", line 125, in cursor
    session = self.service.open_session(user, configuration)
  File "C:\Users\chris\Anaconda3\lib\site-packages\impala\hiveserver2.py", line 995, in open_session
    resp = self._rpc('OpenSession', req)
  File "C:\Users\chris\Anaconda3\lib\site-packages\impala\hiveserver2.py", line 923, in _rpc
    response = self._execute(func_name, request)
  File "C:\Users\chris\Anaconda3\lib\site-packages\impala\hiveserver2.py", line 954, in _execute
    .format(self.retries))
impala.error.HiveServer2Error: Failed after retrying 3 times

Pyhive

from pyhive import hive
conn = hive.connect(host="redacted.azurehdinsight.net",port=443,auth="NOSASL")
#also tried other auth-types, but as i said, i have no clue here

This gives:

Traceback (most recent call last):
  File "C:/git/ml-notebooks/PythonToHive.py", line 3, in <module>
    conn = hive.connect(host="redacted.azurehdinsight.net",port=443,auth="NOSASL")
  File "C:\Users\chris\Anaconda3\lib\site-packages\pyhive\hive.py", line 64, in connect
    return Connection(*args, **kwargs)
  File "C:\Users\chris\Anaconda3\lib\site-packages\pyhive\hive.py", line 164, in __init__
    response = self._client.OpenSession(open_session_req)
  File "C:\Users\chris\Anaconda3\lib\site-packages\TCLIService\TCLIService.py", line 187, in OpenSession
    return self.recv_OpenSession()
  File "C:\Users\chris\Anaconda3\lib\site-packages\TCLIService\TCLIService.py", line 199, in recv_OpenSession
    (fname, mtype, rseqid) = iprot.readMessageBegin()
  File "C:\Users\chris\Anaconda3\lib\site-packages\thrift\protocol\TBinaryProtocol.py", line 134, in readMessageBegin
    sz = self.readI32()
  File "C:\Users\chris\Anaconda3\lib\site-packages\thrift\protocol\TBinaryProtocol.py", line 217, in readI32
    buff = self.trans.readAll(4)
  File "C:\Users\chris\Anaconda3\lib\site-packages\thrift\transport\TTransport.py", line 60, in readAll
    chunk = self.read(sz - have)
  File "C:\Users\chris\Anaconda3\lib\site-packages\thrift\transport\TTransport.py", line 161, in read
    self.__rbuf = BufferIO(self.__trans.read(max(sz, self.__rbuf_size)))
  File "C:\Users\chris\Anaconda3\lib\site-packages\thrift\transport\TSocket.py", line 117, in read
    buff = self.handle.recv(sz)
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host

According to the official document Understanding and resolving errors received from WebHCat on HDInsight, it says the following.

What is WebHCat

WebHCat is a REST API for HCatalog, a table and storage management layer for Hadoop. WebHCat is enabled by default on HDInsight clusters, and is used by various tools to submit jobs, get job status, etc., without logging in to the cluster.

So a workaround is to use WebHCat to run Hive QL from Python; please refer to the Hive documentation to learn about and use it. As a reference, there is a similar MSDN thread that discusses it.
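As a rough sketch of that workaround (the cluster name, credentials, and status directory below are placeholders, not values from this question), a Hive job submission to WebHCat's Templeton endpoint can be prepared like this:

```python
# Build the URL and form payload for submitting a Hive query to
# HDInsight's WebHCat (Templeton) REST endpoint. All concrete values
# here are placeholders.

def build_webhcat_request(cluster, username, query, statusdir):
    """Return the Templeton URL and form payload for a Hive job."""
    url = "https://{0}.azurehdinsight.net/templeton/v1/hive".format(cluster)
    payload = {
        "user.name": username,   # user the job runs as
        "execute": query,        # the HiveQL statement to execute
        "statusdir": statusdir,  # storage directory for the job's output
    }
    return url, payload

url, payload = build_webhcat_request(
    "mycluster", "admin", "SELECT * FROM cs_test LIMIT 100", "/example/status")
# With an HTTP library such as requests (not imported here), the call
# would roughly be:
# resp = requests.post(url, auth=("admin", "<password>"), data=payload)
# job_id = resp.json()["id"]  # then poll /templeton/v1/jobs/<job_id>
```

The actual POST and the job-status polling are left commented out because they need a live cluster and HTTPS basic-auth credentials.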

Hope it helps.

Technically you should be able to use the Thrift connector and pyhive, but I haven't had any success with that. However, I have successfully used the JDBC connector with JayDeBeApi.

First, you need to download the JDBC driver jars:

  • http://central.maven.org/maven2/org/apache/hive/hive-jdbc/1.2.1/hive-jdbc-1.2.1-standalone.jar
  • http://repo1.maven.org/maven2/org/apache/httpcomponents/httpclient/4.4/httpclient-4.4.jar
  • http://central.maven.org/maven2/org/apache/httpcomponents/httpcore/4.4.4/httpcore-4.4.4.jar

I put mine in /jdbc and used JayDeBeApi with the following connection string.

Edit: You need to add /jdbc/* to the CLASSPATH environment variable.
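One way to do that from Python itself is to assemble the variable from the jar directory before JayDeBeApi starts the JVM (a sketch; /jdbc is the directory used in this answer, and the helper name is my own):

```python
import glob
import os

def build_classpath(jar_dir):
    """Join every .jar in jar_dir into a CLASSPATH-style string."""
    return os.pathsep.join(sorted(glob.glob(os.path.join(jar_dir, "*.jar"))))

# This must happen before jaydebeapi.connect() launches the JVM,
# because the JVM reads CLASSPATH only once at startup.
os.environ["CLASSPATH"] = build_classpath("/jdbc")
```

Setting the variable in the shell (or in the system environment on Windows) before launching Python works just as well.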

import jaydebeapi
conn = jaydebeapi.connect(
    "org.apache.hive.jdbc.HiveDriver",
    "jdbc:hive2://my_ip_or_url:443/;ssl=true;transportMode=http;httpPath=/hive2",
    [username, password],  # the cluster login credentials
    "/jdbc/hive-jdbc-1.2.1.jar")
