pyspark-hive error when spark-submit app to yarn and remote CDH kerberized env



Error

airflow@41166b660d82:~$ spark-submit --master yarn --deploy-mode cluster --keytab keytab_name.keytab --principal keytab_name@REALM --jars /path/to/spark-hive_2.11-2.3.0.jar sranje.py

Executed from an Airflow Docker container that is not part of the CDH environment (not managed by CDH CM). sranje.py is a simple select * from a Hive table.

The application is accepted on CDH YARN and runs two attempts, both failing with the following error:

...
2020-12-31 10:11:43 INFO  StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
Traceback (most recent call last):
File "sranje.py", line 21, in <module>
source_df = hiveContext.table(hive_source).na.fill("")
File "/dfs/dn4/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0150/container_e29_1608187067076_0150_02_000001/pyspark.zip/pyspark/sql/context.py", line 366, in table
File "/dfs/dn4/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0150/container_e29_1608187067076_0150_02_000001/pyspark.zip/pyspark/sql/session.py", line 721, in table
File "/dfs/dn4/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0150/container_e29_1608187067076_0150_02_000001/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/dfs/dn4/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0150/container_e29_1608187067076_0150_02_000001/pyspark.zip/pyspark/sql/utils.py", line 79, in deco
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':"
2020-12-31 10:11:43 ERROR ApplicationMaster:70 - User application exited with status 1
2020-12-31 10:11:43 INFO  ApplicationMaster:54 - Final app status: FAILED, exitCode: 1, (reason: User application exited with status 1)
...

We assume that some .jar and Java dependencies are missing. Any ideas?
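
One way to check whether the Hive classes are even visible to the driver is to probe the JVM through py4j. This is only a diagnostic sketch that relies on PySpark's private _jvm gateway; the two class names are the standard Spark/Hive entry points, everything else is illustrative:

# Diagnostic sketch only: probes the driver JVM for the Hive-related classes.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
jvm = spark._jvm  # py4j view of the driver JVM (private API)

for cls in ["org.apache.spark.sql.hive.HiveExternalCatalog",
            "org.apache.hadoop.hive.conf.HiveConf"]:
    try:
        jvm.java.lang.Class.forName(cls)
        print("{} -> found".format(cls))
    except Exception as err:
        print("{} -> NOT found ({})".format(cls, err.__class__.__name__))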

Details

  1. A valid krb ticket exists before executing the spark cmd
  2. If we omit --jars /path/to/spark-hive_2.11-2.3.0.jar, the Python error is different:
...
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"
...
  • The versions of spark (2.3.0), hadoop (2.6.0) and java are the same as on CDH
  • hive-site.xml, yarn-site.xml, etc. are also provided (a variant that does not rely on hive-site.xml being picked up is sketched after the code samples below)
  • The same spark-submit application executes OK from a node inside the CDH cluster
  • We tried adding extra --jars: spark-hive_2.11-2.3.0.jar, spark-core_2.11-2.3.0.jar, spark-sql_2.11-2.3.0.jar, hive-hcatalog-core-2.3.0.jar, spark-hive-thriftserver_2.11-2.3.0.jar
  • Developers use the following code as an example:
    # -*- coding: utf-8 -*-
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
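    # NOTE: reload(sys)/sys.setdefaultencoding() exist only on Python 2; under Python 3 this fails before Spark is involved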
    from pyspark.context import SparkContext
    from pyspark.sql import SparkSession, SQLContext, HiveContext, functions as F
    from pyspark.sql.utils import AnalysisException
    from datetime import datetime
    sc = SparkContext.getOrCreate()
    spark = SparkSession(sc)
    sqlContext = SQLContext(sc)
    hiveContext = HiveContext(sc)
    current_date = str(datetime.now().strftime('%Y-%m-%d'))
    hive_source = "lnz_ch.lnz_cfg_codebook"
    source_df = hiveContext.table(hive_source).na.fill("")
    print("Number of records: {}".format(source_df.count()))
    print("First 20 rows of the table:")
    source_df.show(20)
    
    1. A different script, same error:
    # -*- coding: utf-8 -*-
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
    from pyspark.sql.types import *
    from pyspark.sql.functions import *
    from pyspark.sql import SparkSession
    if __name__ == "__main__":
        spark = SparkSession.builder.appName("ZekoTest").enableHiveSupport().getOrCreate()
        data = spark.sql("SELECT * FROM lnz_ch.lnz_cfg_codebook")
        data.show(20)
        spark.close()
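
    If hive-site.xml were not actually being picked up inside the container, one variant (not tried above; the thrift URI below is a placeholder for the CDH metastore host, not a real value) would be to point the session at the metastore explicitly:

    # Sketch only: hive.metastore.uris is a placeholder, not the real CDH host.
    from pyspark.sql import SparkSession
    spark = (SparkSession.builder
             .appName("ZekoTest")
             .config("hive.metastore.uris", "thrift://<cdh-metastore-host>:9083")
             .enableHiveSupport()
             .getOrCreate())
    spark.sql("SELECT * FROM lnz_ch.lnz_cfg_codebook").show(20)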
    

    Thank you.

    Hive dependencies were resolved by

    • downloading hive.tar.gz with the exact version of CDH Hive
    • creating symlinks from hive/lib/ into spark/jars/: ln -s apache-hive-1.1.0-bin/lib/*.jar spark-2.3.0-bin-without-hadoop/jars/
    • downloading additional jars from the Maven repo into spark/jars/:
    hive-hcatalog-core-2.3.0.jar
    slf4j-api-1.7.26.jar
    spark-hive_2.11-2.3.0.jar
    spark-hive-thriftserver_2.11-2.3.0.jar
    
    • refreshing the env vars:
    HADOOP_CLASSPATH=$(find $HADOOP_HOME -name '*.jar' | xargs echo | tr ' ' ':')
    SPARK_DIST_CLASSPATH=$(hadoop classpath)
    

    Beeline works, but pyspark throws an error:

    2021-01-07 15:02:20 INFO  StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
    Traceback (most recent call last):
    File "sranje.py", line 21, in <module>
    source_df = hiveContext.table(hive_source).na.fill("")
    File "/dfs/dn12/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0207/container_e29_1608187067076_0207_01_000001/pyspark.zip/pyspark/sql/context.py", line 366, in table
    File "/dfs/dn12/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0207/container_e29_1608187067076_0207_01_000001/pyspark.zip/pyspark/sql/session.py", line 721, in table
    File "/dfs/dn12/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0207/container_e29_1608187067076_0207_01_000001/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
    File "/dfs/dn12/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0207/container_e29_1608187067076_0207_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    File "/dfs/dn12/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0207/container_e29_1608187067076_0207_01_000001/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling o31.table.
    : java.lang.NoSuchFieldError: METASTORE_CLIENT_SOCKET_LIFETIME
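
    For context, a NoSuchFieldError on METASTORE_CLIENT_SOCKET_LIFETIME usually points to mixed Hive versions on the classpath (the field is missing from Hive 1.1.0's HiveConf, while Spark 2.3's Hive client code expects it). A hedged sketch of isolating the metastore client version through Spark's standard configs; the version and path values are illustrative and not verified against this cluster:

    # Sketch only: version/path values are illustrative; per the Spark docs the
    # spark.sql.hive.metastore.jars classpath must also contain the Hadoop client jars.
    from pyspark.sql import SparkSession
    spark = (SparkSession.builder
             .appName("ZekoTest")
             # talk to a 1.1.0 metastore...
             .config("spark.sql.hive.metastore.version", "1.1.0")
             # ...through an isolated classpath of matching Hive jars, instead of
             # whatever Hive version happens to come first on spark/jars/
             .config("spark.sql.hive.metastore.jars", "/path/to/apache-hive-1.1.0-bin/lib/*")
             .enableHiveSupport()
             .getOrCreate())
    spark.sql("SELECT * FROM lnz_ch.lnz_cfg_codebook").show(20)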
    

    That, however, is a separate question. Thanks all.
