How to read ORC files without a metastore in pyspark 2.0



I want to read some ORC files with pyspark 2.0 without having a metastore. In theory this should be possible, since the schema is embedded in the ORC files themselves. But here is what I get:

[me@hostname ~]$ /usr/local/spark-2.0.0-bin-hadoop2.6/bin/pyspark
Python 2.7.11 (default, Feb 18 2016, 13:54:48)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Python version 2.7.11 (default, Feb 18 2016 13:54:48)
SparkSession available as 'spark'.
>>> df = spark.read.orc('/my/orc/file')
16/08/21 22:29:38 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/08/21 22:30:00 ERROR metastore.RetryingHMSHandler: AlreadyExistsException(message:Database default already exists)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database(HiveMetaStore.java:891)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
    at com.sun.proxy.$Proxy21.create_database(Unknown Source)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:644)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
    at com.sun.proxy.$Proxy22.createDatabase(Unknown Source)
    at org.apache.hadoop.hive.ql.metadata.Hive.createDatabase(Hive.java:306)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply$mcV$sp(HiveClientImpl.scala:291)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:291)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:291)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:262)
    at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:209)
    at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:208)
    at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:251)
    at org.apache.spark.sql.hive.client.HiveClientImpl.createDatabase(HiveClientImpl.scala:290)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply$mcV$sp(HiveExternalCatalog.scala:99)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:99)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:99)
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:72)
    at org.apache.spark.sql.hive.HiveExternalCatalog.createDatabase(HiveExternalCatalog.scala:98)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:147)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
    at org.apache.spark.sql.hive.HiveSessionCatalog.<init>(HiveSessionCatalog.scala:51)
    at org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:49)
    at org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
    at org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:63)
    at org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
    at org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
    at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:382)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:143)
    at org.apache.spark.sql.DataFrameReader.orc(DataFrameReader.scala:450)
    at org.apache.spark.sql.DataFrameReader.orc(DataFrameReader.scala:439)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:211)
    at java.lang.Thread.run(Thread.java:745)
>>>

What is the correct way to read an ORC file?

I solved the problem myself. Although pyspark reports an ERROR, loading the data from the ORC file into a DataFrame does not actually fail: despite the error message, the returned DataFrame can be used without any issues.
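For illustration, here is a minimal sketch of that pattern ('/my/orc/file' is the placeholder path from the question; the explicit session builder is only needed outside the pyspark shell, where 'spark' already exists):

from pyspark.sql import SparkSession

# In the pyspark shell a session is already available as 'spark';
# this builder call is only needed in a standalone script.
spark = SparkSession.builder.appName("read-orc-no-metastore").getOrCreate()

# The schema comes from the ORC files themselves, so no metastore is required.
# The AlreadyExistsException ERROR may still be logged; it is harmless.
df = spark.read.orc('/my/orc/file')

df.printSchema()  # schema embedded in the ORC files
df.show(5)        # the DataFrame is usable despite the logged error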
