Spark with Hive: Unable to instantiate SparkSession with Hive support because Hive classes are not found



The Spark application is supposed to load data from Hive:

SparkSession spark = SparkSession.builder()
        .appName(topics)
        .config("hive.metastore.uris", "thrift://device1:9083")
        .enableHiveSupport()
        .getOrCreate();

I launch the job with spark-submit:

spark-submit --master local[*] --class zhihu.SparkConsumer target/original-kafka-consumer-0.1-SNAPSHOT.jar  --jars spark-hive_2.11-2.4.4.jar

Maven pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.zhihu</groupId>
    <artifactId>kafka-consumer</artifactId>
    <packaging>jar</packaging>
    <version>0.1-SNAPSHOT</version>
    <name>kafkadev</name>
    <url>http://maven.apache.org</url>

    <repositories>
        <repository>
            <!-- Proper URL for Cloudera maven artifactory -->
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
    </repositories>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-core -->
        <!-- https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-api -->
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-api</artifactId>
            <version>2.8.2</version>
        </dependency>
        <dependency> <!-- Spark dependency -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.4.4</version>
            <scope>compile</scope>
        </dependency>
        <dependency> <!-- Spark dependency -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>2.4.4</version>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.4.4</version>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
            <version>2.4.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>2.1.0</version>
            <exclusions>
                <exclusion>
                    <groupId>org.apache.logging.log4j</groupId>
                    <artifactId>log4j-core</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.apache.log4j</groupId>
                    <artifactId>log4j-core</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>log4j</groupId>
                    <artifactId>log4j</artifactId>
                </exclusion>
            </exclusions>
            <scope>compile</scope>
        </dependency>
        <!-- gson -->
        <dependency>
            <groupId>com.google.code.gson</groupId>
            <artifactId>gson</artifactId>
            <version>2.8.2</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-metastore</artifactId>
            <version>2.1.1-cdh6.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-service</artifactId>
            <version>2.1.1-cdh6.2.0</version>
        </dependency>
        <!-- runtime Hive -->
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-common</artifactId>
            <version>2.1.1-cdh6.2.0</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-beeline</artifactId>
            <version>2.1.1-cdh6.2.0</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-jdbc</artifactId>
            <version>2.1.1-cdh6.2.0</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-shims</artifactId>
            <version>2.1.1-cdh6.2.0</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>2.1.1-cdh6.2.0</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-serde</artifactId>
            <version>2.1.1-cdh6.2.0</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-contrib</artifactId>
            <version>2.1.1-cdh6.2.0</version>
            <scope>runtime</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.7.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.2.1</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>**/Log4j2Plugins.dat</exclude>
                                    </excludes>
                                </filter>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <artifactSet>
                                <excludes>
                                    <exclude>classworlds:classworlds</exclude>
                                    <exclude>junit:junit</exclude>
                                    <exclude>jmock:*</exclude>
                                    <exclude>*:xml-apis</exclude>
                                    <exclude>org.apache.maven:lib:tests</exclude>
                                </excludes>
                            </artifactSet>
                            <skip>true</skip>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

It looks fine to me, but it always throws:

20/05/07 12:03:17 INFO spark.SparkContext: Added JAR file:/data/projects/zhihu_scraper/consumers/target/original-kafka-consumer-0.1-SNAPSHOT.jar at spark://device2:42395/jars/original-kafka-consumer-0.1-SNAPSHOT.jar with timestamp 1588824197724
20/05/07 12:03:17 INFO executor.Executor: Starting executor ID driver on host localhost
20/05/07 12:03:17 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 33849.
20/05/07 12:03:17 INFO netty.NettyBlockTransferService: Server created on device2:33849
20/05/07 12:03:17 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/05/07 12:03:17 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, device2, 33849, None)
20/05/07 12:03:17 INFO storage.BlockManagerMasterEndpoint: Registering block manager device2:33849 with 366.3 MB RAM, BlockManagerId(driver, device2, 33849, None)
20/05/07 12:03:17 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, device2, 33849, None)
20/05/07 12:03:17 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, device2, 33849, None)
20/05/07 12:03:17 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@63e5e5b4{/metrics/json,null,AVAILABLE,@Spark}
Exception in thread "main" java.lang.IllegalArgumentException: Unable to instantiate SparkSession with Hive support because Hive classes are not found.
at org.apache.spark.sql.SparkSession$Builder.enableHiveSupport(SparkSession.scala:869)
at zhihu.SparkConsumer.main(SparkConsumer.java:72)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/05/07 12:03:18 INFO spark.SparkContext: Invoking stop() from shutdown hook

I have already tried every answer in this post: How to create SparkSession with Hive support. None of them works for me.

<dependency> <!-- Spark dependency -->
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>2.4.4</version>
    <scope>compile</scope>
</dependency>

I don't know why the scope here is compile; it should be runtime. Since you are using the maven-shade-plugin, you can package an uber jar with all dependencies (with target/original-kafka-consumer-0.1-SNAPSHOT.jar) into a single umbrella archive, and it will be on the classpath so that nothing is missed.
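
A minimal sketch of that build-and-submit workflow, assuming the shade plugin actually runs during the package phase (note that maven-shade-plugin writes the shaded jar under the plain artifact name and keeps the unshaded one prefixed with original-):

# Build the uber jar; the shade goal is bound to the package phase.
mvn clean package

# Submit the *shaded* jar: target/original-kafka-consumer-0.1-SNAPSHOT.jar
# is the unshaded jar, while target/kafka-consumer-0.1-SNAPSHOT.jar is the
# one that bundles spark-hive. Also note that any --jars option must come
# before the application jar; everything after the jar is passed to main()
# as program arguments.
spark-submit --master local[*] \
  --class zhihu.SparkConsumer \
  target/kafka-consumer-0.1-SNAPSHOT.jar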

Also, hive-site.xml should be on the classpath; then there is no need to configure the metastore separately in a programmatic way.
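
A minimal sketch of what the builder reduces to in that setup, assuming hive-site.xml sits in $SPARK_HOME/conf (or elsewhere on the driver classpath) and already points at thrift://device1:9083; the app name below is a placeholder:

import org.apache.spark.sql.SparkSession;

public class SparkConsumer {
    public static void main(String[] args) {
        // hive-site.xml supplies hive.metastore.uris, so no .config(...) call is needed
        SparkSession spark = SparkSession.builder()
                .appName("zhihu-consumer")      // placeholder; the question uses the topic list
                .enableHiveSupport()            // still requires the spark-hive classes on the classpath
                .getOrCreate();

        spark.sql("SHOW DATABASES").show();     // quick smoke test against the metastore
        spark.stop();
    }
}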

Further reading

My pom (via MvnRepository.com) looks like this:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.13</artifactId>
    <version>3.3.2</version>
    <scope>provided</scope>
</dependency>

Taking <scope>provided</scope> out of it fixed this for me.
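
That is, the dependency simply becomes:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.13</artifactId>
    <version>3.3.2</version>
</dependency>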
