I'm trying to use the spark-redshift library, but I can't operate on the DataFrame created by the sqlContext.read() command (reading from Redshift).
Here is my code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._

// Register the Redshift JDBC driver
Class.forName("com.amazon.redshift.jdbc41.Driver")
val conf = new SparkConf().setAppName("Spark Application").setMaster("local[2]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
// S3 credentials for spark-redshift's temp directory
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "****")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "****")
val df: DataFrame = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://URL")
  .option("dbtable", "table")
  .option("tempdir", "s3n://bucket/folder")
  .load()
df.registerTempTable("table")
val data = sqlContext.sql("SELECT * FROM table")
data.show()
Here is the error I get when I run the code above in the main method of a Scala object:
Exception in thread "main" java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
at org.apache.spark.SparkContext.withScope(SparkContext.scala:709)
at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:1096)
at com.databricks.spark.redshift.RedshiftRelation.buildScan(RedshiftRelation.scala:116)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$3.apply(DataSourceStrategy.scala:53)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$3.apply(DataSourceStrategy.scala:53)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:279)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:278)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:310)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:274)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:49)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:374)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:926)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:924)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:930)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:930)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1314)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1377)
at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:178)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:401)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:362)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:370)
at com.triplelift.spark.Main$.main(Main.scala:37)
at com.triplelift.spark.Main.main(Main.scala)
In case it helps, here are my Gradle dependencies as well:
dependencies {
    compile(
        'com.amazonaws:aws-java-sdk:1.10.31',
        'com.amazonaws:aws-java-sdk-redshift:1.10.31',
        'org.apache.spark:spark-core_2.10:1.5.1',
        'org.apache.spark:spark-streaming_2.10:1.5.1',
        'org.apache.spark:spark-mllib_2.10:1.5.1',
        'org.apache.spark:spark-sql_2.10:1.5.1',
        'com.databricks:spark-redshift_2.10:0.5.2',
        'com.fasterxml.jackson.core:jackson-databind:2.6.3'
    )
    testCompile group: 'junit', name: 'junit', version: '4.11'
}
Needless to say, the error occurs when data.show() is evaluated.
On an unrelated note... does anyone using IntelliJ 14 know how to permanently add the Redshift driver to a module? Every time I refresh Gradle it gets removed from the dependencies in Project Structure. Strange.
The original problem was getting this error:
com.fasterxml.jackson.databind.JsonMappingException:
Could not find creator property with name 'id' (in class org.apache.spark.rdd.RDDOperationScope)
So I followed the answer here:
Spark Parallelize? (Could not find creator property with name 'id')
So I added the line 'com.fasterxml.jackson.core:jackson-databind:2.6.3' and cycled through different versions (e.g. 2.4.4), then started looking through my external libraries in the Project view... I removed the new jackson-databind dependency and checked which Jackson libraries Spark would load on its own... That's when I noticed that the Jackson libraries were all at 2.5.1 except jackson-module-scala_2.10, which was at 2.4.4. So instead of tampering with the jackson-databind dependency, I added the following:
compile 'com.fasterxml.jackson.module:jackson-module-scala_2.10:2.6.3'
Now my code works. It looks like spark-core 1.5.1 wasn't built correctly before being published to Maven? Not sure.
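If you'd rather not add the module dependency by hand, a rough alternative (just a sketch, and the versions below are examples, check what spark-core actually pulls in on your build) is to force every Jackson artifact to a single version via Gradle's resolution strategy:

configurations.all {
    resolutionStrategy {
        // Example versions only - pin all Jackson artifacts to one release so
        // jackson-module-scala can't lag behind the rest of the Jackson jars
        force 'com.fasterxml.jackson.core:jackson-databind:2.6.3',
              'com.fasterxml.jackson.module:jackson-module-scala_2.10:2.6.3'
    }
}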
Note: always check your transitive dependencies and their versions...
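A mismatch like the one above is easy to spot by printing the resolved dependency tree (exact flags depend on your Gradle version):

gradle dependencies --configuration compile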