I am learning Spark SQL and I am confused about Spark's SessionCatalog and the Hive Metastore.
I learned that HiveSessionStateBuilder creates a new Analyzer together with a HiveSessionCatalog.
Does this mean we can join a Hive table and an in-memory (temporary) table in a single Spark SQL query? (A minimal sketch of what I mean follows the quoted source below.)
/**
 * Create a [[HiveSessionCatalog]].
 */
override protected lazy val catalog: HiveSessionCatalog = {
  val catalog = new HiveSessionCatalog(
    externalCatalog,
    session.sharedState.globalTempViewManager,
    new HiveMetastoreCatalog(session),
    functionRegistry,
    conf,
    SessionState.newHadoopConf(session.sparkContext.hadoopConfiguration, conf),
    sqlParser,
    resourceLoader)
  parentState.foreach(_.catalog.copyStateTo(catalog))
  catalog
}
/**
 * A logical query plan `Analyzer` with rules specific to Hive.
 */
override protected def analyzer: Analyzer = new Analyzer(catalog, conf) {
  override val extendedResolutionRules: Seq[Rule[LogicalPlan]] =
    new ResolveHiveSerdeTable(session) +:
      new FindDataSourceTable(session) +:
      new ResolveSQLOnFile(session) +:
      customResolutionRules

  override val postHocResolutionRules: Seq[Rule[LogicalPlan]] =
    new DetermineTableStats(session) +:
      RelationConversions(conf, catalog) +:
      PreprocessTableCreation(session) +:
      PreprocessTableInsertion(conf) +:
      DataSourceAnalysis(conf) +:
      HiveAnalysis +:
      customPostHocResolutionRules

  override val extendedCheckRules: Seq[LogicalPlan => Unit] =
    PreWriteCheck +:
      customCheckRules
}
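For what it's worth, this is the kind of query I have in mind. It is only a sketch: the table name hive_table and the view name mem_view are placeholders I made up, and I assume the session is built with Hive support enabled.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-and-temp-view-join")
  .enableHiveSupport()   // back the session catalog with the Hive Metastore
  .getOrCreate()
import spark.implicits._

// An in-memory table: a local Dataset registered as a temporary view
Seq((1, "a"), (2, "b")).toDF("id", "label").createOrReplaceTempView("mem_view")

// hive_table is assumed to already exist in the Hive Metastore
val joined = spark.sql(
  """SELECT h.*, m.label
    |FROM hive_table h
    |JOIN mem_view m ON h.id = m.id""".stripMargin)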
Yes, Spark can join Hive tables and in-memory tables. The common abstraction over both kinds of data source is the DataFrame. So if you read a Hive table and, say, a Parquet file into DataFrames:
val dfhive = spark.read.table("hivetable")
val df = spark.read.parquet("sqltable")
then both df and dfhive are DataFrames, and you can join them using either the DataFrame API or Spark SQL.
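For instance, a minimal sketch of the join itself (the shared id join key is an assumption for illustration):

// DataFrame API join on an assumed "id" column present in both DataFrames
val joinedApi = dfhive.join(df, Seq("id"))

// Or register both DataFrames as temporary views and join them with Spark SQL
dfhive.createOrReplaceTempView("hive_view")
df.createOrReplaceTempView("mem_view")
val joinedSql = spark.sql(
  "SELECT * FROM hive_view h JOIN mem_view m ON h.id = m.id")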