Can't save table to Hive metastore, HDP 3.0

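I can no longer save tables to the Hive database through the metastore. With spark.sql I can see the tables in Spark, but the same tables do not show up in the Hive database. I tried the code below, but it does not store the table in Hive. How do I configure the Hive metastore? The Spark version is 2.3.1.

If you need more details, please leave a comment.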

%spark
import org.apache.spark.sql.SparkSession

val spark = (SparkSession
  .builder
  .appName("interfacing spark sql to hive metastore without configuration file")
  .config("hive.metastore.uris", "thrift://xxxxxx.xxx:9083") // replace with your Hive metastore service's thrift URL
  .enableHiveSupport() // don't forget to enable Hive support
  .getOrCreate())

spark.conf.get("spark.sql.warehouse.dir") // Output: res2: String = /apps/spark/warehouse
spark.conf.get("hive.metastore.warehouse.dir") // NoSuchElementException
spark.conf.get("spark.hadoop.hive.metastore.uris") // NoSuchElementException

var df = (spark
  .read
  .format("parquet")
  .load(dataPath))

df.createOrReplaceTempView("my_temp_table");
spark.sql("drop table if exists my_table");
spark.sql("create table my_table using hive as select * from my_temp_table");
spark.sql("show tables").show(false) // I see my_table in the default database

Update after @catpaws' answer: in HDP 3.0 and later, Hive and Spark use independent catalogs.

Saving a table to the Spark catalog:

df.createOrReplaceTempView("my_temp_table");
spark.sql("create table my_table as select * from my_temp_table");

versus

Saving a table to the Hive catalog:

val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(spark).build()
hive.createTable("newTable")
  .ifNotExists()
  .column("ws_sold_time_sk", "bigint")
  ...// x 200 columns
  .column("ws_ship_date_sk", "bigint")
  .create()
df.write.format(HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "newTable")
  .save()

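As you can see, used this way the Hive Warehouse Connector is very impractical for DataFrames with hundreds of columns. Is there any way to save large DataFrames to Hive?

As @catpaws said, Spark and Hive use independent catalogs. To save a DataFrame with many columns using the Hive Warehouse Connector, you can use my function: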

import org.apache.spark.sql.DataFrame

// (Re)creates a Hive table matching the DataFrame's schema, then writes the data through HWC.
// Assumes `hive` is the session built above and HIVE_WAREHOUSE_CONNECTOR is in scope.
def save_table_hwc(df: DataFrame, database: String, tableName: String): Unit = {
  hive.setDatabase(database)
  hive.dropTable(tableName, true, false) // ifExists = true, purge = false
  var table_builder = hive.createTable(tableName)
  for (field <- df.schema.fields) {
    // Strip every character that is not a letter or a decimal digit from the column name
    val name = field.name.replaceAll("[^\\p{L}\\p{Nd}]+", "")
    val data_type = field.dataType.sql
    table_builder = table_builder.column(name, data_type)
  }
  table_builder.create()
  df.write.format(HIVE_WAREHOUSE_CONNECTOR).option("table", tableName).save()
}

save_table_hwc(df1, "default", "table_test1")
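
One thing to watch with that function: the pattern `[^\p{L}\p{Nd}]+` removes everything that is not a letter or a decimal digit, underscores included, so `ws_sold_time_sk` ends up as `wssoldtimesk`, and two different source columns can collapse into the same name. Below is a minimal pure-Scala sketch (no Spark or HWC needed) that applies the same pattern and reports such collisions before any table is created. `ColumnNames`, `sanitize` and `collisions` are hypothetical helper names for illustration, not part of the connector.

```scala
object ColumnNames {
  // Same pattern as in save_table_hwc: drop everything except
  // Unicode letters and decimal digits.
  def sanitize(name: String): String =
    name.replaceAll("[^\\p{L}\\p{Nd}]+", "")

  // Group the original names by their sanitized form and keep only the
  // groups where more than one original name maps to the same result.
  def collisions(names: Seq[String]): Map[String, Seq[String]] =
    names.groupBy(sanitize).filter { case (_, originals) => originals.size > 1 }
}
```

Calling `ColumnNames.collisions(df.schema.fieldNames)` on the DataFrame's column names before writing would show whether the renaming is safe; an empty map means no two columns clash. If underscores should survive, adding `_` inside the character class is one option.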

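From the Hortonworks documentation: In HDP 3.0 and later, Spark and Hive use independent catalogs for accessing SparkSQL or Hive tables on the same or different platforms. A table created by Spark resides in the Spark catalog. A table created by Hive resides in the Hive catalog. Databases fall under the catalog namespace, similar to how tables belong to a database namespace. Although independent, these tables interoperate, and you can see Spark tables in the Hive catalog, but only when using the Hive Warehouse Connector.

Use the write operations of the HWC API to write a DataFrame to Hive.

Update: with HDP 3.1 you can now create a DataFrame, and if the Hive table that represents the DataFrame does not exist, the Hive Warehouse Connector creates it, as shown in the HDP 3.1 documentation: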
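
The connector can only write once Spark knows how to reach HiveServer2 and the Hive metastore. As a rough sketch, the property names below are the ones listed, to the best of my knowledge, in the HDP 3.x HWC documentation; please verify them against your HDP version. Every value is a placeholder, and `HwcConf` and `toSubmitArgs` are hypothetical names that simply render the settings as `--conf` arguments for `spark-shell` or `spark-submit`.

```scala
object HwcConf {
  // Property names as I understand them from the HDP 3.x HWC docs (an assumption
  // to verify); all values are placeholders, not real hosts or paths.
  val settings: Seq[(String, String)] = Seq(
    "spark.sql.hive.hiveserver2.jdbc.url"              -> "jdbc:hive2://<hs2-host>:<port>/",
    "spark.datasource.hive.warehouse.metastoreUri"     -> "thrift://<metastore-host>:9083",
    "spark.datasource.hive.warehouse.load.staging.dir" -> "<hdfs-staging-dir>",
    "spark.hadoop.hive.llap.daemon.service.hosts"      -> "<llap-app-name>",
    "spark.hadoop.hive.zookeeper.quorum"               -> "<zk-host>:2181"
  )

  // Render each setting as a "--conf", "key=value" pair of arguments.
  def toSubmitArgs(conf: Seq[(String, String)]): Seq[String] =
    conf.flatMap { case (key, value) => Seq("--conf", s"$key=$value") }
}
```

`HwcConf.toSubmitArgs(HwcConf.settings).mkString(" ")` gives a string that can be appended to the `spark-shell` command line once the placeholders are filled in. The same keys can also be set in the Zeppelin interpreter settings or in the cluster's Spark configuration instead.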

// df: a DataFrame created from any source
val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(spark).build()
df.write.format(HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "my_Table")
  .save()
