I created a Hive external table with InputFormat `org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat` and OutputFormat `org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat`.
How can I read these Hive table files from HDFS using PySpark?
If you want Spark SQL to use the Hive metastore and access Hive tables, you must add hive-site.xml to the Spark conf folder. For example:
from os.path import abspath

from pyspark.sql import SparkSession

# warehouse_location points to the default location for managed databases and tables
warehouse_location = abspath('spark-warehouse')

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("SELECT * FROM <YOUR TABLE NAME>").show()