Reading Parquet and ORC HDFS Files with PySpark



I created a Hive external table with INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' and OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'.

How can I read these Hive table files from HDFS using PySpark?

If you want Spark SQL to use the Hive metastore and access Hive tables, you must add hive-site.xml to Spark's conf folder. For example:

from os.path import abspath
from pyspark.sql import SparkSession
from pyspark.sql import Row
# warehouse_location points to the default location for managed databases and tables
warehouse_location = abspath('spark-warehouse')
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()
spark.sql("SELECT * FROM <YOUR TABLE NAME>").show()
