How to avoid a shuffle in a sort-merge join on partitioned columns

We have two datasets that are persisted as follows:

Dataset A:

datasetA.repartition(5, datasetA.col("region"))
                .write().mode(saveMode)
                .format("parquet")
                .partitionBy("region")
                .bucketBy(5, "studentId")
                .sortBy("studentId")
                .option("path", parquetFilesDirectory)
                .saveAsTable("database.tableA");

Dataset B:

datasetB.repartition(5, datasetB.col("region"))
                .write().mode(saveMode)
                .format("parquet")
                .partitionBy("region")
                .bucketBy(5, "studentId")
                .sortBy("studentId")
                .option("path", parquetFilesDirectory)
                .saveAsTable("database.tableB");

Joining on region and studentId causes a shuffle. Here is the join query:

spark.sql("Select count(*)  from  database.tableA a, database.tableB b where a.studentId = b.studentId and a.region = b.region").show()

What could be causing the shuffle when we include the partition key, and how can we mitigate it?

Yes, you can mitigate the shuffle by using pre-sorted, bucketed tables.
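To illustrate why bucketing on the join key can remove the need for a shuffle, here is a toy model in plain Java. It does not use Spark. It assumes a simple modulo as a stand-in for Spark's bucket hash function, and the class and method names (BucketedJoinSketch, bucketOf, bucketize, countMatchesBucketByBucket) are made up for this sketch. The point it shows: when both sides are split into the same number of buckets with the same function, equal studentId values always land in the same bucket index, so bucket i of one table only ever needs to be compared with bucket i of the other.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a bucketed join. Not Spark's actual implementation. */
public class BucketedJoinSketch {

    // Assumption: a non-negative modulo stands in for Spark's bucket hash.
    static int bucketOf(long studentId, int numBuckets) {
        return (int) Math.floorMod(studentId, (long) numBuckets);
    }

    // Split the ids into numBuckets lists, one list per bucket.
    static List<List<Long>> bucketize(long[] studentIds, int numBuckets) {
        List<List<Long>> buckets = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) {
            buckets.add(new ArrayList<>());
        }
        for (long id : studentIds) {
            buckets.get(bucketOf(id, numBuckets)).add(id);
        }
        return buckets;
    }

    // Count matching pairs while comparing only bucket i of A with bucket i of B.
    // No row ever has to move to a different bucket to find its match.
    static long countMatchesBucketByBucket(long[] a, long[] b, int numBuckets) {
        List<List<Long>> bucketsA = bucketize(a, numBuckets);
        List<List<Long>> bucketsB = bucketize(b, numBuckets);
        long matches = 0;
        for (int i = 0; i < numBuckets; i++) {
            for (long x : bucketsA.get(i)) {
                for (long y : bucketsB.get(i)) {
                    if (x == y) {
                        matches++;
                    }
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        long[] tableA = {1, 2, 3, 7, 12};
        long[] tableB = {2, 7, 7, 9, 12};
        // Same bucket count (5) on both sides, mirroring bucketBy(5, "studentId").
        System.out.println(countMatchesBucketByBucket(tableA, tableB, 5));
    }
}
```

In this model, the bucket-by-bucket pairing is only valid because both sides use the same bucket count and the same bucketing key. If either differed, matching rows could sit in different bucket indexes and the data would have to be redistributed first.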
