Confusion about Spark's partitioning strategy on DataFrames



In all four print statements below I get the same number of partitions (200). The initial DataFrame (df1) is grouped on 4 columns (account_id, schema_name, table_name, column_name), but the subsequent DataFrames only use 3 of those fields (account_id, schema_name, table_name). Can someone explain whether Spark preserves the partitioning strategy across STEP 1 through STEP 4, so that it does not need to shuffle the data again after STEP 1?

val query1: String = "SELECT account_id, schema_name, table_name, 
column_name, COLLECT_SET(u.query_id) AS query_id_set FROM usage_tab 
GROUP BY account_id, schema_name, table_name, column_name"
val df1 = session.sql(query1)
println("1 " + df.rdd.getNumPartitions)

df1.createOrReplaceTempView("wtftempusage")
val query2 = "SELECT DISTINCT account_id, schema_name, table_name 
FROM wtftempusage"
val df2 = session.sql(query2)
println("2 " + df2.rdd.getNumPartitions)

//MyFuncIterator retains all columns for df2 and adds an additional column
val extendedDF = df2.mapPartitions(MyFuncIterator)
println("3 " + extendedDF.rdd.getNumPartitions)

val joinedDF = df1.join(extendedDF, Seq("account_id", "schema_name", "table_name"))
println("4 " + joinedDF.rdd.getNumPartitions)

Thanks, devj

The default number of shuffle partitions in the DataFrame API is 200.

You can set the default shuffle partitions to a smaller number, e.g.: sqlContext.setConf("spark.sql.shuffle.partitions", "5")
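
Below is a minimal, self-contained sketch of this idea, assuming a Spark 2.x SparkSession (the tiny inline table and all names are made up for illustration, not taken from the question). It shows that a wide transformation such as GROUP BY produces spark.sql.shuffle.partitions partitions, so lowering the setting before running the query reduces the partition count from the 200 default:

import org.apache.spark.sql.SparkSession

object ShufflePartitionsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("shuffle-partitions-demo")
      .getOrCreate()

    // Lower the shuffle partition count before any wide transformation runs.
    spark.conf.set("spark.sql.shuffle.partitions", "5")

    import spark.implicits._
    // Tiny stand-in for usage_tab, just to make the example runnable.
    Seq(
      ("a1", "s1", "t1", "c1", "q1"),
      ("a1", "s1", "t1", "c2", "q2")
    ).toDF("account_id", "schema_name", "table_name", "column_name", "query_id")
      .createOrReplaceTempView("usage_tab")

    val grouped = spark.sql(
      """SELECT account_id, schema_name, table_name, column_name,
        |       COLLECT_SET(query_id) AS query_id_set
        |FROM usage_tab
        |GROUP BY account_id, schema_name, table_name, column_name""".stripMargin)

    // Prints 5 instead of the default 200.
    println(grouped.rdd.getNumPartitions)

    spark.stop()
  }
}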
