PySpark DataFrame manipulation efficiency

Suppose I have the following DataFrame:

+----------+-----+----+-------+
|display_id|ad_id|prob|clicked|
+----------+-----+----+-------+
|       123|  989| 0.9|      0|
|       123|  990| 0.8|      1|
|       123|  999| 0.7|      0|
|       234|  789| 0.9|      0|
|       234|  777| 0.7|      0|
|       234|  769| 0.6|      1|
|       234|  798| 0.5|      0|
+----------+-----+----+-------+

I then perform the following operations to get the final dataset (shown below):

from operator import itemgetter
from pyspark.sql import functions as F
from pyspark.sql.functions import udf, when
from pyspark.sql.types import ArrayType, IntegerType

# Add a new column with the clicked ad_id if clicked == 1, 0 otherwise
df_adClicked = df.withColumn("ad_id_clicked", when(df.clicked==1, df.ad_id).otherwise(0))
# DF -> RDD with tuple : (display_id, (ad_id, prob), ad_id_clicked)
df_blah = df_adClicked.rdd.map(lambda x : (x[0], (x[1], x[2]), x[4])).toDF(["display_id", "ad_id", "clicked_ad_id"])
# Group by display_id and create column with clicked ad_id and list of tuples : (ad_id, prob)
df_blah2 = df_blah.groupby('display_id').agg(F.collect_list('ad_id'), F.max('clicked_ad_id'))
# Define function to sort list of tuples by prob and create list of only ad_ids
def sortByRank(ad_id_list):
    sortedVersion = sorted(ad_id_list, key=itemgetter(1), reverse=True)
    sortedIds = [i[0] for i in sortedVersion]
    return(sortedIds)
# Sort the (ad_id, prob) tuples by using udf/function and create new column ad_id_sorted
sort_ad_id = udf(lambda x : sortByRank(x), ArrayType(IntegerType()))
df_blah3 = df_blah2.withColumn('ad_id_sorted', sort_ad_id('collect_list(ad_id)'))
# Function to change clickedAdId into an array of size 1
def createClickedSet(clickedAdId):
    setOfDocs = [clickedAdId]
    return setOfDocs
clicked_set = udf(lambda y : createClickedSet(y), ArrayType(IntegerType()))
df_blah4 = df_blah3.withColumn('ad_id_set', clicked_set('max(clicked_ad_id)'))
# Select the necessary columns
finalDF = df_blah4.select('display_id', 'ad_id_sorted','ad_id_set')
+----------+--------------------+---------+
|display_id|ad_id_sorted        |ad_id_set|
+----------+--------------------+---------+
|234       |[789, 777, 769, 798]|[769]    |
|123       |[989, 990, 999]     |[990]    |
+----------+--------------------+---------+
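The sorting logic inside sortByRank can be sanity-checked in plain Python, without a Spark session. This is a minimal sketch using the (ad_id, prob) pairs for display_id 123 from the sample data above:

```python
from operator import itemgetter

def sort_by_rank(ad_id_list):
    # Sort (ad_id, prob) tuples by prob, highest first, then keep only the ids
    sorted_version = sorted(ad_id_list, key=itemgetter(1), reverse=True)
    return [i[0] for i in sorted_version]

# (ad_id, prob) pairs for display_id 123 from the sample DataFrame
pairs_123 = [(999, 0.7), (989, 0.9), (990, 0.8)]
print(sort_by_rank(pairs_123))  # [989, 990, 999]
```

This matches the ad_id_sorted value for display_id 123 in the expected output.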

Is there a more efficient way to do this? This set of transformations is the bottleneck in my code. I would appreciate any feedback.

I haven't done any timing comparisons, but I think that by not using any UDFs, Spark should be able to optimize itself better.

#scala:  val dfad = sc.parallelize(Seq((123,989,0.9,0),(123,990,0.8,1),(123,999,0.7,0),(234,789,0.9,0),(234,777,0.7,0),(234,769,0.6,1),(234,798,0.5,0))).toDF("display_id","ad_id","prob","clicked")
# ^^^ that's the only difference (besides putting val in front of variable names) between this Python response and a Scala one
dfad = sc.parallelize(((123,989,0.9,0),(123,990,0.8,1),(123,999,0.7,0),(234,789,0.9,0),(234,777,0.7,0),(234,769,0.6,1),(234,798,0.5,0))).toDF(["display_id","ad_id","prob","clicked"])
dfad.registerTempTable("df_ad")

df1 = sqlContext.sql("SELECT display_id,collect_list(ad_id) ad_id_sorted FROM (SELECT * FROM df_ad SORT BY display_id,prob DESC) x GROUP BY display_id")
+----------+--------------------+
|display_id|        ad_id_sorted|
+----------+--------------------+
|       234|[789, 777, 769, 798]|
|       123|     [989, 990, 999]|
+----------+--------------------+
df2 = sqlContext.sql("SELECT display_id, max(ad_id) as ad_id_set from df_ad where clicked=1 group by display_id")
+----------+---------+
|display_id|ad_id_set|
+----------+---------+
|       234|      769|
|       123|      990|
+----------+---------+
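The logic of that second query can be simulated in plain Python on the sample rows (a sketch of the semantics, not a Spark run), which shows why max(ad_id) over the clicked=1 rows yields exactly one id per display_id:

```python
# rows: (display_id, ad_id, prob, clicked), taken from the sample data
rows = [(123, 989, 0.9, 0), (123, 990, 0.8, 1), (123, 999, 0.7, 0),
        (234, 789, 0.9, 0), (234, 777, 0.7, 0), (234, 769, 0.6, 1),
        (234, 798, 0.5, 0)]

# Filter clicked == 1, then take max(ad_id) per display_id
clicked = {}
for display_id, ad_id, prob, was_clicked in rows:
    if was_clicked == 1:
        clicked[display_id] = max(clicked.get(display_id, ad_id), ad_id)

print(clicked)  # {123: 990, 234: 769}
```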

final_df = df1.join(df2,"display_id")
+----------+--------------------+---------+
|display_id|        ad_id_sorted|ad_id_set|
+----------+--------------------+---------+
|       234|[789, 777, 769, 798]|      769|
|       123|     [989, 990, 999]|      990|
+----------+--------------------+---------+
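The join step can likewise be checked with plain dictionaries. This is a sketch of an inner join on display_id, using the results of the two queries above:

```python
# Results of the two queries above, keyed by display_id
ad_id_sorted = {234: [789, 777, 769, 798], 123: [989, 990, 999]}
ad_id_set = {234: 769, 123: 990}

# Inner join on display_id: keep only keys present in both sides
final_rows = [(d, ad_id_sorted[d], ad_id_set[d])
              for d in ad_id_sorted if d in ad_id_set]
print(final_rows)
```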

I didn't put ad_id_set into an array because you are computing a max, and max can only return one value. I'm sure you could wrap it in an array if you really need to.

I've included the subtle Scala differences in case someone in the future has a similar problem in Scala.
