简化代码并减少 pyspark 数据帧中的联接语句



我有一个数据框,如下所示pyspark

df.show()

+---+-------------+
| id|       device|
+---+-------------+
|  3|      mac pro|
|  1|       iphone|
|  1|android phone|
|  1|   windows pc|
|  1|   spy camera|
|  2|   spy camera|
|  2|       iphone|
|  3|   spy camera|
|  3|         cctv|
+---+-------------+

phone_list = ['iphone', 'android phone', 'nokia']
pc_list = ['windows pc', 'mac pro']
security_list = ['spy camera', 'cctv']
from pyspark.sql.functions import col
phones_df = df.filter(col('device').isin(phone_list)).groupBy("id").count().selectExpr("id as id", "count as phones")

phones_df.show()

+---+------+
| id|phones|
+---+------+
|  1|     2|
|  2|     1|
+---+------+

pc_df = df.filter(col('device').isin(pc_list)).groupBy("id").count().selectExpr("id as id", "count as pc")

pc_df.show()

+---+---+
| id| pc|
+---+---+
|  1|  1|
|  3|  1|
+---+---+
security_df = df.filter(col('device').isin(security_list)).groupBy("id").count().selectExpr("id as id", "count as security")

security_df.show()

+---+--------+
| id|security|
+---+--------+
|  1|       1|
|  2|       1|
|  3|       2|
+---+--------+

然后,我想对所有三个数据框进行完整的外部连接。我已经做了如下。

full_df = phones_df.join(pc_df, phones_df.id == pc_df.id, 'full_outer').select(f.coalesce(phones_df.id, pc_df.id).alias('id'), phones_df.phones, pc_df.pc)
final_df = full_df.join(security_df, full_df.id == security_df.id, 'full_outer').select(f.coalesce(full_df.id, security_df.id).alias('id'), full_df.phones, full_df.pc, security_df.security)

Final_df.show()

+---+------+----+--------+
| id|phones|  pc|security|
+---+------+----+--------+
|  1|     2|   1|       1|
|  2|     1|null|       1|
|  3|  null|   1|       2|
+---+------+----+--------+

我能够得到我想要的东西,但想简化我的代码。

1) I want to create phones_df, pc_df, security_df in a better way because I am using the same code while creating these data frames I want to reduce this.
2) I want to simplify the join statements to one statement

我该怎么做?谁能解释一下。

以下是使用when.otherwise将列映射到类别,然后将其pivot到所需输出的一种方法:

import pyspark.sql.functions as F
df.withColumn('cat', 
F.when(df.device.isin(phone_list), 'phones').otherwise(
F.when(df.device.isin(pc_list), 'pc').otherwise(
F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()
+---+----+------+--------+
| id|  pc|phones|security|
+---+----+------+--------+
|  1|   1|     2|       1|
|  3|   1|  null|       2|
|  2|null|     1|       1|
+---+----+------+--------+

最新更新