I want to group by multiple columns, but one of the columns has string-type values. Posting a sample dataset here; the DataFrame I'm using has several int columns and one string column.
Given DataFrame:
# +----+----------------+
# |Year|           Movie|
# +----+----------------+
# |2020|       Inception|
# |2018|   The Godfather|
# |2018| The Dark Knight|
# |2015|    12 Angry Men|
# |2020|Schindler's List|
# |2015|    Pulp Fiction|
# |2018|      Fight Club|
# +----+----------------+
Desired DataFrame:
# +----+--------------------------------------------+
# |Year|Movie                                       |
# +----+--------------------------------------------+
# |2020|[Inception, Schindler's List]               |
# |2018|[The Godfather, The Dark Knight, Fight Club]|
# |2015|[12 Angry Men, Pulp Fiction]                |
# +----+--------------------------------------------+
You can group by Year and use collect_set to aggregate all the movies for each year into one list (note that collect_set also removes duplicates):
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Wrap the chained calls in parentheses so the .withColumn lines
# continue the same expression.
df = (
    spark.createDataFrame([
        ('2020', 'Inception'),
        ('2018', 'The Godfather'),
        ('2018', 'The Dark Knight'),
        ('2015', '12 Angry Men'),
        ('2020', 'Schindlers List'),
        ('2015', 'Pulp Fiction'),
        ('2018', 'Fight Club')
    ], ["Year", "Movie"])
    .withColumn('Year', F.col('Year').cast('integer'))
    .withColumn('Movie', F.col('Movie').cast('string'))
)

df.show()
# +----+---------------+
# |Year| Movie|
# +----+---------------+
# |2020| Inception|
# |2018| The Godfather|
# |2018|The Dark Knight|
# |2015| 12 Angry Men|
# |2020|Schindlers List|
# |2015| Pulp Fiction|
# |2018| Fight Club|
# +----+---------------+
df.groupBy("Year") \
  .agg(F.collect_set("Movie")) \
  .show(truncate=False)
# +----+--------------------------------------------+
# |Year|collect_set(Movie) |
# +----+--------------------------------------------+
# |2018|[The Godfather, Fight Club, The Dark Knight]|
# |2015|[Pulp Fiction, 12 Angry Men] |
# |2020|[Schindlers List, Inception] |
# +----+--------------------------------------------+