I want to group by multiple columns, but one of the columns has string-type values. Posting a sample dataset here; the DataFrame I'm using has several int columns and one string column.
Given DataFrame:
# +----+----------------+
# |Year|           Movie|
# +----+----------------+
# |2020|       Inception|
# |2018|   The Godfather|
# |2018| The Dark Knight|
# |2015|    12 Angry Men|
# |2020|Schindler's List|
# |2015|    Pulp Fiction|
# |2018|      Fight Club|
# +----+----------------+
Desired DataFrame:
# +----+--------------------------------------------+
# |Year|Movie                                       |
# +----+--------------------------------------------+
# |2020|[Inception, Schindler's List]               |
# |2018|[The Godfather, The Dark Knight, Fight Club]|
# |2015|[12 Angry Men, Pulp Fiction]                |
# +----+--------------------------------------------+
You can group by Year and use collect_set to aggregate all the movies for each year into one list (note that collect_set also removes duplicates):
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Wrap the chained calls in parentheses so the .withColumn lines
# continue the same expression.
df = (
    spark.createDataFrame([
        ('2020', 'Inception'),
        ('2018', 'The Godfather'),
        ('2018', 'The Dark Knight'),
        ('2015', '12 Angry Men'),
        ('2020', 'Schindlers List'),
        ('2015', 'Pulp Fiction'),
        ('2018', 'Fight Club')
    ], ["Year", "Movie"])
    .withColumn('Year', F.col('Year').cast('integer'))
    .withColumn('Movie', F.col('Movie').cast('string'))
)

df.show()
# +----+---------------+
# |Year| Movie|
# +----+---------------+
# |2020| Inception|
# |2018| The Godfather|
# |2018|The Dark Knight|
# |2015| 12 Angry Men|
# |2020|Schindlers List|
# |2015| Pulp Fiction|
# |2018| Fight Club|
# +----+---------------+
df.groupBy("Year") \
  .agg(F.collect_set("Movie")) \
  .show(truncate=False)
# +----+--------------------------------------------+
# |Year|collect_set(Movie) |
# +----+--------------------------------------------+
# |2018|[The Godfather, Fight Club, The Dark Knight]|
# |2015|[Pulp Fiction, 12 Angry Men] |
# |2020|[Schindlers List, Inception] |
# +----+--------------------------------------------+