I have a DataFrame (or I can turn it into two DataFrames):
+---+-----------------+--------------------+
| id| director_name| movie_title|
+---+-----------------+--------------------+
| 01| james cameron| avatar|
| 02| gore verbinski|pirates caribbean...|
| 03| sam mendes| spectre|
| 04|christopher nolan| dark knight rises|
| 05| doug walker|star wars episode...|
| 06| andrew stanton| john carter|
| 07| sam raimi| spider man 3|
| 08| nathan greno| tangled|
| 09| joss whedon| avengers age ultron|
| 10| david yates|harry potter half...|
+---+-----------------+--------------------+
I want it to look like this:
+---+--------------------+
| id| key|
+---+--------------------+
| 01| james cameron|
| 02| gore verbinski|
| 03| sam mendes|
| 04| christopher nolan|
| 05| doug walker|
| 06| andrew stanton|
| 07| sam raimi|
| 08| nathan greno|
| 09| joss whedon|
| 10| david yates|
| 01| avatar|
| 02|pirates caribbean...|
| 03| spectre|
| 04| dark knight rises|
| 05|star wars episode...|
| 06| john carter|
| 07| spider man 3|
| 08| tangled|
| 09| avengers age ultron|
| 10|harry potter half...|
+---+--------------------+
I suspect the Pandas method append() does the same thing, but I can't find a pySpark solution. Apologies if I've overlooked something!
I'd like to avoid converting to Pandas, since this df could get quite large…
Use stack.
Example:
df.show()
#+---+----+----+
#| id|name|dept|
#+---+----+----+
#| 1| a| b|
#| 2| c| d|
#+---+----+----+
df.selectExpr("stack(2, string(id), name, string(id), dept) as (id, key)").show()
#+---+---+
#| id|key|
#+---+---+
#| 1| a|
#| 1| b|
#| 2| c|
#| 2| d|
#+---+---+
Based on the answer given here, you can do the following:
director_name_df = df.select(['id', 'director_name']).withColumnRenamed('director_name', 'key')
movie_title_df = df.select(['id', 'movie_title']).withColumnRenamed('movie_title', 'key')
df = director_name_df.union(movie_title_df)
You can drop the duplicates afterwards if you want. Hope this helps.
This would be the simplest (How to stack two columns into a single column in PySpark?):
from pyspark.sql.functions import col
movie_info = spark.createDataFrame(
[
[1, 'James Cameron', 'Avatar'],
[2, 'James Cameron', 'Titanic'],
[3, "director_man", 'movie_2']
]
).toDF(*['id', 'director', 'movie'])
df = (
movie_info
.selectExpr('id', 'explode(array(director, movie))')
.withColumnRenamed('col', 'key')
)
df.show()
But you can also do it like this:
from pyspark.sql.functions import col
# Create data
movie_info = spark.createDataFrame(
[
[1, 'James Cameron', 'Avatar'],
[2, 'James Cameron', 'Titanic'],
[3, "director_man", 'movie_2']
]
).toDF(*['id', 'director', 'movie'])
# Only select id and director and change col to key
directors = (
movie_info
.select(
'id',
col('director').alias('key')
)
)
# Only select id and movie and switch name to key
movies = (
movie_info
.select(
'id',
col('movie').alias('key')
)
)
# Union together
unpivot = (
directors
.unionAll(movies)
)
unpivot.show()
+---+-------------+
| id| key|
+---+-------------+
| 1|James Cameron|
| 2|James Cameron|
| 3| director_man|
| 1| Avatar|
| 2| Titanic|
| 3| movie_2|
+---+-------------+