Appending PySpark DataFrames vertically (like pandas.append())



I have a DataFrame (or I could turn it into two DataFrames):

+---+-----------------+--------------------+
| id|    director_name|         movie_title|
+---+-----------------+--------------------+
| 01|    james cameron|              avatar|
| 02|   gore verbinski|pirates caribbean...|
| 03|       sam mendes|             spectre|
| 04|christopher nolan|   dark knight rises|
| 05|      doug walker|star wars episode...|
| 06|   andrew stanton|         john carter|
| 07|        sam raimi|        spider man 3|
| 08|     nathan greno|             tangled|
| 09|      joss whedon| avengers age ultron|
| 10|      david yates|harry potter half...|
+---+-----------------+--------------------+

I want it to look like this:

+---+--------------------+
| id|                 key|
+---+--------------------+
| 01|       james cameron|
| 02|      gore verbinski|
| 03|          sam mendes|
| 04|   christopher nolan|
| 05|         doug walker|
| 06|      andrew stanton|
| 07|           sam raimi|
| 08|        nathan greno|
| 09|         joss whedon|
| 10|         david yates|
| 01|              avatar|
| 02|pirates caribbean...|
| 03|             spectre|
| 04|   dark knight rises|
| 05|star wars episode...|
| 06|         john carter|
| 07|        spider man 3|
| 08|             tangled|
| 09| avengers age ultron|
| 10|harry potter half...|
+---+--------------------+

I'm guessing the pandas method append() does exactly this, but I can't find a PySpark equivalent. Apologies if I've overlooked something!

I'd like to avoid converting to pandas, since this DataFrame could get quite large…

Use the stack() SQL function.

Example:

df.show()
#+---+----+----+
#| id|name|dept|
#+---+----+----+
#|  1|   a|   b|
#|  2|   c|   d|
#+---+----+----+
df.selectExpr("stack(2,string(id),name,string(id),dept)as (id,key)").show()
#+---+---+
#| id|key|
#+---+---+
#|  1|  a|
#|  1|  b|
#|  2|  c|
#|  2|  d|
#+---+---+
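
Applied to the DataFrame from the question, a minimal sketch (assuming it is named df and that id, director_name, and movie_title are all string columns, so no casts are needed):

# stack 2 (id, value) pairs per input row: one for the director, one for the title
df.selectExpr(
    "stack(2, id, director_name, id, movie_title) as (id, key)"
).show()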

Based on the answer given here, you can do the following:

director_name_df = df.select(['id', 'director_name']).withColumnRenamed('director_name', 'key')
movie_title_df = df.select(['id', 'movie_title']).withColumnRenamed('movie_title', 'key')
df = director_name_df.union(movie_title_df)

Afterwards you can drop duplicates if you want. Hope this helps.
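
For example, with the standard dropDuplicates method (a sketch; this removes rows where both id and key are identical):

df = df.dropDuplicates()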

This would be the simplest (How do I stack two columns into a single column in PySpark?):

movie_info = spark.createDataFrame(
    [
        [1, 'James Cameron', 'Avatar'],
        [2, 'James Cameron', 'Titanic'],
        [3, 'director_man', 'movie_2'],
    ]
).toDF(*['id', 'director', 'movie'])

df = (
    movie_info
    .selectExpr('id', 'explode(array(director, movie))')
    .withColumnRenamed('col', 'key')
)
df.show()
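
For this small example, df.show() should print the exploded pairs interleaved per input row, roughly:

+---+-------------+
| id|          key|
+---+-------------+
|  1|James Cameron|
|  1|       Avatar|
|  2|James Cameron|
|  2|      Titanic|
|  3| director_man|
|  3|      movie_2|
+---+-------------+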

But you can also do it like this:

from pyspark.sql.functions import col

# Create data
movie_info = spark.createDataFrame(
    [
        [1, 'James Cameron', 'Avatar'],
        [2, 'James Cameron', 'Titanic'],
        [3, 'director_man', 'movie_2'],
    ]
).toDF(*['id', 'director', 'movie'])

# Only select id and director, renaming the column to key
directors = (
    movie_info
    .select(
        'id',
        col('director').alias('key')
    )
)

# Only select id and movie, renaming the column to key
movies = (
    movie_info
    .select(
        'id',
        col('movie').alias('key')
    )
)

# Union together
unpivot = (
    directors
    .unionAll(movies)
)
unpivot.show()
+---+-------------+
| id|          key|
+---+-------------+
|  1|James Cameron|
|  2|James Cameron|
|  3| director_man|
|  1|       Avatar|
|  2|      Titanic|
|  3|      movie_2|
+---+-------------+
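
Note that DataFrame.unionAll has been deprecated since Spark 2.0 in favor of union; both behave like SQL's UNION ALL and keep duplicates, so the call above could equivalently be written as:

unpivot = directors.union(movies)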
