I have an existing DataFrame and I want to add a new column to it.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
from pyspark.sql import Row
numbers = [1, 2, 30, 4]
rdd1 = sc.parallelize(numbers)
row_rdd = rdd1.map(lambda x: Row(x))
test_df = sqlContext.createDataFrame(row_rdd, ['numbers'])
-------------------------------------------------------------------------
test_df.show()
-------------------------------------------------------------------------
+-------+
|numbers|
+-------+
| 1|
| 2|
| 30|
| 4|
+-------+
-------------------------------------------------------------------------
# add a list as a new column to the existing DataFrame
rating = [40, 32, 12, 21]
rdd2 = sc.parallelize(rating)
row_rdd2 = rdd2.map(lambda x: Row(x))
test_df2 = test_df.withColumn("rating", row_rdd2)
What I expect:
+-------+--------+
|numbers|rating |
+-------+--------+
| 1| 40|
| 2| 32|
| 30| 12|
| 4| 21|
+-------+--------+
What I actually get:
AssertionError: col should be Column
How can I solve this and add a list as a column to an existing PySpark DataFrame?
Thanks.
The error occurs because `withColumn` expects a `Column` expression over the existing DataFrame, not an RDD or a Python list. A quick way to achieve what you want is to create a join key on both DataFrames and join on that key.
from pyspark.sql.window import Window as W
from pyspark.sql import functions as F

# give each DataFrame a unique (but not necessarily consecutive) id
test_df = test_df.withColumn("idx", F.monotonically_increasing_id())
test_df2 = test_df2.withColumn("idx", F.monotonically_increasing_id())

# turn the ids into consecutive row numbers so the two sides line up
windowSpec = W.orderBy("idx")
test_df = test_df.withColumn("idx", F.row_number().over(windowSpec))
test_df2 = test_df2.withColumn("idx", F.row_number().over(windowSpec))

# join on the row number, then drop the helper column
df = test_df.join(test_df2, on='idx', how='inner').drop("idx")
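To see what the index-based join produces, here is a toy sketch in plain Python (no Spark required; the names `indexed_numbers`, `rating_by_idx`, and `joined` are illustrative, not part of any API). It pairs each value with its row number, exactly as the `idx` column does, and then performs an inner join on that index:

```python
numbers = [1, 2, 30, 4]
rating = [40, 32, 12, 21]

# pair each value with its row number, mirroring the idx column
indexed_numbers = list(enumerate(numbers))  # [(0, 1), (1, 2), (2, 30), (3, 4)]
rating_by_idx = dict(enumerate(rating))     # {0: 40, 1: 32, 2: 12, 3: 21}

# "inner join" on the index, then drop it
joined = [(n, rating_by_idx[i]) for i, n in indexed_numbers if i in rating_by_idx]
# → [(1, 40), (2, 32), (30, 12), (4, 21)]
```

Note that in Spark the row order of a DataFrame is not guaranteed in general, so this trick is only safe when both DataFrames were built from in-order sources, as they are here.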