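I have a Spark DataFrame and I want to add a new column with some specific values. I tried using the column functions, but they do not work as expected. I want either a new column with a specific value, or to replace an existing column.
See this example.
I have a DataFrame:
>>> df.show()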
+-------+----+-----+---+
|   name|year|month|day|
+-------+----+-----+---+
|    Ali|2014|    9|  1|
|  Matei|2015|   10| 26|
|Michael|2015|   10| 25|
|Reynold|2015|   10| 25|
|Patrick|2015|    9|  1|
+-------+----+-----+---+
To add the same value to every row, I can use lit, which wraps a literal value in a column expression:
>>> from pyspark.sql.functions import lit
>>> df.withColumn('my_new_column', lit('testing info for all')).show()
+-------+----+-----+---+--------------------+
|   name|year|month|day|       my_new_column|
+-------+----+-----+---+--------------------+
|    Ali|2014|    9|  1|testing info for all|
|  Matei|2015|   10| 26|testing info for all|
|Michael|2015|   10| 25|testing info for all|
|Reynold|2015|   10| 25|testing info for all|
|Patrick|2015|    9|  1|testing info for all|
+-------+----+-----+---+--------------------+
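If you want to attach several different values to each row, you can use explode, which emits one output row per array element, so every input row is repeated: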
>>> from pyspark.sql.functions import array, explode, lit
>>> df.withColumn('my_new_column',
...               explode(array(lit('testing info for all'),
...                             lit('other testing again')))).show()
+-------+----+-----+---+--------------------+
|   name|year|month|day|       my_new_column|
+-------+----+-----+---+--------------------+
|    Ali|2014|    9|  1|testing info for all|
|    Ali|2014|    9|  1| other testing again|
|  Matei|2015|   10| 26|testing info for all|
|  Matei|2015|   10| 26| other testing again|
|Michael|2015|   10| 25|testing info for all|
|Michael|2015|   10| 25| other testing again|
|Reynold|2015|   10| 25|testing info for all|
|Reynold|2015|   10| 25| other testing again|
|Patrick|2015|    9|  1|testing info for all|
|Patrick|2015|    9|  1| other testing again|
+-------+----+-----+---+--------------------+