Modifying a Spark DataFrame column



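I have a Spark DataFrame and I want to add a new column with some specific values. I tried using the column functions, but it does not work as expected. I want either a new column with specific values, or to replace an existing column.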

Take this example.

I have a DataFrame:
>>> df.show()
+-------+----+-----+---+
|   name|year|month|day|
+-------+----+-----+---+
|    Ali|2014|    9|  1|
|  Matei|2015|   10| 26|
|Michael|2015|   10| 25|
|Reynold|2015|   10| 25|
|Patrick|2015|    9|  1|
+-------+----+-----+---+

I want to add the same piece of information to every row, which I can do with lit:

>>> from pyspark.sql.functions import lit
>>> df.withColumn('my_new_column', lit('testing info for all')).show()
+-------+----+-----+---+--------------------+
|   name|year|month|day|       my_new_column|
+-------+----+-----+---+--------------------+
|    Ali|2014|    9|  1|testing info for all|
|  Matei|2015|   10| 26|testing info for all|
|Michael|2015|   10| 25|testing info for all|
|Reynold|2015|   10| 25|testing info for all|
|Patrick|2015|    9|  1|testing info for all|
+-------+----+-----+---+--------------------+

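If you want to attach a list of different values to each row, you can use explode on an array. Note that each original row is then repeated once per value:

>>> from pyspark.sql.functions import array, explode, lit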
>>> df.withColumn('my_new_column', 
...               explode(array(lit('testing info for all'), 
...                             lit('other testing again')))).show()
+-------+----+-----+---+--------------------+
|   name|year|month|day|       my_new_column|
+-------+----+-----+---+--------------------+
|    Ali|2014|    9|  1|testing info for all|
|    Ali|2014|    9|  1| other testing again|
|  Matei|2015|   10| 26|testing info for all|
|  Matei|2015|   10| 26| other testing again|
|Michael|2015|   10| 25|testing info for all|
|Michael|2015|   10| 25| other testing again|
|Reynold|2015|   10| 25|testing info for all|
|Reynold|2015|   10| 25| other testing again|
|Patrick|2015|    9|  1|testing info for all|
|Patrick|2015|    9|  1| other testing again|
+-------+----+-----+---+--------------------+
