How to create a column of arrays in a PySpark dataframe, with values from one column and lengths from another column



In a PySpark dataframe, I need to create a new ArrayType(StringType()) column whose values come from a StringType() column and whose length comes from the length of another ArrayType(StringType()) column. Something like array_repeat with a dynamic length.

Input

+-------------+-------------+
|col1         |col2         |
+-------------+-------------+
|[1,2]        |'a'          |
|[1,2,3]      |'b'          |
+-------------+-------------+

Output

+-------------+-------------+----------------+
|col1         |col2         |col3            |
+-------------+-------------+----------------+
|[1,2]        |'a'          |['a','a']       |
|[1,2,3]      |'b'          |['b','b','b']   |
+-------------+-------------+----------------+

Thanks!

Another option -

Load the test data provided

val df = spark.sql(
  """
    |select col1, col2
    |from values
    | (array(1, 2), 'a'),
    | (array(1, 2, 3), 'b')
    | T(col1, col2)
  """.stripMargin)
df.show(false)
df.printSchema()
/**
* +---------+----+
* |col1     |col2|
* +---------+----+
* |[1, 2]   |a   |
* |[1, 2, 3]|b   |
* +---------+----+
*
* root
* |-- col1: array (nullable = false)
* |    |-- element: integer (containsNull = false)
* |-- col2: string (nullable = false)
*/

Alternative 1


// alternative-1
df.withColumn("col3", expr("array_repeat(col2, size(col1))"))
.show(false)
/**
* +---------+----+---------+
* |col1     |col2|col3     |
* +---------+----+---------+
* |[1, 2]   |a   |[a, a]   |
* |[1, 2, 3]|b   |[b, b, b]|
* +---------+----+---------+
*/

Alternative 2


// alternative-2
df.withColumn("col3", expr(s"TRANSFORM(col1, x -> col2)"))
.show(false)
/**
* +---------+----+---------+
* |col1     |col2|col3     |
* +---------+----+---------+
* |[1, 2]   |a   |[a, a]   |
* |[1, 2, 3]|b   |[b, b, b]|
* +---------+----+---------+
*/
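
Since the question is about PySpark: the same TRANSFORM higher-order function is reachable from Python through expr. A minimal sketch, assuming Spark >= 2.4 (where higher-order functions were introduced) and an existing spark session:

import pyspark.sql.functions as f

df = spark.createDataFrame([([1, 2], 'a'), ([1, 2, 3], 'b')], ['col1', 'col2'])
# TRANSFORM maps each element of col1 to the (constant) value of col2,
# producing an array whose length equals size(col1)
df.withColumn('col3', f.expr("TRANSFORM(col1, x -> col2)")).show()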

Using array_repeat + size:

import pyspark.sql.functions as f

df = spark.createDataFrame([[[1, 2], 'a'], [[1, 2, 3], 'b']], ['col1', 'col2'])
# Repeat the value of col2 once per element of col1
df.withColumn('col3', f.array_repeat('col2', f.size('col1'))).show()
+---------+----+---------+
|     col1|col2|     col3|
+---------+----+---------+
|   [1, 2]|   a|   [a, a]|
|[1, 2, 3]|   b|[b, b, b]|
+---------+----+---------+
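
Note that passing a Column as the count argument of array_repeat requires a reasonably recent Spark; if your version's Python wrapper only accepts an integer count, the whole expression can be pushed down through expr instead. A sketch under that assumption:

import pyspark.sql.functions as f

# size(col1) is evaluated per row inside the SQL expression,
# so no Python-side count is needed
df.withColumn('col3', f.expr("array_repeat(col2, size(col1))")).show()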

If for some reason that doesn't work, you can write a UDF to do it:

from pyspark.sql.types import StringType, ArrayType
import pyspark.sql.functions as f

@f.udf(ArrayType(StringType()))
def repeat_sizeof(value, arr):
    # Repeat the scalar value once per element of the array
    return [value] * len(arr)

df.withColumn('col3', repeat_sizeof('col2', 'col1')).show()
+---------+----+---------+
|     col1|col2|     col3|
+---------+----+---------+
|   [1, 2]|   a|   [a, a]|
|[1, 2, 3]|   b|[b, b, b]|
+---------+----+---------+
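
Keep in mind that a UDF ships each row through Python, so the built-in array_repeat / TRANSFORM routes will generally be faster. If null rows are possible, a guarded variant avoids a TypeError; a sketch (the null handling here is an assumption about the behavior you want):

from pyspark.sql.types import StringType, ArrayType
import pyspark.sql.functions as f

@f.udf(ArrayType(StringType()))
def repeat_sizeof_safe(value, arr):
    # len(None) would raise, so return null for incomplete rows
    if value is None or arr is None:
        return None
    return [value] * len(arr)

df.withColumn('col3', repeat_sizeof_safe('col2', 'col1')).show()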
