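In a PySpark DataFrame, I need to create a new ArrayType(StringType()) column whose values come from a StringType() column and whose length equals the length of another ArrayType(StringType()) column. Something like array_repeat with a dynamic length.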
Input:
+-------------+-------------+
|col1         |col2         |
+-------------+-------------+
|[1,2]        |'a'          |
|[1,2,3]      |'b'          |
+-------------+-------------+
Output:
+-------------+-------------+----------------+
|col1         |col2         |col3            |
+-------------+-------------+----------------+
|[1,2]        |'a'          |['a','a']       |
|[1,2,3]      |'b'          |['b','b','b']   |
+-------------+-------------+----------------+
Thanks
Another option:
Load the provided test data:
val df = spark.sql(
"""
|select col1, col2
|from values
| (array(1, 2), 'a'),
| (array(1, 2, 3), 'b')
| T(col1, col2)
""".stripMargin)
df.show(false)
df.printSchema()
/**
  * +---------+----+
  * |col1     |col2|
  * +---------+----+
  * |[1, 2]   |a   |
  * |[1, 2, 3]|b   |
  * +---------+----+
  *
  * root
  *  |-- col1: array (nullable = false)
  *  |    |-- element: integer (containsNull = false)
  *  |-- col2: string (nullable = false)
  */
Alternative 1:
// alternative-1: repeat col2 as many times as col1 has elements
import org.apache.spark.sql.functions.expr
df.withColumn("col3", expr("array_repeat(col2, size(col1))"))
  .show(false)
/**
  * +---------+----+---------+
  * |col1     |col2|col3     |
  * +---------+----+---------+
  * |[1, 2]   |a   |[a, a]   |
  * |[1, 2, 3]|b   |[b, b, b]|
  * +---------+----+---------+
  */
Alternative 2:
// alternative-2: map every element of col1 to the value of col2
df.withColumn("col3", expr("TRANSFORM(col1, x -> col2)"))
  .show(false)
/**
  * +---------+----+---------+
  * |col1     |col2|col3     |
  * +---------+----+---------+
  * |[1, 2]   |a   |[a, a]   |
  * |[1, 2, 3]|b   |[b, b, b]|
  * +---------+----+---------+
  */
Use array_repeat + size:
import pyspark.sql.functions as f
df = spark.createDataFrame([[[1,2],'a'], [[1,2,3], 'b']], ['col1', 'col2'])
df.withColumn('col3', f.array_repeat('col2', f.size('col1'))).show()
+---------+----+---------+
| col1|col2| col3|
+---------+----+---------+
| [1, 2]| a| [a, a]|
|[1, 2, 3]| b|[b, b, b]|
+---------+----+---------+
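If this does not work for some reason (for example, array_repeat requires Spark 2.4+), you can write a UDF to do it:
from pyspark.sql.types import StringType, ArrayType
import pyspark.sql.functions as f

@f.udf(ArrayType(StringType()))
def repeat_sizeof(value, arr):
    # Repeat `value` once for each element of `arr`.
    return [value] * len(arr)

df.withColumn('col3', repeat_sizeof('col2', 'col1')).show()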
+---------+----+---------+
| col1|col2| col3|
+---------+----+---------+
| [1, 2]| a| [a, a]|
|[1, 2, 3]| b|[b, b, b]|
+---------+----+---------+
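One caveat with the UDF approach: the function body calls len() on the array value, so a row where col1 is null would raise a TypeError inside the UDF. Below is a minimal sketch of a null-safe version of the same logic, written as a plain Python function so it can be checked without a Spark session. The name repeat_to_length is hypothetical, and returning null when the array is null is an assumption about the behavior you want.

```python
from typing import List, Optional, Sequence


def repeat_to_length(value: Optional[str],
                     arr: Optional[Sequence]) -> Optional[List[Optional[str]]]:
    """Repeat `value` once per element of `arr`.

    Assumption: a null (None) array yields None rather than an error.
    """
    if arr is None:
        # len(None) would raise TypeError, so handle null arrays explicitly.
        return None
    return [value] * len(arr)


print(repeat_to_length("a", [1, 2]))     # ['a', 'a']
print(repeat_to_length("b", [1, 2, 3]))  # ['b', 'b', 'b']
print(repeat_to_length("c", None))       # None
```

To use it in Spark, it should be enough to wrap the function with f.udf(repeat_to_length, ArrayType(StringType())) and call the result with ('col2', 'col1') as in the UDF above.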