Apache Spark -- Assign the result of a UDF to multiple DataFrame columns

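I'm using pyspark, loading a large csv file into a DataFrame with spark-csv. As a pre-processing step I need to apply a variety of operations to the data available in one of the columns (which contains a json string). This will return X values, each of which needs to be stored in its own separate column.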

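That functionality will be implemented in a UDF. However, I am not sure how to return a list of values from that UDF and feed them into individual columns. Below is a simple example: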

(...)
from pyspark.sql.functions import udf
def udf_test(n):
    return [n/2, n%2]
test_udf=udf(udf_test)

df.select('amount','trans_date').withColumn("test", test_udf("amount")).show(5)

This produces the following:

+------+----------+--------------------+
|amount|trans_date|                test|
+------+----------+--------------------+
|  28.0|2016-02-07|         [14.0, 0.0]|
| 31.01|2016-02-07|[15.5050001144409...|
| 13.41|2016-02-04|[6.70499992370605...|
| 307.7|2015-02-17|[153.850006103515...|
| 22.09|2016-02-05|[11.0450000762939...|
+------+----------+--------------------+
only showing top 5 rows

What would be the best way to store the two (in this example) values returned by the udf in separate columns? Right now they are typed as strings:

df.select('amount','trans_date').withColumn("test", test_udf("amount")).printSchema()
root
 |-- amount: float (nullable = true)
 |-- trans_date: string (nullable = true)
 |-- test: string (nullable = true)

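It is not possible to create multiple top-level columns from a single UDF call, but you can create a new struct. This requires a UDF with a specified returnType: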

from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, FloatType
schema = StructType([
    StructField("foo", FloatType(), False),
    StructField("bar", FloatType(), False)
])
def udf_test(n):
    return (n / 2, n % 2) if n and n != 0.0 else (float('nan'), float('nan'))
test_udf = udf(udf_test, schema)
df = sc.parallelize([(1, 2.0), (2, 3.0)]).toDF(["x", "y"])
foobars = df.select(test_udf("y").alias("foobar"))
foobars.printSchema()
## root
##  |-- foobar: struct (nullable = true)
##  |    |-- foo: float (nullable = false)
##  |    |-- bar: float (nullable = false)

You can further flatten the schema with a simple select:

foobars.select("foobar.foo", "foobar.bar").show()
## +---+---+
## |foo|bar|
## +---+---+
## |1.0|0.0|
## |1.5|1.0|
## +---+---+
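
To keep the original columns next to the new ones, which is what the question asks for, the struct can be attached with withColumn and then expanded with the "foobar.*" star syntax. The following is only a minimal sketch under stated assumptions: the names split_amount and add_foo_bar are made up for illustration, df is assumed to be an existing DataFrame with a numeric input column, and pyspark is imported inside the helper purely so that the plain Python part can be tried without Spark.

```python
import math


def split_amount(n):
    """Plain Python logic behind the UDF: returns (n / 2, n % 2) as floats.

    Falsy input (None or 0) yields a pair of NaNs, mirroring the answer above.
    """
    if not n:
        return (math.nan, math.nan)
    n = float(n)  # make sure both results are Python floats for FloatType
    return (n / 2, n % 2)


def add_foo_bar(df, column):
    """Hypothetical helper: return df with two extra top-level columns, foo and bar."""
    # Lazy import (an assumption of this sketch) so split_amount works without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StructType, StructField, FloatType

    schema = StructType([
        StructField("foo", FloatType(), False),
        StructField("bar", FloatType(), False),
    ])
    split_udf = udf(split_amount, schema)
    # Attach the struct, then expand its fields while keeping every original column.
    return (df.withColumn("foobar", split_udf(column))
              .select(*df.columns, "foobar.*"))
```

With the df from the answer above, add_foo_bar(df, "y") should return a DataFrame whose columns are x, y, foo and bar.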

See also: Derive multiple columns from a single column in a Spark DataFrame

You can use flatMap to get the desired DataFrame in one go:

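# withColumn needs a UDF column expression, not the bare udf function;
# test_udf('y') is used here as an example (the struct-returning UDF from the answer above)
df = df.withColumn('udf_results', test_udf('y'))
# your_new_schema is a placeholder for the StructType of the flattened columns
df4 = df.select('udf_results').rdd.flatMap(lambda x: x).toDF(schema=your_new_schema)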
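
As a rough sketch of why this works, assuming a UDF declared with a StructType return type: each element of the RDD is a Row holding a single struct, so iterating it inside flatMap yields the inner Row of results, and toDF rebuilds a DataFrame from those. The helper names and parameters below are hypothetical and not part of any Spark API.

```python
def unwrap_single_field(row):
    """What the flatMap lambda does here: iterate a one-field row,
    yielding the inner struct it contains."""
    return list(row)


def flatten_with_flatmap(df, struct_udf, column, new_schema):
    """Hypothetical helper: apply a struct-returning UDF and flatten via the RDD API.

    df is an existing DataFrame, struct_udf a UDF declared with a StructType
    return type, and new_schema the StructType of the flattened result.
    """
    with_results = df.withColumn("udf_results", struct_udf(column))
    return (with_results.select("udf_results")
            .rdd.flatMap(unwrap_single_field)
            .toDF(schema=new_schema))
```

Note that this variant keeps only the UDF output, since the select drops the original columns before the flattening step.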
