How to rename the first-level keys of a struct with PySpark in Azure Databricks



I want to rename the keys of the first-level objects in my payload.

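import json
from pyspark.sql.functions import *
# 'shape' is an illustrative value added so the data matches the schema printed below
ds = {'Fruits': {'apple': {'color': 'red', 'shape': 'round'}, 'mango': {'color': 'green'}}, 'Vegetables': None}
# Serialize to JSON text first; the dict's Python repr (with None) is not valid JSON
df = spark.read.json(sc.parallelize([json.dumps(ds)]))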
df.printSchema()
"""
root
|-- Fruits: struct (nullable = true)
|    |-- apple: struct (nullable = true)
|    |    |-- color: string (nullable = true)
|    |    |-- shape: string (nullable = true)
|    |-- mango: struct (nullable = true)
|    |    |-- color: string (nullable = true)
|-- Vegetables: string (nullable = true)
"""

Expected output:

root
|-- Fruits: struct (nullable = true)
|    |-- APPLE: struct (nullable = true)
|    |    |-- color: string (nullable = true)
|    |    |-- shape: string (nullable = true)
|    |-- MANGO: struct (nullable = true)
|    |    |-- color: string (nullable = true)
|-- Vegetables: string (nullable = true)

In this case, I want to rename the first-level keys to uppercase.

If I had a map type, I could use transform_keys:

df.select(transform_keys("Fruits", lambda k, _: upper(k)).alias("data_upper")).display()

Unfortunately, I have a struct type.

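AnalysisException: cannot resolve 'transform_keys(Fruits, lambdafunction(upper(x_18), x_18, y_19))' due to argument data type mismatch: argument 1 requires map type, however, 'Fruits' is of struct<apple:struct<color:string,shape:string>,mango:struct<color:string>> type.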

I am using Databricks Runtime 10.4 LTS (includes Apache Spark 3.2.1 and Scala 2.12).

The function you are trying to use (transform_keys) is meant for map-type columns. Your column is a struct.

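You can use withField. With Spark's default case-insensitive name resolution (spark.sql.caseSensitive is false), writing a field named APPLE replaces the existing apple field rather than adding a second one: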

from pyspark.sql import functions as F
ds = spark.createDataFrame([], 'Fruits struct<apple:struct<color:string,shape:string>,mango:struct<color:string>>, Vegetables string')
ds.printSchema()
# root
#  |-- Fruits: struct (nullable = true)
#  |    |-- apple: struct (nullable = true)
#  |    |    |-- color: string (nullable = true)
#  |    |    |-- shape: string (nullable = true)
#  |    |-- mango: struct (nullable = true)
#  |    |    |-- color: string (nullable = true)
#  |-- Vegetables: string (nullable = true)
ds = ds.withColumn('Fruits', F.col('Fruits').withField('APPLE', F.col('Fruits.apple')))
ds = ds.withColumn('Fruits', F.col('Fruits').withField('MANGO', F.col('Fruits.mango')))
ds.printSchema()
# root
#  |-- Fruits: struct (nullable = true)
#  |    |-- APPLE: struct (nullable = true)
#  |    |    |-- color: string (nullable = true)
#  |    |    |-- shape: string (nullable = true)
#  |    |-- MANGO: struct (nullable = true)
#  |    |    |-- color: string (nullable = true)
#  |-- Vegetables: string (nullable = true)

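You can also recreate the struct, but you then need to list every field of the struct when rebuilding it; any field you leave out is dropped.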

ds = ds.withColumn('Fruits', F.struct(
    F.col('Fruits.apple').alias('APPLE'),
    F.col('Fruits.mango').alias('MANGO'),
))
ds.printSchema()
# root
#  |-- Fruits: struct (nullable = false)
#  |    |-- APPLE: struct (nullable = true)
#  |    |    |-- color: string (nullable = true)
#  |    |    |-- shape: string (nullable = true)
#  |    |-- MANGO: struct (nullable = true)
#  |    |    |-- color: string (nullable = true)
#  |-- Vegetables: string (nullable = true)
