Convert an array of doubles to a string in Spark SQL

I am trying to read data from JSON that contains an array of lat/long values, for example [48.597315,-43.206085], and I want to parse them in Spark SQL as a single string. Is there a way to do that?

My JSON input looks like this.

{"id":"11700","position":{"type":"Point","coordinates":[48.597315,-43.206085]}}

I am trying to push this to an RDBMS store, and when I try to cast position.coordinates to string, it gives me

Can't get JDBC type for array<string>

because the destination data type is NVARCHAR. Any help is appreciated!

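You can read the JSON file into a DataFrame, then 1) use concat_ws to stringify the lat/lon array into a single column, and 2) use struct to reassemble the position struct-type column, as follows: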

// jsonfile:
// {"id":"11700","position":{"type":"Point","coordinates":[48.597315,-43.206085]}}
import org.apache.spark.sql.functions._
val df = spark.read.json("/path/to/jsonfile")
// printSchema:
// root
//  |-- id: string (nullable = true)
//  |-- position: struct (nullable = true)
//  |    |-- coordinates: array (nullable = true)
//  |    |    |-- element: double (containsNull = true)
//  |    |-- type: string (nullable = true)
df.withColumn("coordinates", concat_ws(",", $"position.coordinates")).
select($"id", struct($"coordinates", $"position.type").as("position")).
show(false)
// +-----+----------------------------+
// |id   |position                    |
// +-----+----------------------------+
// |11700|[48.597315,-43.206085,Point]|
// +-----+----------------------------+
// printSchema:
// root
//  |-- id: string (nullable = true)
//  |-- position: struct (nullable = false)
//  |    |-- coordinates: string (nullable = false)
//  |    |-- type: string (nullable = true)

[UPDATE]

Using Spark SQL:

df.createOrReplaceTempView("position_table")
spark.sql("""
select id, concat_ws(',', position.coordinates) as position_coordinates
from position_table
""").
show(false)
//+-----+--------------------+
//|id   |position_coordinates|
//+-----+--------------------+
//|11700|48.597315,-43.206085|
//|11800|49.611254,-43.90223 |
//+-----+--------------------+
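Since the goal in the question is a JDBC write, it may help to confirm that no array-typed column is left before writing. The following is a minimal spark-shell style sketch, not a definitive implementation: the column names position_type and position_coordinates, the table name "positions", and the jdbcUrl and connectionProperties values are all assumptions made for illustration. The array is cast to array<string> explicitly so the example does not rely on implicit type coercion.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws}
import org.apache.spark.sql.types.ArrayType

// Local session for the sketch; in spark-shell a session named `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("flatten-for-jdbc").getOrCreate()
import spark.implicits._

// Same sample record as in the question, read from an in-memory Dataset[String].
val json = Seq("""{"id":"11700","position":{"type":"Point","coordinates":[48.597315,-43.206085]}}""")
val df = spark.read.json(json.toDS())

// Flatten the struct and join the coordinates into one comma-separated string column.
// The output column names here are assumptions, not part of the original question.
val flat = df.select(
  col("id"),
  col("position.type").as("position_type"),
  concat_ws(",", col("position.coordinates").cast("array<string>")).as("position_coordinates")
)

// Collect the names of any remaining array-typed top-level columns.
val arrayColumns = flat.schema.fields.filter(_.dataType.isInstanceOf[ArrayType]).map(_.name)
require(arrayColumns.isEmpty, s"Array columns remain: ${arrayColumns.mkString(", ")}")

// Hypothetical write; jdbcUrl, the table name, and connectionProperties are placeholders.
// flat.write.mode("append").jdbc(jdbcUrl, "positions", connectionProperties)
```

With every column reduced to a plain string, the JDBC writer should no longer need to map an array type for this data.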

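The given column must be converted to a string before it is loaded into the target data source. For example, the code below creates a new column position.coordinates whose value is the given array of doubles joined into one string, by using the array's toString and then removing the brackets.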

// The backslashes are doubled because this is a regular Scala string literal.
df.withColumn("position.coordinates", regexp_replace($"position.coordinates".cast("string"), "\\[|\\]", ""))

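Alternatively, you can use a UDF to create a custom transformation function on the Row object. That way you can keep the nested structure of the column. The following source (answer 2) may give you an idea of how to use a UDF for your case: Spark UDF with nested structure as input parameter.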
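As a rough illustration of the UDF route, the sketch below applies a UDF to the coordinates array only (rather than to the whole Row, as the linked answer does) and rebuilds the position struct around the result. The name coordsToString is hypothetical, and the code is written in spark-shell style under the assumption that the input has the schema shown earlier.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, udf}

// Local session for the sketch; in spark-shell a session named `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("coords-udf").getOrCreate()
import spark.implicits._

// Hypothetical UDF: joins the doubles with commas and passes null through unchanged.
val coordsToString = udf((xs: Seq[Double]) => if (xs == null) null else xs.mkString(","))

// Same sample record as in the question, read from an in-memory Dataset[String].
val json = Seq("""{"id":"11700","position":{"type":"Point","coordinates":[48.597315,-43.206085]}}""")
val df = spark.read.json(json.toDS())

// Rebuild the nested struct so that position.coordinates becomes a string field.
val result = df.select(
  col("id"),
  struct(
    coordsToString(col("position.coordinates")).as("coordinates"),
    col("position.type").as("type")
  ).as("position")
)

result.printSchema()
```

The built-in concat_ws is usually the simpler choice for this case; a UDF is mainly worth considering when the string format needs custom logic, such as fixed decimal places or a different element order.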
