How to convert an array of arrays (string type) to structs - Spark/Scala?



I have a DataFrame:

+---------------------------------------------------------------+---+
|family_name                                                    |id |
+---------------------------------------------------------------+---+
|[[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]|id1|
|[[Tom, Riddle, Single, 888-888-8888]]                          |id2|
+---------------------------------------------------------------+---+
root
|-- family_name: string (nullable = true)
|-- id: string (nullable = true)

I want to convert the column family_name into an array of named structs:

`family_name` array<struct<f_name:string,l_name:string,status:string,ph_no:string>>

I can convert family_name to an array as follows:

val sch = ArrayType(ArrayType(StringType))
val fam_array = data
  .withColumn("family_name_clean", regexp_replace($"family_name", "\\[\\[", "["))
  .withColumn("family_name_clean_clean1", regexp_replace($"family_name_clean", "\\]\\]", "]"))
  .withColumn("ar", toArray($"family_name_clean_clean1")) // toArray: a user-defined helper (definition not shown)
  //.withColumn("ar1", from_json($"ar", sch))
fam_array.show(false)
fam_array.printSchema()
+---------------------------------------------------------------+---+--------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
|family_name                                                    |id |family_name_clean                                             |family_name_clean_clean1                                     |ar                                                                     |
+---------------------------------------------------------------+---+--------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
|[[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]|id1|[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]|[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]|[[John,  Doe,  Married,  999-999-9999], [Jane,  Doe,  Married, Wife, ]]|
|[[Tom, Riddle, Single, 888-888-8888]]                          |id2|[Tom, Riddle, Single, 888-888-8888]]                          |[Tom, Riddle, Single, 888-888-8888]                          |[[Tom,  Riddle,  Single,  888-888-8888]]                               |
+---------------------------------------------------------------+---+--------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
root
|-- family_name: string (nullable = true)
|-- id: string (nullable = true)
|-- family_name_clean: string (nullable = true)
|-- family_name_clean_clean1: string (nullable = true)
|-- ar: array (nullable = true)
|    |-- element: string (containsNull = true)

sch is the schema variable of the desired type.

How do I convert the column ar to array&lt;struct&lt;&gt;&gt;?

Edit:

I am using Spark 2.3.2

To create an array of structs from an array of strings, you can use the struct function, which builds a struct from a list of columns, combined with the element_at function, which extracts the element of an array at a given index.

To solve your specific problem, as you correctly stated, you need to do two things:

  • First, transform the string into an array of arrays of strings
  • Then, use this array of arrays of strings to build your structs
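
The two steps can be previewed on a single sample row with plain Scala string operations, no Spark session needed. This is only an illustrative sketch; the variable names `raw`, `level1`, and `level2` are made up for the example:

```scala
// Step 1 in plain Scala, on the first sample row
val raw = "[[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]"

// remove the outer [[ and ]], then split on "],[" to get one string per person
val level1: Array[String] = raw.replaceAll("\\[\\[|]]", "").split("],\\[")

// split each person string on "," to get the array of arrays of strings;
// the -1 limit keeps trailing empty fields (here, the missing phone number)
val level2: Array[Array[String]] = level1.map(_.split(",", -1))
```

Note the `-1` limit on the inner split: without it, Java's `split` drops trailing empty strings, and the empty phone number in the second row would disappear.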

In Spark 3.0 and later

With Spark 3.0, we can perform all these steps using Spark built-in functions.

For the first step, I would do the following:

  • first, remove the [[ and ]] from the family_name string using the regexp_replace function
  • then, create the first array level by splitting the string with the split function
  • then, create the second array level by splitting each element of the previous array with the transform and split functions

For the second step, build the structs using the struct function, picking elements out of each array with the element_at function.

So the complete code for Spark 3.0 and later is as follows, with data as the input dataframe:

import org.apache.spark.sql.functions.{col, element_at, regexp_replace, split, struct, transform}

val result = data
  .withColumn(
    "family_name",
    transform(
      split( // first level split
        regexp_replace(col("family_name"), "\\[\\[|]]", ""), // remove [[ and ]]
        "],\\["
      ),
      x => split(x, ",") // split each element of the first level array
    )
  )
  .withColumn("family_name", transform(col("family_name"), x => struct(
    element_at(x, 1).as("f_name"), // array indexes start at 1
    element_at(x, 2).as("l_name"),
    element_at(x, 3).as("status"),
    element_at(x, -1).as("ph_no") // -1 gets the last element of the array
  )))

In Spark 2.x

With Spark 2.x, we have to rely on a user-defined function. First, we need to define a case class representing the struct:

case class FamilyName(
  f_name: String,
  l_name: String,
  status: String,
  ph_no: String
)

Then, define the user-defined function and apply it to the input dataframe:

import org.apache.spark.sql.functions.{col, udf}

val extract_array = udf((familyName: String) => familyName
  .replaceAll("\\[\\[|]]", "")
  .split("],\\[")
  .map { familyName =>
    val explodedFamilyName = familyName.split(",", -1) // -1 keeps trailing empty fields
    FamilyName(
      f_name = explodedFamilyName(0),
      l_name = explodedFamilyName(1),
      status = explodedFamilyName(2),
      ph_no = explodedFamilyName(explodedFamilyName.length - 1)
    )
  }
)

val result = data.withColumn("family_name", extract_array(col("family_name")))
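
Since the body of that UDF is plain string manipulation, its behavior can be sketched and checked without a Spark session. The `ParseFamilyName` object below is a hypothetical standalone version of the same logic, with an added `trim` that normalizes the leading spaces the Spark version keeps:

```scala
// Spark-free sketch of the UDF body; ParseFamilyName is an illustrative name.
object ParseFamilyName {
  case class FamilyName(f_name: String, l_name: String, status: String, ph_no: String)

  def parse(raw: String): Array[FamilyName] =
    raw
      .replaceAll("\\[\\[|]]", "") // strip the outer [[ and ]]
      .split("],\\[")              // first level: one chunk per person
      .map { chunk =>
        // -1 keeps trailing empty strings, so a missing phone number yields ""
        val fields = chunk.split(",", -1).map(_.trim)
        FamilyName(fields(0), fields(1), fields(2), fields(fields.length - 1))
      }
}
```

Taking the last field as ph_no (rather than index 3) is what makes rows with an extra value, like Jane's, still map the trailing empty field to the phone number.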
Result

If you have the following data dataframe:

+---------------------------------------------------------------+---+
|family_name                                                    |id |
+---------------------------------------------------------------+---+
|[[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]|id1|
|[[Tom, Riddle, Single, 888-888-8888]]                          |id2|
+---------------------------------------------------------------+---+

You will get the following result dataframe:

+-----------------------------------------------------------------+---+
|family_name                                                      |id |
+-----------------------------------------------------------------+---+
|[{John,  Doe,  Married,  999-999-9999}, {Jane,  Doe,  Married, }]|id1|
|[{Tom,  Riddle,  Single,  888-888-8888}]                         |id2|
+-----------------------------------------------------------------+---+

With the following schema:

root
|-- family_name: array (nullable = true)
|    |-- element: struct (containsNull = false)
|    |    |-- f_name: string (nullable = true)
|    |    |-- l_name: string (nullable = true)
|    |    |-- status: string (nullable = true)
|    |    |-- ph_no: string (nullable = true)
|-- id: string (nullable = true)
