Replacing the separator in an Array[Long] column of a Spark DataFrame

I am reading a JSON file into a Spark DataFrame in Scala. I have a JSON field like

"areaGlobalIdList":[2389,3,2,1,2147,2142,2518]

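Spark automatically infers the data type of this field as Array[Long]. I tried concat_ws, but it seems to work only for Array[String]. When I tried converting the column to Array[String], the output came out as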

scala> val cmrdd = sc.textFile("/user/nkthn/cm.json")
scala> val cmdf = sqlContext.read.json(cmrdd)
scala> val dfResults = cmdf.select($"areaGlobalIdList".cast(StringType)).withColumn("AREAGLOBALIDLIST", regexp_replace($"areaGlobalIdList" , ",", "." ))
scala> dfResults.show(20,false)

+------------------------------------------------------------------+
|AREAGLOBALIDLIST                                                  |
+------------------------------------------------------------------+
|org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@6364b584|
+------------------------------------------------------------------+

I would like the output to be

[2389.3.2.1.2147.2142.2518]

Any help would be much appreciated.

Given the schema of the areaGlobalIdList column as

 |-- areaGlobalIdList: array (nullable = true)
 |    |-- element: long (containsNull = false)

you can achieve this with a simple udf function:

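import org.apache.spark.sql.functions._
// The array column arrives in the UDF as a Scala sequence (a WrappedArray here); join its elements with "."
val concatWithDot = udf((array: collection.mutable.WrappedArray[Long]) => array.mkString("."))
// Overwrite the column with the joined string, e.g. 2389.3.2.1.2147.2142.2518
df.withColumn("areaGlobalIdList", concatWithDot($"areaGlobalIdList")).show(false)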
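Note that mkString(".") on its own produces the IDs without the surrounding square brackets shown in the desired output. If the brackets matter, the three-argument form of mkString can add them. A minimal sketch in plain Scala (no Spark needed to try it); the helper name joinWithDots is hypothetical and not part of the original answer:

```scala
// Hypothetical helper (an assumption, not from the original answer):
// joins the IDs with "." and wraps the result in square brackets.
def joinWithDots(ids: Seq[Long]): String =
  ids.mkString("[", ".", "]") // arguments: prefix, separator, suffix

// Sample values taken from the areaGlobalIdList field in the question.
val sample = Seq(2389L, 3L, 2L, 1L, 2147L, 2142L, 2518L)
println(joinWithDots(sample)) // prints [2389.3.2.1.2147.2142.2518]
```

The same mkString("[", ".", "]") call should work as a drop-in for mkString(".") inside the concatWithDot udf above, since the array passed to the udf is also a Scala Seq.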
