Scala Apache Spark: Non-standard characters in column names



I call the following code:

  propertiesDF.select(
        col("timestamp"), col("coordinates")(0) as "lon", 
        col("coordinates")(1) as "lat", 
        col("properties.tide (above mllw)") as "tideAboveMllw",
        col("properties.wind speed") as "windSpeed")

This gives me the following error:

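org.apache.spark.sql.AnalysisException: No such struct field tide (above mllw) in air temperature, atmospheric pressure, dew point, dominant wave period, mean wave direction, name, program name, significant wave height, tide (above mllw), visibility, water temperature, wind direction, wind speed;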

But such a struct field clearly does exist. (The error message itself lists it.)

The schema is as follows:

 root
 |-- timestamp: long (nullable = true)
 |-- coordinates: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- properties: struct (nullable = true)
 |    |-- air temperature: double (nullable = true)
 |    |-- atmospheric pressure: double (nullable = true)
 |    |-- dew point: double (nullable = true)
          .
          .
          .
 |    |-- tide (above mllw):: string (nullable = true)
          .
          .
          .

The input is read as JSON, like this:

val df = sqlContext.read.json(dirName)

How do I handle parentheses in column names?

You should avoid names like this in the first place, but you can split the access path:

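import org.apache.spark.sql.functions.{col, lit, struct}

// One-row DataFrame with a struct column whose field names contain spaces and parentheses
val df = spark.range(1).select(struct(
  lit(123).as("tide (above mllw)"),
  lit(1).as("wind speed")
).as("properties"))
// Extract the field by name instead of using a dotted path
df.select(col("properties").getItem("tide (above mllw)"))
// or
df.select(col("properties")("tide (above mllw)"))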

Or enclose the problematic field in backticks:

df.select(col("properties.`tide (above mllw)`"))
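
For illustration, the backtick form could be applied to the full query from the question roughly as follows. This is only a sketch: the object name `BacktickSelect`, the helper names, and the sample values are made up here, and `sampleData` is a hypothetical stand-in for the asker's `propertiesDF`, with the field names taken as printed in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, lit, struct}

object BacktickSelect {
  // Hypothetical stand-in for propertiesDF: same column layout as in the
  // question, but with invented values and only two struct fields.
  def sampleData(spark: SparkSession): DataFrame =
    spark.range(1).select(
      lit(0L).as("timestamp"),
      array(lit(-122.4), lit(37.8)).as("coordinates"),
      struct(
        lit("1.5").as("tide (above mllw)"),
        lit(3.2).as("wind speed")
      ).as("properties"))

  // Flatten the nested fields, quoting each awkward field name in backticks
  // and giving it a plain alias.
  def flatten(df: DataFrame): DataFrame =
    df.select(
      col("timestamp"),
      col("coordinates")(0).as("lon"),
      col("coordinates")(1).as("lat"),
      col("properties.`tide (above mllw)`").as("tideAboveMllw"),
      col("properties.`wind speed`").as("windSpeed"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("backtick-select")
      .getOrCreate()
    flatten(sampleData(spark)).printSchema()
    spark.stop()
  }
}
```

Aliasing right away means the rest of the pipeline no longer has to deal with the original names.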

Both solutions assume the data has the following structure (based on the access path you used in your query):

df.printSchema
// root
//  |-- properties: struct (nullable = false)
//  |    |-- tide (above mllw): integer (nullable = false)
//  |    |-- wind speed: integer (nullable = false)
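
If it is unclear whether your data matches that structure, one option is to list the struct's field names exactly as Spark sees them. The sketch below assumes a struct column named `properties`; `FieldNames` and `nestedFieldNames` are hypothetical names introduced here, not part of Spark.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{lit, struct}
import org.apache.spark.sql.types.StructType

object FieldNames {
  // Names of the fields nested under `column`, or an empty Seq if that
  // column is not a struct. Throws if `column` does not exist in the schema.
  def nestedFieldNames(df: DataFrame, column: String): Seq[String] =
    df.schema(column).dataType match {
      case s: StructType => s.fieldNames.toSeq
      case _             => Seq.empty
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("field-names")
      .getOrCreate()
    val df = spark.range(1).select(struct(
      lit(123).as("tide (above mllw)"),
      lit(1).as("wind speed")
    ).as("properties"))
    // Brackets make leading or trailing characters in a name easy to spot.
    nestedFieldNames(df, "properties").foreach(n => println(s"[$n]"))
    spark.stop()
  }
}
```

Any name printed this way can be passed unchanged to `getItem` or wrapped in backticks inside `col`.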

According to the documentation, you could try using single quotes, like this:

 propertiesDF.select(
        col("timestamp"), col("coordinates")(0) as "lon", 
        col("coordinates")(1) as "lat", 
        col("'properties.tide (above mllw)'") as "tideAboveMllw",
        col("properties.wind speed") as "windSpeed")
