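I'm new to Spark and Scala and I'm stuck on this exception. I'm trying to add some extra fields (StructFields) to the existing StructType of a column retrieved from a DataFrame using Spark SQL, and I get the exception below.
Code snippet: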
val dfStruct:StructType=parquetDf.select("columnname").schema
dfStruct.add("newField","IntegerType",true)
Exception in thread "main"
org.apache.spark.sql.types.DataTypeException: Unsupported dataType: IntegerType. If you have a struct and a field name of it has any special characters, please use backticks (`) to quote that field name, e.g. `x+y`. Please note that backtick itself is not supported in a field name.
at org.apache.spark.sql.types.DataTypeParser$class.toDataType(DataTypeParser.scala:95)
at org.apache.spark.sql.types.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:107)
at org.apache.spark.sql.types.DataTypeParser$.parse(DataTypeParser.scala:111)
I can see some open issues on JIRA related to this exception, but I couldn't make much sense of them. I'm using Spark 1.5.1.
https://mail-archives.apache.org/mod_mbox/spark-issues/201508.mbox/%3CJIRA.12852533.1438855066000.143133.1440397426473@Atlassian.JIRA%3E
https://issues.apache.org/jira/browse/SPARK-9685
When you use StructType.add with the following signature:
add(name: String, dataType: String, nullable: Boolean)
the dataType string should correspond to either .simpleString or .typeName. For IntegerType, that is int:
import org.apache.spark.sql.types._
IntegerType.simpleString
// String = int
or integer:
IntegerType.typeName
// String = integer
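If you need the string name for a type other than IntegerType, you can ask the type itself in the same way. This is only a small sketch for the Spark shell; the particular types listed are arbitrary examples, not something from the question:

```scala
import org.apache.spark.sql.types._

// Arbitrary example types (an assumption for illustration only)
val someTypes: Seq[DataType] = Seq(IntegerType, LongType, DoubleType, StringType)

// Print both string forms for each type so you can see what to pass as dataType
someTypes.foreach { t =>
  println(s"$t -> typeName = ${t.typeName}, simpleString = ${t.simpleString}")
}
```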
So what you need is something like this:
val schema = StructType(Nil)
schema.add("foo", "int", true)
// org.apache.spark.sql.types.StructType =
// StructType(StructField(foo,IntegerType,true))
or
schema.add("foo", "integer", true)
// org.apache.spark.sql.types.StructType =
// StructType(StructField(foo,IntegerType,true))
If you want to pass IntegerType, it has to be a DataType, not a String:
schema.add("foo", IntegerType, true)
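One more thing to watch for in the snippet from the question: StructType is immutable, so add returns a new StructType rather than modifying the one it is called on, and the result has to be assigned to something. A minimal sketch, where the hand-built dfStruct is just a stand-in for parquetDf.select("columnname").schema and the StringType column is an assumption:

```scala
import org.apache.spark.sql.types._

// Stand-in for parquetDf.select("columnname").schema (column type is assumed)
val dfStruct: StructType =
  StructType(Seq(StructField("columnname", StringType, nullable = true)))

// add does not change dfStruct; keep the StructType it returns
val extended: StructType = dfStruct.add("newField", IntegerType, true)

// dfStruct still has one field, extended has two
println(dfStruct.fieldNames.mkString(", "))
println(extended.fieldNames.mkString(", "))
```

Calling dfStruct.add(...) on its own line, as in the question, would compute the extended schema and then throw it away.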