Spark-xml on Databricks returns null values when reading tags ending with "/>"

I'm using the latest version of spark-xml (0.4.1) with Scala 2.11, and when I read XML containing tags that end with "/>", the corresponding values come back as null. See the example:

XML:

<Clients>
  <Client ID="1" name="teste1" age="10">
    <Operation ID="1" name="operation1">
    </Operation>
    <Operation ID="2" name="operation2">
    </Operation>
  </Client>
  <Client ID="2" name="teste2" age="20"/>
  <Client ID="3" name="teste3" age="30">
    <Operation ID="1" name="operation1">
    </Operation>
    <Operation ID="2" name="operation2">
    </Operation>
  </Client>
</Clients>

Dataframe:

+----+------+----+--------------------+
| _ID| _name|_age|           Operation|
+----+------+----+--------------------+
|   1|teste1|  10|[[1,operation1], ...|
|null|  null|null|                null|
+----+------+----+--------------------+

Code:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Read each <Client> element as one row, using an explicit schema.
Dataset<Row> clients = sparkSession.sqlContext().read()
        .format("com.databricks.spark.xml")
        .option("rowTag", "Client")
        .schema(getSchemaClient())
        .load(dirtorio);
clients.show(10);

// Schema for <Client>: the leading underscore marks XML attributes.
public StructType getSchemaClient() {
    return new StructType(new StructField[] {
            new StructField("_ID", DataTypes.StringType, true, Metadata.empty()),
            new StructField("_name", DataTypes.StringType, true, Metadata.empty()),
            new StructField("_age", DataTypes.StringType, true, Metadata.empty()),
            new StructField("Operation", DataTypes.createArrayType(this.getSchemaOperation()), true, Metadata.empty())
    });
}

// Schema for the nested <Operation> elements.
public StructType getSchemaOperation() {
    return new StructType(new StructField[] {
            new StructField("_ID", DataTypes.StringType, true, Metadata.empty()),
            new StructField("_name", DataTypes.StringType, true, Metadata.empty())
    });
}
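
One way to narrow this down (a minimal sketch, reusing the same sparkSession and dirtorio as above) is to let spark-xml infer the schema and compare the result with the explicit-schema read:

// Sanity check: let spark-xml infer the schema instead of supplying one.
// If the inferred frame also shows an all-null row for <Client ID="2" .../>,
// the problem is in the parser rather than in the supplied schema.
Dataset<Row> inferred = sparkSession.sqlContext().read()
        .format("com.databricks.spark.xml")
        .option("rowTag", "Client")
        .load(dirtorio);
inferred.printSchema();
inferred.show(10);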

Version 0.5.0 was just released, and it fixed problems with self-closing tags. It may resolve this. See https://github.com/databricks/spark-xml/pull/352
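
A minimal sketch of picking up the fix, assuming the usual artifact coordinates for the Scala 2.11 build; the reading code itself stays the same:

// Launch with the fixed release (coordinates assumed from the standard naming):
//   spark-submit --packages com.databricks:spark-xml_2.11:0.5.0 ...
// With 0.5.0, a self-closing row such as <Client ID="2" name="teste2" age="20"/>
// should populate _ID, _name and _age instead of coming back as all nulls,
// with Operation left null since the element has no children.
Dataset<Row> fixed = sparkSession.sqlContext().read()
        .format("com.databricks.spark.xml")
        .option("rowTag", "Client")
        .schema(getSchemaClient())
        .load(dirtorio);
fixed.show(10);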
