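I'm using the latest version of spark-xml (0.4.1) with Scala 2.11. When I read XML that contains self-closing tags (tags ending in "/>"), the corresponding values come back null. See the example: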
XML:
<Clients>
<Client ID="1" name="teste1" age="10">
<Operation ID="1" name="operation1">
</Operation>
<Operation ID="2" name="operation2">
</Operation>
</Client>
<Client ID="2" name="teste2" age="20"/>
<Client ID="3" name="teste3" age="30">
<Operation ID="1" name="operation1">
</Operation>
<Operation ID="2" name="operation2">
</Operation>
</Client>
</Clients>
DataFrame:
+----+------+----+--------------------+
| _ID| _name|_age|           Operation|
+----+------+----+--------------------+
|   1|teste1|  10|[[1,operation1], ...|
|null|  null|null|                null|
+----+------+----+--------------------+
Code:
Dataset<Row> clients = sparkSession.sqlContext().read()
        .format("com.databricks.spark.xml")
        .option("rowTag", "Client")
        .schema(getSchemaClient())
        .load(dirtorio);
clients.show(10);

public StructType getSchemaClient() {
    return new StructType(
            new StructField[] {
                    new StructField("_ID", DataTypes.StringType, true, Metadata.empty()),
                    new StructField("_name", DataTypes.StringType, true, Metadata.empty()),
                    new StructField("_age", DataTypes.StringType, true, Metadata.empty()),
                    new StructField("Operation", DataTypes.createArrayType(this.getSchemaOperation()), true, Metadata.empty()) });
}

public StructType getSchemaOperation() {
    return new StructType(new StructField[] {
            new StructField("_ID", DataTypes.StringType, true, Metadata.empty()),
            new StructField("_name", DataTypes.StringType, true, Metadata.empty()),
    });
}
Version 0.5.0 was just released, and it fixes a problem with self-closing tags, so it may resolve this issue. See https://github.com/databricks/spark-xml/pull/352
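
To confirm what a correct reader should produce for this input, independent of spark-xml, one option is to parse the same XML with the JDK's built-in DOM parser. The sketch below is only a sanity check and not part of spark-xml: the class name SelfClosingCheck and the summarize helper are hypothetical, and the sample is trimmed to one Operation per Client for brevity.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class, used only to illustrate the expected result.
public class SelfClosingCheck {

    // Same shape as the XML in the question, trimmed to one Operation per Client.
    static final String XML =
          "<Clients>"
        + "<Client ID=\"1\" name=\"teste1\" age=\"10\">"
        + "<Operation ID=\"1\" name=\"operation1\"></Operation>"
        + "</Client>"
        + "<Client ID=\"2\" name=\"teste2\" age=\"20\"/>"
        + "<Client ID=\"3\" name=\"teste3\" age=\"30\">"
        + "<Operation ID=\"1\" name=\"operation1\"></Operation>"
        + "</Client>"
        + "</Clients>";

    // Returns one "ID,name,age,operationCount" string per <Client> element.
    static List<String> summarize(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList clients = doc.getElementsByTagName("Client");
            List<String> rows = new ArrayList<>();
            for (int i = 0; i < clients.getLength(); i++) {
                Element client = (Element) clients.item(i);
                // A self-closing element still carries its attributes; it just has no children.
                int operations = client.getElementsByTagName("Operation").getLength();
                rows.add(client.getAttribute("ID") + "," + client.getAttribute("name") + ","
                        + client.getAttribute("age") + "," + operations);
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalStateException("Failed to parse XML", e);
        }
    }

    public static void main(String[] args) {
        for (String row : summarize(XML)) {
            System.out.println(row);
        }
    }
}
```

The self-closing Client with ID 2 shows up here with all three attributes and zero Operation children, and Client 3 follows it. After upgrading, the DataFrame should likewise contain three rows, with an empty or null Operation column for Client 2 rather than a row of nulls.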