I am trying to use com.databricks.spark.xml:
Dataset<Row> df = spark.read().format("com.databricks.spark.xml")
.option("rowTag", "row").load("../1000.xml");
df.show(10);
The output I get is the following:
||
Am I missing something?
Here is a sample XML row:
<row Id="7" PostTypeId="2" ParentId="4" CreationDate="2008-07-31T22:17:57.883" Score="316" Body="<p>An explicit cast to double isn't necessary.</p>

<pre><code>double trans = (double)trackBar1.Value / 5000.0;
</code></pre>

<p>Identifying the constant as <code>5000.0</code> (or as <code>5000d</code>) is sufficient:</p>

<pre><code>double trans = trackBar1.Value / 5000.0;
double trans = trackBar1.Value / 5000d;
</code></pre>
" />
Thanks a lot.
Try putting an underscore (_) before the XML attribute names in your schema. If that does not work, try the @ symbol instead (the example below was written for an older Spark version). The problem may also be in your XML data itself. Try it first with this sample XML data:
<row id="7">
<author>Corets, Eva</author>
<title>Maeve Ascendant</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-11-17</publish_date>
<description>After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.</description>
</row>
Using your code example:
Dataset<Row> df = spark.read().format("com.databricks.spark.xml")
.option("rowTag", "row").load("../1000.xml");
Provide a custom schema (this follows the books.xml example from the Databricks documentation):
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType};
val sqlContext = new SQLContext(sc)
val customSchema = StructType(Array(
StructField("_id", StringType, nullable = true),
StructField("author", StringType, nullable = true),
StructField("description", StringType, nullable = true),
StructField("genre", StringType ,nullable = true),
StructField("price", DoubleType, nullable = true),
StructField("publish_date", StringType, nullable = true),
StructField("title", StringType, nullable = true)))
val df = sqlContext.read
.format("com.databricks.spark.xml")
.option("rowTag", "book")
.schema(customSchema)
.load("books.xml")
val selectedData = df.select("author", "_id")
selectedData.write
.format("com.databricks.spark.xml")
.option("rootTag", "books")
.option("rowTag", "book")
.save("newbooks.xml")
See the Databricks XML documentation.