I need to load data that mixes XML and plain-text fields into a Spark DataFrame. My data is in the following format:
1,2003,4349,<c><ab a="Roy" b="201"/><ab a="Joe" b="202"/></c>,54,M
I need the final output to look like this:
+-----+----+-------+----+---+------------+----------+------+------+
|Month|Year|pincode|name|id |manager_name|manager_id|dep_id|Gender|
+-----+----+-------+----+---+------------+----------+------+------+
|1    |2003|4349   |Roy |201|Joe         |202       |54    |M     |
+-----+----+-------+----+---+------------+----------+------+------+
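We can use the spark-xml library to get the required result: infer the schema of the XML column with schema_of_xml, parse the column with from_xml, then select the attributes of each ab element by position.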
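import org.apache.spark.sql.SparkSession
import com.databricks.spark.xml._
import com.databricks.spark.xml.functions.from_xml

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("ERROR")

// Read the CSV file. The path is a placeholder, and the column names are assumed
// from the expected output; the XML content column is assumed to be named xmldata.
val df = spark.read.csv("/path/to/data.csv")
  .toDF("Month", "Year", "pincode", "xmldata", "dep_id", "Gender")

// Infer the schema of the XML column from the data itself
val xmlSchema = schema_of_xml(df.select("xmldata").as[String])

df.withColumn("xmldata", from_xml($"xmldata", xmlSchema)) // parse the XML string into a struct
  .select("*", "xmldata.ab")                              // expose the array of <ab> elements
  // keep every non-XML column, then pick the attributes by position:
  // ab[0] is the employee, ab[1] is the manager
  .selectExpr(df.columns.diff(Array("xmldata")) ++
    Array("ab[0]._a as name", "ab[0]._b as id", "ab[1]._a as manager_name", "ab[1]._b as manager_id"): _*)
  .show(false)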
/*
+-----+----+-------+------+------+----+---+------------+----------+
|Month|Year|pincode|dep_id|Gender|name|id |manager_name|manager_id|
+-----+----+-------+------+------+----+---+------------+----------+
|1    |2003|4349   |54    |M     |Roy |201|Joe         |202       |
+-----+----+-------+------+------+----+---+------------+----------+
*/
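The selectExpr step depends on position: ab[0] is taken to be the employee and ab[1] the manager, and spark-xml exposes XML attributes with a leading underscore by default (hence _a and _b). To check that mapping on the sample row without starting Spark, here is a minimal plain-Scala sketch that uses only the JDK XML parser. It is not the spark-xml implementation; the object XmlFieldSketch and its helpers abAttributes and flatten are hypothetical names introduced for illustration, and the code assumes the XML fragment contains no commas, as in the sample.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element
import org.xml.sax.InputSource

// Hypothetical names throughout; none of this is spark-xml API.
object XmlFieldSketch {

  // Returns the (a, b) attribute pair of every <ab> element, in document order.
  def abAttributes(xml: String): Seq[(String, String)] = {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    val doc = builder.parse(new InputSource(new StringReader(xml)))
    val nodes = doc.getElementsByTagName("ab")
    (0 until nodes.getLength).map { i =>
      val ab = nodes.item(i).asInstanceOf[Element]
      (ab.getAttribute("a"), ab.getAttribute("b"))
    }
  }

  // Flattens one input line into nine values, in the column order of the output above.
  def flatten(line: String): Seq[String] = {
    // Assumption: the XML fragment contains no commas, so a plain split gives six fields.
    val fields = line.split(",")
    val people = abAttributes(fields(3))
    val (name, id) = people(0)               // corresponds to ab[0]
    val (managerName, managerId) = people(1) // corresponds to ab[1]
    Seq(fields(0), fields(1), fields(2), fields(4), fields(5),
      name, id, managerName, managerId)
  }

  def main(args: Array[String]): Unit = {
    val line = """1,2003,4349,<c><ab a="Roy" b="201"/><ab a="Joe" b="202"/></c>,54,M"""
    println(flatten(line).mkString("|"))
  }
}
```

If a row can contain fewer than two ab elements, people(1) throws here, whereas ab[1] in the Spark version yields null under the default (non-ANSI) settings, so the positional approach is only safe when every row has both elements in the same order.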