How to extract XML data from a CSV entry



I need to load data that mixes XML and plain-text content into a Spark DataFrame. My data has the following format.

1,2003,4349,<c><ab a="Roy" b="201"/><ab a="Joe" b="202"/></c>,54,M

I need the final output to look like this.

+-----+----+-------+----+---+------------+----------+------+------+
|Month|Year|pincode|name|id |manager_name|manager_id|dep_id|Gender|
+-----+----+-------+----+---+------------+----------+------+------+
|1    |2003|4349   |Roy |201|Joe         |202       |54    |M     |
+-----+----+-------+----+---+------------+----------+------+------+

We can get the desired result with the spark-xml library.

import org.apache.spark.sql.SparkSession
import com.databricks.spark.xml._                   // provides schema_of_xml
import com.databricks.spark.xml.functions.from_xml

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("ERROR")

// Read the CSV file. The path is a placeholder; the column names are assumed from the desired output.
val df = spark.read
  .csv("path/to/data.csv")
  .toDF("Month", "Year", "pincode", "xmldata", "dep_id", "Gender")

// Assuming your XML content column is named xmldata, infer its schema from the data
val xmlSchema = schema_of_xml(df.select("xmldata").as[String])

df.withColumn("xmldata", from_xml($"xmldata", xmlSchema)) // parse the XML string into a struct
  .select("*", "xmldata.ab")                              // expose the array of <ab> elements
  .selectExpr(df.columns.diff(Array("xmldata")) ++
    Array("ab[0]._a as name", "ab[0]._b as id", "ab[1]._a as manager_name", "ab[1]._b as manager_id"): _*)
  .show(false)
/*
+-----+----+-------+------+------+----+---+------------+----------+
|Month|Year|pincode|dep_id|Gender|name|id |manager_name|manager_id|
+-----+----+-------+------+------+----+---+------------+----------+
|1    |2003|4349   |54    |M     |Roy |201|Joe         |202       |
+-----+----+-------+------+------+----+---+------------+----------+
*/
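If adding the spark-xml dependency is not an option, a similar result can probably be reached with Spark SQL's built-in `xpath_string` function. This is only a sketch under assumptions, not part of the answer above: the DataFrame is built inline from the sample row instead of being read from a file, the column names are taken from the desired output, and every extracted value comes back as a string.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Sample row from the question, built inline (assumption: in practice this comes from the CSV file)
val df = Seq(
  ("1", "2003", "4349", """<c><ab a="Roy" b="201"/><ab a="Joe" b="202"/></c>""", "54", "M")
).toDF("Month", "Year", "pincode", "xmldata", "dep_id", "Gender")

// XPath positions are 1-based: ab[1] is the first <ab> element, ab[2] the second
val result = df.selectExpr(
  "Month", "Year", "pincode",
  "xpath_string(xmldata, '/c/ab[1]/@a') as name",
  "xpath_string(xmldata, '/c/ab[1]/@b') as id",
  "xpath_string(xmldata, '/c/ab[2]/@a') as manager_name",
  "xpath_string(xmldata, '/c/ab[2]/@b') as manager_id",
  "dep_id", "Gender"
)

result.show(false)
```

The trade-off is that the XPath expressions hard-code the element positions and no schema is inferred, so spark-xml remains the better fit when the XML structure varies between rows.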
