基于count遍历XML并使用spark scala创建ArrayString



通过遍历循环解析XML,并使用循环创建字符串数组

<USR_ORD><OrderResponse>
<OrderCount1>3</OrderCount1>
<OrderResult><orders>
<order>
<name>A</name><address>AAA</address><number>A1</number><status></status>
</order>
<order>
<name>B</name><number>B1</number>
</order>
<order>
<name>C</name><address>CCC</address><number>C1</number><status></status>
</order>
</orders></OrderResult>
</OrderResponse></USR_ORD>

我的代码如下

//creating list
val myList=List((100,1,"<USR_ORD><OrderResponse><OrderCount1>3</OrderCount1><OrderResult><orders><order><name>A</name><address>AAA</address><number>A1</number><status></status></order><order><name>B</name><number>B1</number></order><order><name>C</name><address>CCC</address><number>C1</number><status></status></order></orders></OrderResult></OrderResponse></USR_ORD>"))
//creating dataframe and temp table
val rdd = spark.sparkContext.parallelize(myList);
val DF1 = rdd.toDF("customer_id","response_id","response_output")
DF1.createOrReplaceTempView("ord_tbl");
spark.sql("""select * from ord_tbl""").show(10,false)
+-----------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|customer_id|response_id|response_output                                                                                                                                                                                                                                                                                                                                         |
+-----------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100        |1          |<USR_ORD><OrderResponse><OrderCount1>3</OrderCount1><OrderResult><orders><order><name>A</name><address>AAA</address><number>A1</number><status></status></order><order><name>B</name><number>B1</number></order><order><name>C</name><address>CCC</address><number>C1</number><status></status></order></orders></OrderResult></OrderResponse></USR_ORD>|
+-----------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

使用xpath标准函数

spark.sql("""select xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/orders/order/name/text()') as name
,xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/orders/order/address/text()') as address
, xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/orders/order/status/text()') as status from ord_tbl""").show(10,false)
+---------+----------+------+
|name     |address   |status|
+---------+----------+------+
|[A, B, C]|[AAA, CCC]|[]    |
+---------+----------+------+

但是期望的数据帧应该像下面

+---------+----------+------+
|name     |address   |status|
+---------+----------+------+
|[A, B, C]|[AAA,,CCC]|[,,]  |
+---------+----------+------+

我尝试使用循环作为参考,我确信这是错误的,没有编译

def (inorders:Int,inOrderCount:int,partxpathstring1:String,partxpathstring2:String,)
val orders=inorders
val OrderCount=inOrderCount
var i=0
var j=0
for (i <- 1 to orders){
for(j <- 1 to OrderCount){
fullxpath=xpath_string(response_output,'$partxpathstring1+[i]+$partxpathstring2+[j]')
fullxpath+=fullxpath
)
j+=1
}
i+1
}

Spark的xpath函数似乎过滤XML节点中的空值。您可能需要使用UDF来处理这个问题。下面是使用scala.xml.XML解析列response_output的示例:

val parse_orders = udf((response: String) => {
val xml = scala.xml.XML.loadString(response)
val orderCount = (xml \ "USR_ORD"  "OrderResponse"  "OrderCount1").text
val orders = xml \ "USR_ORD"  "OrderResponse"  "OrderResult"  "orders"  "order"
val orderInfo = Seq("name", "address", "number", "status").map { node =>
(node -> (orders  node).map(_.text))
}.toMap
(orderCount, orderInfo)
})

val df1 = df.withColumn("parsed", parse_orders(col("response_output")))
.select(
col("customer_id"),
col("response_id"),
col("parsed._1").as("orderCount"),
col("parsed._2.name").as("name"),
col("parsed._2.address").as("address"),
col("parsed._2.number").as("number"),
col("parsed._2.status").as("status")
)  
df1.show(false)
//+-----------+-----------+----------+---------+------------+------------+------+
//|customer_id|response_id|orderCount|name     |address     |number      |status|
//+-----------+-----------+----------+---------+------------+------------+------+
//|100        |1          |3         |[A, B, C]|[AAA, , CCC]|[A1, B1, C1]|[, , ]|
//+-----------+-----------+----------+---------+------------+------------+------+

相关内容

  • 没有找到相关文章

最新更新