I want to load an XML file as a string and then run some XPath operations on it.
The following works:
# Triple quotes are needed for a multi-line string that also contains an apostrophe
df = spark.createDataFrame([["""<?xml version="1.0" encoding="UTF-8"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>"""]], ['value'])
df.printSchema()
df=df.selectExpr("xpath(value,'note/to/text()')")
Now I am trying to put the XML in a file, load it as text, and run the same kind of operation on it:
xml_file="\path to the file,contents are exactly same as above example"
df=spark.read.option("wholetext", True).text(xml_file)
df=df.selectExpr('xpath(value,"note/to/text()")')
df.show()
Error : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 28.0 failed 4 times, most recent failure: Lost task 0.3 in stage 28.0 (TID 33) (10.191.197.4 executor 0): java.lang.RuntimeException: Error loading expression 'note/to/text()'
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 39; Premature end of file
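Can someone please help? The exact same operation fails when the XML is read from a file. I don't want to read the file as XML; due to project requirements, I have to load the whole XML as a string and then run XPath operations on it to extract specific tags.
Please advise.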
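The most likely reason for the "Premature end of file" error is that the XML spans multiple lines. When the file is read as text it gets split into one row per line, so each row holds only a fragment of the document and the XML parser cannot find where the tags start and end. The error position is consistent with this: the XML declaration on the first line is 38 characters long, and the parser reports the end of input at line 1, column 39. Try putting the file's contents on a single line and then apply xpath again.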
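The line-splitting explanation can be checked without Spark. The following is a minimal sketch that uses Python's standard `xml.etree.ElementTree` parser (not Spark's `xpath` function) to show that the first line of the document on its own does not parse, while the same document collapsed onto one line does. The helper names `collapse_xml_lines` and `parses` are hypothetical and only for illustration.

```python
import xml.etree.ElementTree as ET

# Same document as in the question, spread over several lines.
XML_TEXT = """<?xml version="1.0" encoding="UTF-8"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>"""


def collapse_xml_lines(text: str) -> str:
    """Join all lines into one (hypothetical helper).

    Note: stripping each line would alter text nodes that span
    several lines; that does not happen in this document.
    """
    return "".join(line.strip() for line in text.splitlines())


def parses(fragment: str) -> bool:
    """Return True if the fragment is a well-formed XML document."""
    try:
        ET.fromstring(fragment.encode("utf-8"))
        return True
    except ET.ParseError:
        return False


first_line = XML_TEXT.splitlines()[0]       # what a line-by-line read puts in row 1
single_line = collapse_xml_lines(XML_TEXT)  # the whole document on one line

print(len(first_line))        # length of the XML declaration
print(parses(first_line))     # the declaration alone has no root element
print(parses(single_line))    # the collapsed document is complete
print(ET.fromstring(single_line.encode("utf-8")).findtext("to"))
```

If editing the file is not an option, PySpark's `DataFrameReader.text` also accepts `wholetext` as a keyword argument, so `spark.read.text(xml_file, wholetext=True)` may be worth trying; that variant is not tested here.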