PySpark problem when loading XML files with com.databricks:spark-xml



I'm trying to put together a small academic PoC using com.databricks:spark-xml with PySpark. The goal is to load the Stack Exchange Data Dump XML files (https://archive.org/details/stackexchange) into a PySpark DataFrame.

It works like a charm on well-formed XML with proper tags, but fails on the Stack Exchange dump, which looks like this:

<users>
<row Id="-1" Reputation="1" CreationDate="2014-07-30T18:05:25.020" DisplayName="Community" LastAccessDate="2014-07-30T18:05:25.020" Location="on the server farm" AboutMe=" I feel pretty, Oh, so pretty" Views="0" UpVotes="26" DownVotes="701" AccountId="-1" />
</users>

Depending on whether I use the root tag or the row tag as rowTag, I get either an empty schema or... this:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.xml').option("rowTag", "users").load('./tmp/test/Users.xml')
df.printSchema()
df.show()
root
|-- row: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- _AboutMe: string (nullable = true)
|    |    |-- _AccountId: long (nullable = true)
|    |    |-- _CreationDate: string (nullable = true)
|    |    |-- _DisplayName: string (nullable = true)
|    |    |-- _DownVotes: long (nullable = true)
|    |    |-- _Id: long (nullable = true)
|    |    |-- _LastAccessDate: string (nullable = true)
|    |    |-- _Location: string (nullable = true)
|    |    |-- _ProfileImageUrl: string (nullable = true)
|    |    |-- _Reputation: long (nullable = true)
|    |    |-- _UpVotes: long (nullable = true)
|    |    |-- _VALUE: string (nullable = true)
|    |    |-- _Views: long (nullable = true)
|    |    |-- _WebsiteUrl: string (nullable = true)
+--------------------+
|                 row|
+--------------------+
|[[Hi, I'm not ......|
+--------------------+
Spark: 1.6.0, Python: 2.7.15, com.databricks:spark-xml_2.10:0.4.1

Any suggestions would be greatly appreciated.

Best regards, P.

I tried the same approach a while ago (spark-xml on the Stack Overflow dump files) and failed... mostly because the DataFrame was treated as a collection of structs, so processing performance was very poor. Instead, I suggest using the standard text reader and a UDF that maps the Key="Value" pairs in each line, like this:

import re

pattern = re.compile(' ([A-Za-z]+)="([^"]*)"')
parse_line = lambda line: {key: value for key, value in pattern.findall(line)}
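As a sketch of how the parser above fits together (the sample line is taken from the question; the commented PySpark usage assumes the sc/sqlContext from the question and is untested):

```python
import re

# Attribute parser for Stack Exchange dump rows: each <row .../> line
# carries all its data as XML attributes, so a regex over Key="Value"
# pairs is enough to extract them.
pattern = re.compile(' ([A-Za-z]+)="([^"]*)"')
parse_line = lambda line: {key: value for key, value in pattern.findall(line)}

# Sample <row .../> line from Users.xml:
line = '<row Id="-1" Reputation="1" DisplayName="Community" UpVotes="26" />'
attrs = parse_line(line)
print(attrs)  # note: all values come back as strings

# In PySpark (sketch, assuming sc / sqlContext as in the question):
#   from pyspark.sql import Row
#   rows = sc.textFile('./tmp/test/Users.xml') \
#            .filter(lambda l: l.strip().startswith('<row')) \
#            .map(lambda l: Row(**parse_line(l)))
#   df = sqlContext.createDataFrame(rows)
```

Since every value is parsed as a string, numeric and timestamp columns still need an explicit cast afterwards, which is what the notebook linked below takes care of.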

You can also use my code to get the proper data types: https://github.com/szczeles/pyspark-notebooks/blob/master/stackoverflow/stackexchange-convert.ipynb (the schema matches the dump from March 2017).
