Parsing an RDD containing JSON data

I have a JSON file with the following data:

{"year":"2016","category":"physics","laureates":[{"id":"928","firstname":"David J.","surname":"Thouless","motivation":"\"for theoretical discoveries of topological phase transitions and topological phases of matter\"","share":"2"},{"id":"929","firstname":"F. Duncan M.","surname":"Haldane","motivation":"\"for theoretical discoveries of topological phase transitions and topological phases of matter\"","share":"4"},{"id":"930","firstname":"J. Michael","surname":"Kosterlitz","motivation":"\"for theoretical discoveries of topological phase transitions and topological phases of matter\"","share":"4"}]}
{"year":"2016","category":"chemistry","laureates":[{"id":"931","firstname":"Jean-Pierre","surname":"Sauvage","motivation":"\"for the design and synthesis of molecular machines\"","share":"3"},{"id":"932","firstname":"Sir J. Fraser","surname":"Stoddart","motivation":"\"for the design and synthesis of molecular machines\"","share":"3"},{"id":"933","firstname":"Bernard L.","surname":"Feringa","motivation":"\"for the design and synthesis of molecular machines\"","share":"3"}]}

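I need to return an RDD of key-value pairs, with the category as the key and the surnames of the Nobel laureates as the values. How can I do this using transformations?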

For the given dataset, the result should be:

"physics"-"Thouless","Haldane","Kosterlitz"
"chemistry"-"Sauvage","Stoddart","Feringa"

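Are you restricted to RDDs only? If you can use DataFrames, loading is very simple: you get a schema, explode the nested field, group by category, and then use an RDD for the rest. Here is one way to do it.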

Load the JSON into a DataFrame; you can also confirm the schema:

>>> nobelDF = spark.read.json('/user/cloudera/nobel.json')
>>> nobelDF.printSchema()
root
 |-- category: string (nullable = true)
 |-- laureates: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstname: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- motivation: string (nullable = true)
 |    |    |-- share: string (nullable = true)
 |    |    |-- surname: string (nullable = true)
 |-- year: string (nullable = true)

Now you can explode the nested array and then convert to an RDD, where you can do the grouping:

from pyspark.sql.functions import explode
nobelRDD = nobelDF.select('category', explode('laureates.surname')).rdd

Just as an FYI, the exploded DataFrame looks like this:

+---------+----------+
| category|       col|
+---------+----------+
|  physics|  Thouless|
|  physics|   Haldane|
|  physics|Kosterlitz|
|chemistry|   Sauvage|
|chemistry|  Stoddart|
|chemistry|   Feringa|
+---------+----------+
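To make the explode step concrete, here is a minimal plain-Python sketch of what it does to a single record: one output row per element of the nested array. The helper name `explode_surnames` and the trimmed-down sample record are hypothetical and only for illustration; this is not how Spark implements `explode`.

```python
import json

def explode_surnames(record):
    """Yield one (category, surname) pair per laureate in the record.

    This mirrors the effect of explode('laureates.surname'): the array of
    surnames is flattened into one row per element.
    """
    for laureate in record.get("laureates", []):
        yield (record["category"], laureate["surname"])

# Hypothetical sample: one line of the file, reduced to the fields used here.
record = json.loads(
    '{"category": "physics", "laureates": ['
    '{"surname": "Thouless"}, {"surname": "Haldane"}, {"surname": "Kosterlitz"}]}'
)
rows = list(explode_surnames(record))
for row in rows:
    print(row)
```

Each input record with three laureates turns into three rows that all carry the same category, which is what the table above shows for both categories.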

Now group by category:

from pyspark.sql.functions import collect_list, explode
nobelRDD = nobelDF.select('category', explode('laureates.surname')).groupBy('category').agg(collect_list('col').alias('sn')).rdd
nobelRDD.collect()

Now you have the RDD you need, although it still contains Row objects (I added line breaks so the full rows are visible):

>>> for n in nobelRDD.collect():
...     print(n)
...
Row(category=u'chemistry', sn=[u'Sauvage', u'Stoddart', u'Feringa'])
Row(category=u'physics', sn=[u'Thouless', u'Haldane', u'Kosterlitz'])

But a simple map is enough to turn these into tuples (I added a line break so the full rows are visible):

>>> nobelRDD.map(lambda x: (x[0],x[1])).collect()
[(u'chemistry', [u'Sauvage', u'Stoddart', u'Feringa']), 
 (u'physics', [u'Thouless', u'Haldane', u'Kosterlitz'])]
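If you are in fact restricted to RDDs, the same result can be sketched with plain transformations: parse each line with `json.loads`, emit (category, surname) pairs with `flatMap`, then group by key. The block below is a sketch under assumptions, not a tested Spark job. The helpers `category_surname_pairs` and `group_by_key` are hypothetical names, the sample lines are trimmed to the fields used, and `group_by_key` is only a local stand-in so the logic can be run without a cluster. The Spark calls in the comment assume an existing SparkContext named `sc`.

```python
import json
from collections import defaultdict

def category_surname_pairs(line):
    """Parse one JSON line and return its (category, surname) pairs."""
    record = json.loads(line)
    return [(record["category"], laureate["surname"])
            for laureate in record.get("laureates", [])]

def group_by_key(pairs):
    """Local stand-in for RDD.groupByKey().mapValues(list)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

# With Spark, assuming a SparkContext `sc`, the same function would plug in as:
#   pairs = sc.textFile('/user/cloudera/nobel.json').flatMap(category_surname_pairs)
#   result = pairs.groupByKey().mapValues(list)

# Hypothetical sample lines, reduced to the fields used here.
sample_lines = [
    '{"year":"2016","category":"physics","laureates":'
    '[{"surname":"Thouless"},{"surname":"Haldane"},{"surname":"Kosterlitz"}]}',
    '{"year":"2016","category":"chemistry","laureates":'
    '[{"surname":"Sauvage"},{"surname":"Stoddart"},{"surname":"Feringa"}]}',
]
local_pairs = [pair for line in sample_lines
               for pair in category_surname_pairs(line)]
result = group_by_key(local_pairs)
print(result)
```

Note that on a real cluster `groupByKey` does not guarantee the order of the values within each key, so the surname lists may come back in a different order than in the file.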
