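I'm using the new DataFrames API in Apache Spark 1.4.0 to extract information from Twitter's status JSON, focusing mainly on the entities object. The part relevant to this question is shown below: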
{
  ...
  ...
  "entities": {
    "hashtags": [],
    "trends": [],
    "urls": [],
    "user_mentions": [
      {
        "screen_name": "linobocchini",
        "name": "Lino Bocchini",
        "id": 187356243,
        "id_str": "187356243",
        "indices": [ 3, 16 ]
      },
      {
        "screen_name": "jeanwyllys_real",
        "name": "Jean Wyllys",
        "id": 111123176,
        "id_str": "111123176",
        "indices": [ 79, 95 ]
      }
    ],
    "symbols": []
  },
  ...
  ...
}
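There are several examples of how to extract information from primitive types such as string and integer, but I couldn't find anything on how to handle complex structures like this one.
I tried the code below, but it still doesn't work; it throws an exception: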
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val tweets = sqlContext.read.json("tweets.json")

// This function just filters out empty entities.user_mentions[] nodes;
// some tweets don't contain any mentions.
import org.apache.spark.sql.functions.udf
val isEmpty = udf((value: List[Any]) => value.isEmpty)

import org.apache.spark.sql._
import sqlContext.implicits._

case class UserMention(id: Long, idStr: String, indices: Array[Long], name: String, screenName: String)

val mentions = tweets.select("entities.user_mentions").
  filter(!isEmpty($"user_mentions")).
  explode($"user_mentions") {
    case Row(arr: Array[Row]) => arr.map { elem =>
      UserMention(
        elem.getAs[Long]("id"),
        elem.getAs[String]("id_str"),
        elem.getAs[Array[Long]]("indices"),
        elem.getAs[String]("name"),
        elem.getAs[String]("screen_name"))
    }
  }

mentions.first
The exception occurs when I call mentions.first:
scala> mentions.first
15/06/23 22:15:06 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 8)
scala.MatchError: [List([187356243,187356243,List(3, 16),Lino Bocchini,linobocchini], [111123176,111123176,List(79, 95),Jean Wyllys,jeanwyllys_real])] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
at $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:34)
at $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:34)
at scala.Function1$$anonfun$andThen$1.apply(Function1.scala:55)
at org.apache.spark.sql.catalyst.expressions.UserDefinedGenerator.eval(generators.scala:81)
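What is wrong here? I understand it has something to do with the types, but I haven't been able to figure it out yet.
As additional context, the automatically inferred schema is: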
scala> mentions.printSchema
root
|-- user_mentions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- id_str: string (nullable = true)
| | |-- indices: array (nullable = true)
| | | |-- element: long (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- screen_name: string (nullable = true)
Note 1: I know this can be solved with HiveQL, but since there is so much momentum around DataFrames, I would like to use them instead.
SELECT explode(entities.user_mentions) as mentions
FROM tweets
Note 2: the UDF val isEmpty = udf((value: List[Any]) => value.isEmpty) is an ugly hack and I'm probably missing something here, but it was the only way I found to avoid an NPE.
Here is a solution that works, with just one small hack.
The main idea is to work around the type problem by declaring a List[String] rather than a List[Row]:
val mentions = tweets.explode("entities.user_mentions", "mention"){m: List[String] => m}
This creates a second column called "mention", of type struct:
| entities| mention|
+--------------------+--------------------+
|[List(),List(),Li...|[187356243,187356...|
|[List(),List(),Li...|[111123176,111123...|
Now do a map() to extract the fields inside mention. The getStruct(1) call gets the value in column 1 of each row:
case class Mention(id: Long, id_str: String, indices: Seq[Int], name: String, screen_name: String)
val mentionsRdd = mentions.map { row =>
  val mention = row.getStruct(1)
  Mention(mention.getLong(0), mention.getString(1), mention.getSeq[Int](2), mention.getString(3), mention.getString(4))
}
Then convert the RDD back into a DataFrame:
val mentionsDf = mentionsRdd.toDF()
There you go!
| id| id_str| indices| name| screen_name|
+---------+---------+------------+-------------+---------------+
|187356243|187356243| List(3, 16)|Lino Bocchini| linobocchini|
|111123176|111123176|List(79, 95)| Jean Wyllys|jeanwyllys_real|
Try doing this instead:
case Row(arr: Seq[Row]) => arr.map { elem =>
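The MatchError in the question prints the column value as List(...), which suggests the array column reaches the function as a Scala Seq rather than a JVM Array. The plain-Scala sketch below illustrates why a type pattern on Array then fails while one on Seq succeeds. FakeRow, describeWithArray and describeWithSeq are hypothetical stand-ins made up for this illustration; they are not part of the Spark API, and the sketch does not run any Spark code.

```scala
object SeqVsArrayMatch {
  // Hypothetical stand-in for a Spark Row (an assumption for illustration,
  // not a Spark class): it just wraps the column values of one record.
  final case class FakeRow(values: Any*)

  // Mirrors the pattern from the question: matches only when the single
  // column holds a JVM array.
  def describeWithArray(row: FakeRow): String = row match {
    case FakeRow(arr: Array[_]) => s"array of ${arr.length}"
    case _                      => "no match"
  }

  // Mirrors the suggested fix: matches any Seq, which includes List.
  def describeWithSeq(row: FakeRow): String = row match {
    case FakeRow(seq: Seq[_]) => s"seq of ${seq.length}"
    case _                    => "no match"
  }

  def main(args: Array[String]): Unit = {
    // One column whose value is a List, like the value shown in the MatchError.
    val row = FakeRow(List("linobocchini", "jeanwyllys_real"))
    println(describeWithArray(row)) // prints "no match"
    println(describeWithSeq(row))   // prints "seq of 2"
  }
}
```

Without the fallback case, describeWithArray would throw a scala.MatchError on this input, which is the same kind of failure as in the question. A List is never an instance of Array, so the pattern has to name Seq (or a supertype of it) to match.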