我有RDD[Row]:
|---itemId----|----Country-------|---Type----------|
| 11 | US | Movie |
| 11 | US | TV |
| 101 | France | Movie |
如何做GroupBy itemId,以便我可以将结果保存为json列表,其中每一行都是单独的json对象(RDD中的每一行):
{"itemId" : 11,
"Country": {"US" :2 },"Type": {"Movie" :1 , "TV" : 1} },
{"itemId" : 101,
"Country": {"France" :1 },"Type": {"Movie" :1} }
抽样:
I tried:
import com.mapping.data.model.MappingUtils
import com.mapping.data.model.CountryInfo
val mappingPath = "s3://.../"
val input = sc.textFile(mappingPath)
输入是json列表,其中每行都是json,我使用MappingUtils映射到POJO类CountryInfo,它负责json解析和转换:
val MappingsList = input.map(x=> {
val countryInfo = MappingUtils.getCountryInfoString(x);
(countryInfo.getItemId(), countryInfo)
}).collectAsMap
MappingsList: scala.collection.Map[String,com.mapping.data.model.CountryInfo]
def showCountryInfo(x: Option[CountryInfo]) = x match {
case Some(s) => s
}
val events = sqlContext.sql( "select itemId EventList")
val itemList = events.map(row => {
val itemId = row.getAs[String](1);
val çountryInfo = showTitleInfo(MappingsList.get(itemId));
val country = if (countryInfo.getCountry() == 'unknown)' "US" else countryInfo.getCountry()
val type = countryInfo.getType()
Row(itemId, country, type)
})
谁能告诉我怎么才能做到这一点?
谢谢!
我没有多余的时间来完成这件事,但我可以给你一个开始。
这个想法是,您将RDD[Row]
聚合到一个表示JSON结构的Map中。聚合是需要两个函数形参的折叠:
-
seqOp
如何将元素集合折叠成目标类型 -
combOp
如何合并两个目标类型
在合并时combOp
中出现了棘手的部分,因为您需要累积seqOp
中看到的值的计数。我把这个留作练习,因为我要赶飞机!如果你有困难,希望别人能帮你填补空白。
case class Row(id: Int, country: String, tpe: String)
def foo: Unit = {
val rows: RDD[Row] = ???
def seqOp(acc: Map[Int, (Map[String, Int], Map[String, Int])], r: Row) = {
acc.get(r.id) match {
case None => acc.updated(r.id, (Map(r.country, 1), Map(r.tpe, 1)))
case Some((countries, types)) =>
val countries_ = countries.updated(r.country, countries.getOrElse(r.country, 0) + 1)
val types_ = types.updated(r.tpe, types.getOrElse(r.tpe, 0) + 1)
acc.updated(r.id, (countries_, types_))
}
}
val z = Map.empty[Int, (Map[String, Int], Map[String, Int])]
def combOp(l: Map[Int, (Map[String, Int], Map[String, Int])], r: Map[Int, (Map[String, Int], Map[String, Int])]) = {
l.foldLeft(z) { case (acc, (id, (countries, types))) =>
r.get(id) match {
case None => acc.updated(id, (countries, types))
case Some(otherCountries, otherTypes) =>
// todo - continue by merging countries with otherCountries
// and types with otherTypes, then update acc
}
}
}
val summaryMap = rows.aggregate(z) { seqOp, combOp }