I have a DataFrame that I want to turn into JSON strings:
aaa | bbb  | ccc | ddd | eee
----------------------------
100 | xxxx | 123 | yyy | 2017
100 | yyyy | 345 | zzz | 2017
200 | rrrr | 500 | qqq | 2017
300 | uuuu | 200 | ttt | 2017
200 | iiii | 500 | ooo | 2017
and I want the result to be:
{100,[{xxxx:{123,yyy}},{yyyy:{345,zzz}}],2017}
{200,[{rrrr:{500,qqq}},{iiii:{500,ooo}}],2017}
{300,[{uuuu:{200,ttt}}],2017}
Please help.
This works:
val df = data
  .withColumn("cd", array('ccc, 'ddd)) // combine ccc and ddd into a single array column
  .withColumn("valuesMap", map('bbb, 'cd)) // map the bbb value to that array
  .withColumn("values", collect_list('valuesMap) // collect all mappings of the group
    .over(Window.partitionBy('aaa)))
  .withColumn("eee", first('eee) // eee is constant within a group, take the first value over the window
    .over(Window.partitionBy('aaa)))
  .select("aaa", "values", "eee") // keep only the columns asked for in the question
  .select(to_json(struct("aaa", "values", "eee")).as("value")) // build the JSON string
  .distinct() // the window yields one identical row per input row, so keep one per aaa group
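The window computes the same list for every row of a group, which is why the snippet ends by dropping duplicates. As a sketch of an equivalent approach (not from the original answer), you could aggregate with groupBy instead of a window and avoid the duplicates altogether:

// Alternative sketch, not from the original answer: aggregate per aaa group
// with groupBy instead of a window, so no duplicate rows are produced.
val grouped = data
  .withColumn("cd", array('ccc, 'ddd))                               // combine ccc and ddd into an array
  .withColumn("valuesMap", map('bbb, 'cd))                           // bbb -> [ccc, ddd]
  .groupBy('aaa)
  .agg(collect_list('valuesMap).as("values"), first('eee).as("eee")) // one row per aaa
  .select(to_json(struct("aaa", "values", "eee")).as("value"))       // build the JSON string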
Make sure you have these imports in scope:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import spark.implicits._ // needed for the 'col symbol and $"col" column syntax
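For a quick test, the table from the question can be rebuilt as a DataFrame like this; the SparkSession setup and the local[*] master are assumptions for a local run, not part of the original answer:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]") // assumption: run locally for testing
  .appName("dataframe-to-json")
  .getOrCreate()
import spark.implicits._

// rebuild the sample table from the question
val data = Seq(
  (100, "xxxx", 123, "yyy", 2017),
  (100, "yyyy", 345, "zzz", 2017),
  (200, "rrrr", 500, "qqq", 2017),
  (300, "uuuu", 200, "ttt", 2017),
  (200, "iiii", 500, "ooo", 2017)
).toDF("aaa", "bbb", "ccc", "ddd", "eee")

// run the `val df = data ...` snippet from the answer, then inspect the result:
// df.show(false) // one JSON string per aaa group in column "value"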
You can create a map whose values are constants defined with lit(), or take them from other columns of the DataFrame with $"col_name", for example:
val new_df = df.withColumn("map_feature", map(lit("key1"), lit("value1"), lit("key2"), $"col2"))
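As a small usage sketch (key1, key2 and col2 are the same hypothetical names as above), individual entries of such a map column can be read back with getItem:

// look values up by key in the map column created above
new_df.select(
  $"map_feature".getItem("key1").as("constant_value"), // always "value1"
  $"map_feature".getItem("key2").as("from_col2")       // whatever col2 holds in that row
).show(false)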