Is there any way to get each item in a list without using a for loop?
That is, given the data
+----+---------+-------------+----------+-----------+
|  id|     date|      revenue|con_dist_1| con_dist_2|
+----+---------+-------------+----------+-----------+
|3310|1/15/2018|  0.010680705|         6|0.019875458|
|3310|1/15/2018|  0.006628853|         4|0.816039063|
|3310|1/15/2018|   0.01378215|         4|0.082049528|
|3310|1/15/2018|  0.010680705|         6|0.019875458|
|3310|1/15/2018|  0.006628853|         4|0.816039063|
|3310|1/15/2018|   0.01378215|         4|0.082049528|
|3310|1/15/2018|  0.010680705|         6|0.019875458|
|3310|1/15/2018|  0.010680705|         6|0.019875458|
|3310|1/15/2018|  0.014933087|         5|0.034681906|
|3310|1/15/2018|  0.014448282|         3|0.082049528|
+----+---------+-------------+----------+-----------+
// Columns to summarise; each gets the 0th, 10th and 50th percentiles
val col_list = Array("con_dist_1", "con_dist_2")
val median_col_list = partitioned_data.stat.approxQuantile(col_list, Array(0.0, 0.1, 0.5), 0.0)
// Positions of the 0th and 10th percentiles in the probabilities array
val percentile_0 = 0
val percentile_10 = 1
val Q0 = median_col_list(col_list.indexOf("con_dist_1"))(percentile_0)
val Q10 = median_col_list(col_list.indexOf("con_dist_1"))(percentile_10)
Without looping over col_list, is there any way to compute percentile_0 and percentile_10 for every item in col_list? I mean in parallel, using map or something similar.
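I'll answer the question "How do I compute multiple (approximate) percentiles for multiple columns at once?"
According to the DataFrameStatFunctions documentation, the signature
approxQuantile(cols: Array[String], probabilities: Array[Double], relativeError: Double): Array[Array[Double]]
has only been available since 2.2.0.
If you are using an older version of Spark, this signature is not there, and the calculation will not be as easy.
Here is an example with your data, using Spark 2.4.0.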
val df = Seq(
  (3310, "1/15/2018", 0.010680705, 6, 0.019875458),
  (3310, "1/15/2018", 0.006628853, 4, 0.816039063),
  (3310, "1/15/2018", 0.01378215, 4, 0.082049528),
  (3310, "1/15/2018", 0.010680705, 6, 0.019875458),
  (3310, "1/15/2018", 0.006628853, 4, 0.816039063),
  (3310, "1/15/2018", 0.01378215, 4, 0.082049528),
  (3310, "1/15/2018", 0.010680705, 6, 0.019875458),
  (3310, "1/15/2018", 0.010680705, 6, 0.019875458),
  (3310, "1/15/2018", 0.014933087, 5, 0.034681906),
  (3310, "1/15/2018", 0.014448282, 3, 0.082049528)
).toDF("id", "date", "revenue", "con_dist_1", "con_dist_2")
df.stat.approxQuantile(Array("con_dist_1", "con_dist_2"), Array(0.1, 0.5), 0)
Output (the first dimension is the column and the second is the requested percentile, so, for example, the 10th percentile of con_dist_1 is 3.0):
Array[Array[Double]] = Array(Array(3.0, 4.0), Array(0.019875458, 0.034681906))
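
If you would rather look values up by column name and probability than track positions with indexOf, one option is to zip the column names and probabilities with the nested result array. Below is a minimal sketch in plain Scala: the result printed by approxQuantile in this example is hard-coded so that it runs without a Spark session, and QuantileLookup and label are names I made up for illustration, not part of the Spark API.

```scala
object QuantileLookup {
  // Column names and probabilities as passed to approxQuantile in the example.
  val cols: Array[String] = Array("con_dist_1", "con_dist_2")
  val probs: Array[Double] = Array(0.1, 0.5)

  // Assumption: result of approxQuantile, copied from the example output
  // instead of being computed here.
  val result: Array[Array[Double]] =
    Array(Array(3.0, 4.0), Array(0.019875458, 0.034681906))

  // Hypothetical helper: column name -> (probability -> quantile value).
  // The outer zip pairs each column with its row of quantiles,
  // the inner zip pairs each probability with its value.
  def label(
      cols: Array[String],
      probs: Array[Double],
      result: Array[Array[Double]]
  ): Map[String, Map[Double, Double]] =
    cols.zip(result).map { case (c, qs) => c -> probs.zip(qs).toMap }.toMap

  def main(args: Array[String]): Unit = {
    val byCol = label(cols, probs, result)
    println(byCol("con_dist_1")(0.1)) // 10th percentile of con_dist_1: 3.0
    println(byCol("con_dist_2")(0.5)) // median of con_dist_2: 0.034681906
  }
}
```

With a real DataFrame you would pass the array returned by df.stat.approxQuantile(cols, probs, relativeError) instead of the hard-coded result. Note that the lookup keys must be the same Double values that were used as probabilities.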