Subtracting each value in a column by a value from a separate DF



As the title says, I want to subtract each value in a specific column by that column's mean.

Here is my attempt, which fails because test.select("avgX").collect() returns an Array[Row] that cannot be used inside a column expression:

val test = moviePairs.agg(avg(col("rating1")).alias("avgX"), avg(col("rating2")).alias("avgY"))

val subMean = moviePairs.withColumn("meanDeltaX", col("rating1") - test.select("avgX").collect())
  .withColumn("meanDeltaY", col("rating2") - test.select("avgY").collect())
subMean.show()

You can use Spark's DataFrame functions, or an SQL query against the DataFrame, to aggregate the averages of the columns you care about (rating1 and rating2).

import org.apache.spark.sql.functions.{avg, col}

val moviePairs = spark.createDataFrame(
  Seq(
    ("Moonlight", 7, 8),
    ("Lord Of The Drinks", 10, 1),
    ("The Disaster Artist", 3, 5),
    ("Airplane!", 7, 9),
    ("2001", 5, 1)
  )
).toDF("movie", "rating1", "rating2")
// find the means for each column and isolate the first (and only) row to get their values
val means = moviePairs.agg(avg("rating1"), avg("rating2")).head()
// alternatively, by using a simple SQL query:
// moviePairs.createOrReplaceTempView("movies")
// val means = spark.sql("select AVG(rating1), AVG(rating2) from movies").head()
val subMean = moviePairs.withColumn("meanDeltaX", col("rating1") - means.getDouble(0))
  .withColumn("meanDeltaY", col("rating2") - means.getDouble(1))
subMean.show()
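
As a side note, here is a minimal sketch of staying entirely in the DataFrame API, without collecting the means to the driver: cross-join the single-row aggregate back onto the original rows. The avgX/avgY aliases and the subMeanNoCollect name are placeholders, not from the question.

// Sketch, assuming the same spark session and imports as above:
// attach the single-row aggregate to every row via a cross join,
// then subtract the mean columns directly and drop them afterwards.
val meansDF = moviePairs.agg(avg("rating1").alias("avgX"), avg("rating2").alias("avgY"))
val subMeanNoCollect = moviePairs.crossJoin(meansDF)
  .withColumn("meanDeltaX", col("rating1") - col("avgX"))
  .withColumn("meanDeltaY", col("rating2") - col("avgY"))
  .drop("avgX", "avgY")
subMeanNoCollect.show()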

Output for the test input DataFrame moviePairs (with the usual double-precision loss; one way to manage it is sketched after the table):

+-------------------+-------+-------+-------------------+-------------------+
|              movie|rating1|rating2|         meanDeltaX|         meanDeltaY|
+-------------------+-------+-------+-------------------+-------------------+
|          Moonlight|      7|      8| 0.5999999999999996|                3.2|
| Lord Of The Drinks|     10|      1| 3.5999999999999996|               -3.8|
|The Disaster Artist|      3|      5|-3.4000000000000004|0.20000000000000018|
|          Airplane!|      7|      9| 0.5999999999999996|                4.2|
|               2001|      5|      1|-1.4000000000000004|               -3.8|
+-------------------+-------+-------+-------------------+-------------------+
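
If the double-precision noise bothers you, a minimal sketch of one way to manage it is to round the deltas with Spark's round function (the 2-decimal scale here is an arbitrary choice):

import org.apache.spark.sql.functions.round

// Round both delta columns to 2 decimal places for display purposes.
val rounded = subMean
  .withColumn("meanDeltaX", round(col("meanDeltaX"), 2))
  .withColumn("meanDeltaY", round(col("meanDeltaY"), 2))
rounded.show()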
