正如标题所述,我想用特定列的平均值减去该列的每个值。
这是我的代码尝试:
val test = moviePairs.agg(avg(col("rating1")).alias("avgX"), avg(col("rating2")).alias("avgY"))
val subMean = moviePairs.withColumn("meanDeltaX", col("rating1") - test.select("avgX").collect())
.withColumn("meanDeltaY", col("rating2") - test.select("avgY").collect())
subMean.show()
您可以使用Spark的DataFrame函数,也可以使用对DataFrame的SQL查询来聚合您关注的列(rating1
、rating2
(的平均值。
val moviePairs = spark.createDataFrame(
Seq(
("Moonlight", 7, 8),
("Lord Of The Drinks", 10, 1),
("The Disaster Artist", 3, 5),
("Airplane!", 7, 9),
("2001", 5, 1),
)
).toDF("movie", "rating1", "rating2")
// find the means for each column and isolate the first (and only) row to get their values
val means = moviePairs.agg(avg("rating1"), avg("rating2")).head()
// alternatively, by using a simple SQL query:
// moviePairs.createOrReplaceTempView("movies")
// val means = spark.sql("select AVG(rating1), AVG(rating2) from movies").head()
val subMean = moviePairs.withColumn("meanDeltaX", col("rating1") - means.getDouble(0))
.withColumn("meanDeltaY", col("rating2") - means.getDouble(1))
subMean.show()
测试输入DataFramemoviePairs
的输出(具有良好的双倍精度损失,您可以在此处进行管理(:
+-------------------+-------+-------+-------------------+-------------------+
| movie|rating1|rating2| meanDeltaX| meanDeltaY|
+-------------------+-------+-------+-------------------+-------------------+
| Moonlight| 7| 8| 0.5999999999999996| 3.2|
| Lord Of The Drinks| 10| 1| 3.5999999999999996| -3.8|
|The Disaster Artist| 3| 5|-3.4000000000000004|0.20000000000000018|
| Airplane!| 7| 9| 0.5999999999999996| 4.2|
| 2001| 5| 1|-1.4000000000000004| -3.8|
+-------------------+-------+-------+-------------------+-------------------+