I wrote this PySpark code:
from pyspark.sql.functions import date_format

result = (df
    .select('*', date_format('window_start', 'yyyy-MM-dd hh:mm').alias('time_window'))
    .groupby('time_window')
    .agg({'total_score': 'sum'}))
result.show()
I want to do the same thing with Scala and Spark. I tried the following, but I got an error I could not resolve, since I am new to Scala:
val result = df.select('*', date_format(df("time_window"), "yyyy-MM-dd hh:mm").alias("time_window"))
.groupBy("time_window")
.agg(sum("total_score"))
The error says:
overloaded method value select with alternatives:
  [U1, U2](c1: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U1], c2: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U2])org.apache.spark.sql.Dataset[(U1, U2)] <and>
  (col: String, cols: String*)org.apache.spark.sql.DataFrame <and>
  (cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
cannot be applied to (Char, org.apache.spark.sql.Column)
process.scala /process/src line 30 Scala Problem
How do I fix the source code so that it works in Scala?
The error occurs because single-quoted '*' is a Char literal in Scala, not a column reference, so it matches none of select's overloads. The following does the same thing as your PySpark code:
import org.apache.spark.sql.functions.{date_format, sum}
import spark.implicits._

// Sample data: a date string column and a value to aggregate
val data = spark.sparkContext.parallelize(Seq(
  ("2017-05-21", 1),
  ("2017-05-21", 1),
  ("2017-05-22", 1),
  ("2017-05-22", 1),
  ("2017-05-23", 1),
  ("2017-05-23", 1),
  ("2017-05-23", 1),
  ("2017-05-23", 1))).toDF("time_window", "foo")

// Format the date, group on the formatted value, and sum per group
data.withColumn("time_window", date_format(data("time_window"), "yyyy-MM-dd hh:mm"))
  .groupBy("time_window")
  .agg(sum("foo"))
  .show()
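
If you want to keep the shape of your original select call, here is a minimal sketch of the direct fix, assuming the source column is window_start as in your PySpark version (your Scala attempt referenced time_window instead). The key point is that col("*") selects every column, whereas single-quoted '*' is just a Char:

import org.apache.spark.sql.functions.{col, date_format, sum}

// col("*") expands to all columns; '*' (a Char) is what triggered the overload error
val result = df.select(col("*"), date_format(col("window_start"), "yyyy-MM-dd hh:mm").alias("time_window"))
  .groupBy("time_window")
  .agg(sum("total_score"))
result.show()

Also note that hh in the date pattern is the 12-hour clock; use HH if you want 24-hour times.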