I have 2 dataframes: dataframe1 has 70,000 rows and looks like this:
location_id, location, flag
1,Canada,active
2,Paris,active
3,London,active
4,Berlin,active
The second df, lookup, holds the revised id for each location (this dataframe is modified from time to time), for example:
id,location
1,Canada
10,Paris
4,Berlin
3,London
My problem is that I need to pick up the new id from lookup as the location_id. Whenever location_id differs from id, the row with the old id should be kept for that location with its flag set to inactive (to preserve the historical data), and a row with the new id should be added with its flag set to active. The output table in Hive should therefore look like this:
location_id,location,flag
1,Canada,active
2,Paris,inactive
10,Paris,active
3,London,active
4,Berlin,active
I tried joining the two frames first, and then running my logic on the joined DF in order to save all the records into Hive. This is what I tried:
val joinedFrame = dataframe1.join(lookup, "location")
val df_temp = joinedFrame.withColumn("flag1", when($"location_id" === $"id", "active").otherwise("inactive"))
var count = 1
df_temp.foreach(x => {
  val flag1 = x.getAs[String]("flag1").toString
  val flag = x.getAs[String]("flag").toString
  val location_id = x.getAs[String]("location_id").toString
  val location = x.getAs[String]("location").toString
  val id = x.getAs[String]("id").toString
  if ((count != 1) && (flag1 != flag)) {
    println("------not equal-------", flag1, "-------", flag, "---------", id, "---------", location, "--------", location_id)
    val df_main = sc.parallelize(Seq((location_id, location, flag1), (id, location, flag))).toDF("location_id", "location", "flag")
    df_main.show
    df_main.write.insertInto("location_coords")
  }
  count += 1
})
It prints the location values that have a different id, but when saving those values as a dataframe I get an exception:
not equal------inactive------active---10---------Paris---------2
17/09/29 03:43:29 ERROR Executor: Exception in task 0.0 in stage 25.0 (TID 45)
java.lang.NullPointerException
at $line83.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:75)
at $line83.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:65)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
17/09/29 03:43:29 WARN TaskSetManager: Lost task 0.0 in stage 25.0 (TID 45, localhost, executor driver): java.lang.NullPointerException
Based on your comments, I think the easiest way is to use a join on the id instead. When you do an outer join, the missing columns will end up as null; those rows are the ones that were updated and that you are interested in. (As an aside, the NullPointerException in your attempt most likely comes from calling sc.parallelize and toDF inside foreach, which runs on the executors, where the SparkContext is not available.)
After that, all that is left is to update the location column in case it is null, as well as the flag column; see the code below (note that I changed the column names slightly):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val df = Seq((1,"Canada","active"), (2,"Paris","active"), (3,"London","active"), (4,"Berlin","active"))
  .toDF("id", "location", "flag")
val df2 = Seq((1,"Canada"), (10,"Paris"), (4,"Berlin"), (3,"London"))
  .toDF("id", "location_new")

val df3 = df.join(df2, Seq("id"), "outer")
  .filter($"location".isNull or $"location_new".isNull)
  .withColumn("location", when($"location_new".isNull, $"location").otherwise($"location_new"))
  .withColumn("flag", when($"location" === $"location_new", "active").otherwise("inactive"))
  .drop("location_new")
df3.show()
+---+--------+--------+
| id|location| flag|
+---+--------+--------+
| 10| Paris| active|
| 2| Paris|inactive|
+---+--------+--------+
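Note that in the flag expression, comparing location with a null location_new evaluates to null, which falls through to the otherwise branch, so the old Paris row correctly ends up inactive. As a small aside, the null handling on the location column can also be written with coalesce, which returns the first non-null of its arguments; this is a purely stylistic alternative to the when/otherwise version above:

import org.apache.spark.sql.functions.coalesce

// Same result as df3 above, with the when/otherwise on location
// replaced by coalesce: prefer location_new, fall back to location.
val df3Alt = df.join(df2, Seq("id"), "outer")
  .filter($"location".isNull or $"location_new".isNull)
  .withColumn("location", coalesce($"location_new", $"location"))
  .withColumn("flag", when($"location" === $"location_new", "active").otherwise("inactive"))
  .drop("location_new")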
After this, the new dataframe can be used to update the Hive table.
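To make that last step concrete, here is a minimal sketch of one way to do it, assuming df still represents the current contents of the table and that the location_coords table from the question can simply be rewritten: keep the rows whose id was not affected, add the changed rows from df3, and overwrite the table.

// Sketch only: df stands in for the current table contents.
// Keep the rows whose id was not touched by the update,
// then add the re-flagged and newly added rows from df3.
val unchanged = df.join(df3.select("id"), Seq("id"), "left_anti")
val result = unchanged.union(df3)
result.write.mode("overwrite").saveAsTable("location_coords")

If df were itself read from location_coords, Spark would refuse to overwrite a table it is reading from in the same job, so the result would first have to be materialized, for example into a temporary table, before the overwrite.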