基于Date的DataFrame行复制



我有一个具有以下结构和数据的Dataframe

源:

Column1(String), Column2(String), Date
-----------------------
1, 2, 01/01/2021
A, B, 05/01/2021 
M, N, 10/01/2021

我想将其转换为以下格式(前两列的值被复制,日期被增加,直到下一个日期,如下所示:

Column1(String), Column2(String), Date
-----------------------
1, 2, 01/01/2021
1, 2, 02/01/2021
1, 2, 03/01/2021
1, 2, 04/01/2021
A, B, 05/01/2021 
A, B, 06/01/2021 
A, B, 07/01/2021 
A, B, 08/01/2021
A, B, 09/01/2021
M, N, 10/01/2021

关于如何在scala spark中实现这一点有任何想法吗?

下面是工作解决方案:

val dfp1 = List(("1001", 11, "01/10/2021"), ("1002", 21,  "05/10/2021"), ("1001", 12, "10/10/2021"),  ("1002", 22,  "15/10/2021")).toDF("SerialNumber","SomeVal", "Date")
val dfProducts = dfp1.withColumn("Date", to_date($"Date","dd/MM/yyyy"))
dfProducts.show
+------------+-------+----------+
|SerialNumber|SomeVal|      Date|
+------------+-------+----------+
|        1001|     11|2021-10-01|
|        1002|     21|2021-10-05|
|        1001|     12|2021-10-10|
|        1002|     22|2021-10-15|
+------------+-------+----------+
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val overColumns = Window.partitionBy("SerialNumber").orderBy( "Date").rowsBetween(1, Window.unboundedFollowing)
val dfProduct1 = dfProducts.withColumn("NextSerialDate",first("Date", true).over(overColumns)).orderBy("Date")
dfProduct1.show
+------------+-------+----------+--------------+
|SerialNumber|SomeVal|      Date|NextSerialDate|
+------------+-------+----------+--------------+
|        1001|     11|2021-10-01|    2021-10-10|
|        1002|     21|2021-10-05|    2021-10-15|
|        1001|     12|2021-10-10|          null|
|        1002|     22|2021-10-15|          null|
+------------+-------+----------+--------------+
val dfProduct2= dfProduct1.withColumn("NextSerialDate",  when(col("NextSerialDate").isNull, col("Date")).otherwise(date_sub(col("NextSerialDate"), 1))).orderBy("SerialNumber")
dfProduct2.show
+------------+-------+----------+--------------+
|SerialNumber|SomeVal|      Date|NextSerialDate|
+------------+-------+----------+--------------+
|        1001|     11|2021-10-01|    2021-10-09|
|        1001|     12|2021-10-10|    2021-10-10|
|        1002|     21|2021-10-05|    2021-10-14|
|        1002|     22|2021-10-15|    2021-10-15|
+------------+-------+----------+--------------+

val dfProduct3= dfProduct2.withColumn("ExpandedDate",  explode_outer(sequence($"Date", $"NextSerialDate")))
dfProduct3.show
+------------+-------+----------+--------------+------------+
|SerialNumber|SomeVal|      Date|NextSerialDate|ExpandedDate|
+------------+-------+----------+--------------+------------+
|        1001|     11|2021-10-01|    2021-10-09|  2021-10-01|
|        1001|     11|2021-10-01|    2021-10-09|  2021-10-02|
|        1001|     11|2021-10-01|    2021-10-09|  2021-10-03|
|        1001|     11|2021-10-01|    2021-10-09|  2021-10-04|
|        1001|     11|2021-10-01|    2021-10-09|  2021-10-05|
|        1001|     11|2021-10-01|    2021-10-09|  2021-10-06|
|        1001|     11|2021-10-01|    2021-10-09|  2021-10-07|
|        1001|     11|2021-10-01|    2021-10-09|  2021-10-08|
|        1001|     11|2021-10-01|    2021-10-09|  2021-10-09|
|        1001|     12|2021-10-10|    2021-10-10|  2021-10-10|
|        1002|     21|2021-10-05|    2021-10-14|  2021-10-05|
|        1002|     21|2021-10-05|    2021-10-14|  2021-10-06|
|        1002|     21|2021-10-05|    2021-10-14|  2021-10-07|
|        1002|     21|2021-10-05|    2021-10-14|  2021-10-08|
|        1002|     21|2021-10-05|    2021-10-14|  2021-10-09|
|        1002|     21|2021-10-05|    2021-10-14|  2021-10-10|
|        1002|     21|2021-10-05|    2021-10-14|  2021-10-11|
|        1002|     21|2021-10-05|    2021-10-14|  2021-10-12|
|        1002|     21|2021-10-05|    2021-10-14|  2021-10-13|
|        1002|     21|2021-10-05|    2021-10-14|  2021-10-14|
+------------+-------+----------+--------------+------------+
only showing top 20 rows
val dfProduct4 = dfProduct3.drop("Date", "NextSerialDate").withColumn("Date",  col("ExpandedDate")).drop("ExpandedDate")
dfProduct4.show(50, false)
+------------+-------+----------+
|SerialNumber|SomeVal|Date      |
+------------+-------+----------+
|1001        |11     |2021-10-01|
|1001        |11     |2021-10-02|
|1001        |11     |2021-10-03|
|1001        |11     |2021-10-04|
|1001        |11     |2021-10-05|
|1001        |11     |2021-10-06|
|1001        |11     |2021-10-07|
|1001        |11     |2021-10-08|
|1001        |11     |2021-10-09|
|1001        |12     |2021-10-10|
|1002        |21     |2021-10-05|
|1002        |21     |2021-10-06|
|1002        |21     |2021-10-07|
|1002        |21     |2021-10-08|
|1002        |21     |2021-10-09|
|1002        |21     |2021-10-10|
|1002        |21     |2021-10-11|
|1002        |21     |2021-10-12|
|1002        |21     |2021-10-13|
|1002        |21     |2021-10-14|
|1002        |22     |2021-10-15|
+------------+-------+----------+

相关内容

  • 没有找到相关文章

最新更新