在 spark scala/python 中动态地将日期转换为时间戳 [不提及日期格式]



嗨,需要将原始日期转换为时间戳

数据

id,date,date1,date2,date3
1,161129,19960316,992503,20140205
2,961209,19950325,992206,20140503
3,110620,19960522,991610,20131302
4,160928,19930506,992205,20160112
5,021002,20000326,991503,20131112
6,160721,19960909,991212,20151511
7,160721,20150101,990809,20140809
8,100903,20151212,990605,20011803
9,070713,20170526,990702,19911010 

在这里,我有"日期","日期1","日期2"和"日期3"列,其中日期为字符串格式。 通常我使用 unix_timestamp("<col>","<formate>").cast("timestamp") 转换原始日期,但现在我不想要提及格式,我想要动态方法,因为以后可能会将更多列添加到我的表中。 在这种情况下,静态方法不会发挥最佳作用。

在某些列中,我们将有 6 个字符的日期,其中前 2 个字符代表"年",接下来的 4 个字符代表"日期"和"月份",即 yyddmm 或YYMMDD。

在其他一些列中,我们将有 8 个字符的日期,其中前 4 个字符代表"年",接下来的 4 个字符代表"日期"和"月份",即 yyyyddmm 或 yyyymmdd。

我们对每列都有相同的格式,需要动态查找并将其转换为时间戳,而无需硬编码。

输出应为时间戳。

+---+-------------------+-------------------+-------------------+-------------------+
| id|               date|              date1|              date2|              date3|
+---+-------------------+-------------------+-------------------+-------------------+
|  1|2016-11-29 00:00:00|1996-03-16 00:00:00|1999-03-25 00:00:00|2014-05-02 00:00:00|
|  2|1996-12-09 00:00:00|1995-03-25 00:00:00|1999-06-22 00:00:00|2014-03-05 00:00:00|
|  3|2011-06-20 00:00:00|1996-05-22 00:00:00|1999-10-16 00:00:00|2013-02-13 00:00:00|
|  4|2016-09-28 00:00:00|1993-05-06 00:00:00|1999-05-22 00:00:00|2016-12-01 00:00:00|
|  5|2002-10-02 00:00:00|2000-03-26 00:00:00|1999-03-15 00:00:00|2013-12-11 00:00:00|
|  6|2016-07-21 00:00:00|1996-09-09 00:00:00|1999-12-12 00:00:00|2015-11-15 00:00:00|
|  7|2016-07-21 00:00:00|2015-01-01 00:00:00|1999-09-08 00:00:00|2014-09-08 00:00:00|
|  8|2010-09-03 00:00:00|2015-12-12 00:00:00|1999-05-06 00:00:00|2001-03-18 00:00:00|
|  9|2007-07-13 00:00:00|2017-05-26 00:00:00|1999-02-07 00:00:00|1991-10-10 00:00:00|
+---+-------------------+-------------------+-------------------+-------------------+

这里有上述要求。给定UDF中的一些条件来查找每个日期列的格式。

def udf_1(x:String):
    if len(x)==6 and int(x[-2:]) > 12: return "yyMMdd"
    elif len(x)==8 and int(x[-2:]) > 12: return "yyyyMMdd"
    elif len((x))==6 and int(x[2:4]) <12 and int(x[-2:]) >12: return "yyMMdd"
    elif len((x))==8 and int(x[4:6]) <12 and int(x[-2:]) >12: return "yyyyMMdd"
    elif len((x))==6 and int(x[2:4]) >12 and int(x[-2:]) <12: return "yyddMM"
    elif len((x))==8 and int(x[4:6]) >12 and int(x[-2:]) <12: return "yyyyddMM"
    elif len((x))==6 and int(x[2:4]) <=12 and int(x[-2:]) <=12: return "N"
    elif len((x))==8 and int(x[4:6]) <=12 and int(x[-2:]) <=12: return "NA"
    else: return "null"
udf_2 = udf(udf_1, StringType())
c1 = c.withColumn("date_formate",udf_2("date"))
c2 = c1.withColumn("date1_formate",udf_2("date1"))
c3 = c2.withColumn("date2_formate",udf_2("date2"))
c4 = c3.withColumn("date3_formate",udf_2("date3"))
c4.show()

指定的条件下,我提取了一些行的格式,在日期和月份为 <= 12 的情况下,我为 6 个字符指定了"N",为 8 个字符给出了"NA"。

+------+--------+------+---------+---+------------+-------------+-------------+-------------+
|  date|   date1| date2|    date3| id|date_formate|date1_formate|date2_formate|date3_formate|
+------+--------+------+---------+---+------------+-------------+-------------+-------------+
|161129|19960316|992503| 20140205|  1|      yyMMdd|     yyyyMMdd|       yyddMM|           NA|
|961209|19950325|992206| 20140503|  2|           N|     yyyyMMdd|       yyddMM|           NA|
|110620|19960522|991610| 20131302|  3|      yyMMdd|     yyyyMMdd|       yyddMM|     yyyyddMM|
|160928|19930506|992205| 20160112|  4|      yyMMdd|           NA|       yyddMM|           NA|
|021002|20000326|991503| 20131112|  5|           N|     yyyyMMdd|       yyddMM|           NA|
|160421|19960909|991212| 20151511|  6|      yyMMdd|           NA|            N|     yyyyddMM|
|160721|20150101|990809| 20140809|  7|      yyMMdd|           NA|            N|           NA|
|100903|20151212|990605| 20011803|  8|           N|           NA|            N|     yyyyddMM|
|070713|20170526|990702|19911010 |  9|      yyMMdd|     yyyyMMdd|            N|     yyyyddMM|
+------+--------+------+---------+---+------------+-------------+-------------+-------------+

现在,我已经采用了提取的格式并将其存储在一个变量中,并在unix_timestamp中调用该变量以将原始日期转换为时间戳。

r1 = c4.where(c4.date_formate != ('NA' or 'N'))[['date_formate']].first().date_formate
t_s = unix_timestamp("date",r1).cast("timestamp")
c5=c4.withColumn("date",t_s)
r2 = c5.where(c5.date1_formate != ('NA' or 'N'))[['date1_formate']].first().date1_formate
t_s1 = unix_timestamp("date1",r2).cast("timestamp")
c6 = c5.withColumn("date1",t_s1)
r3 = c6.where(c6.date2_formate != ('NA' or 'N'))[['date2_formate']].first().date2_formate
t_s2 = unix_timestamp("date2",r3).cast("timestamp")
c7 = c6.withColumn("date2",t_s2)
r4 = c7.where(c7.date3_formate != ('NA' or 'N'))[['date3_formate']].first().date3_formate
t_s3 = unix_timestamp("date3",r4).cast("timestamp")
c8 = c7.withColumn("date3",t_s3)
c8.select("id","date","date1","date2","date3").show()

输出

+---+-------------------+-------------------+-------------------+-------------------+
| id|               date|              date1|              date2|              date3|
+---+-------------------+-------------------+-------------------+-------------------+
|  1|2016-11-29 00:00:00|1996-03-16 00:00:00|1999-03-25 00:00:00|2014-05-02 00:00:00|
|  2|1996-12-09 00:00:00|1995-03-25 00:00:00|1999-06-22 00:00:00|2014-03-05 00:00:00|
|  3|2011-06-20 00:00:00|1996-05-22 00:00:00|1999-10-16 00:00:00|2013-02-13 00:00:00|
|  4|2016-09-28 00:00:00|1993-05-06 00:00:00|1999-05-22 00:00:00|2016-12-01 00:00:00|
|  5|2002-10-02 00:00:00|2000-03-26 00:00:00|1999-03-15 00:00:00|2013-12-11 00:00:00|
|  6|2016-07-21 00:00:00|1996-09-09 00:00:00|1999-12-12 00:00:00|2015-11-15 00:00:00|
|  7|2016-07-21 00:00:00|2015-01-01 00:00:00|1999-09-08 00:00:00|2014-09-08 00:00:00|
|  8|2010-09-03 00:00:00|2015-12-12 00:00:00|1999-05-06 00:00:00|2001-03-18 00:00:00|
|  9|2007-07-13 00:00:00|2017-05-26 00:00:00|1999-02-07 00:00:00|1991-10-10 00:00:00|
+---+-------------------+-------------------+-------------------+-------------------+

最新更新