> I have a DataFrame like this:
+---------+-----+----+-----------------------+
|label |value|unit|dateTime |
+---------+-----+----+-----------------------+
|Uiqcnt|475 | |2020-04-11T21:35:13.410|
|Uiqcnt|475 | |2020-04-11T21:35:13.910|
|Uiqcnt|475 | |2020-04-11T21:35:14.400|
|Uiqcnt|476 | |2020-04-11T21:35:14.910|
|Uiqcnt|476 | |2020-04-11T21:35:15.400|
|Uiqcnt|476 | |2020-04-11T21:35:15.910|
|Uiqcnt|477 | |2020-04-11T21:35:16.410|
|Uiqcnt|477 | |2020-04-11T21:35:16.910|
|Uiqcnt|477 | |2020-04-11T21:35:17.420|
|Uiqcnt|478 | |2020-04-11T21:35:17.920|
|Uiqcnt|478 | |2020-04-11T21:35:18.430|
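I want to get the time difference for each partition by value. Given the large amount of data, how can I do this in the most efficient way?
You can group the dataset by value and compute the minimum and maximum dateTime for each group. After that, you can compute the difference between the minimum and the maximum. I assume that whole-second precision is acceptable for the result, so that unix_timestamp can be used.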
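from pyspark.sql import functions as F

# Whole-second span between the earliest and latest dateTime of each value.
# Assumes dateTime is a timestamp column, as the printed output further down suggests.
result = (df.groupBy("value")
    .agg(F.min("dateTime").alias("min"), F.max("dateTime").alias("max"))
    .withColumn("minUnix", F.unix_timestamp(F.col("min")))
    .withColumn("maxUnix", F.unix_timestamp(F.col("max")))
    .withColumn("diff", F.col("maxUnix") - F.col("minUnix"))
    .select("value", "diff"))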
If you also need the fractional part of the seconds, a udf can help:
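from pyspark.sql import functions as F
from pyspark.sql.types import FloatType
# The udf receives Python datetime objects; subtracting them gives a timedelta.
time_delta = F.udf(lambda start, end: (end - start).total_seconds(), FloatType())
(df.groupBy("value")
    .agg(F.min("dateTime").alias("min"), F.max("dateTime").alias("max"))
    .withColumn("diff", time_delta(F.col("min"), F.col("max")))
    .show(truncate=False))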
This prints:
+-----+----------------------+----------------------+----+
|value|min |max |diff|
+-----+----------------------+----------------------+----+
|476 |2020-04-11 21:35:14.91|2020-04-11 21:35:15.91|1.0 |
|477 |2020-04-11 21:35:16.41|2020-04-11 21:35:17.42|1.01|
|478 |2020-04-11 21:35:17.92|2020-04-11 21:35:18.43|0.51|
|475 |2020-04-11 21:35:13.41|2020-04-11 21:35:14.4 |0.99|
+-----+----------------------+----------------------+----+
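As a cross-check, the two variants can be compared on the sample rows from the question using only the Python standard library. This is a minimal sketch, not Spark code. The helper names exact_diff and whole_second_diff are made up for illustration, and whole_second_diff assumes that unix_timestamp drops the fractional part of each endpoint before the subtraction.

```python
from collections import defaultdict
from datetime import datetime

# Sample rows copied from the question: (value, dateTime)
rows = [
    (475, "2020-04-11T21:35:13.410"),
    (475, "2020-04-11T21:35:13.910"),
    (475, "2020-04-11T21:35:14.400"),
    (476, "2020-04-11T21:35:14.910"),
    (476, "2020-04-11T21:35:15.400"),
    (476, "2020-04-11T21:35:15.910"),
    (477, "2020-04-11T21:35:16.410"),
    (477, "2020-04-11T21:35:16.910"),
    (477, "2020-04-11T21:35:17.420"),
    (478, "2020-04-11T21:35:17.920"),
    (478, "2020-04-11T21:35:18.430"),
]

# Collect the timestamps of each value, mirroring groupBy("value")
groups = defaultdict(list)
for value, stamp in rows:
    groups[value].append(datetime.fromisoformat(stamp))


def exact_diff(stamps):
    """Seconds between earliest and latest stamp, fraction included."""
    return (max(stamps) - min(stamps)).total_seconds()


def whole_second_diff(stamps):
    """Same span after cutting each endpoint down to whole seconds
    (assumed to match what unix_timestamp does)."""
    earliest = min(stamps).replace(microsecond=0)
    latest = max(stamps).replace(microsecond=0)
    return int((latest - earliest).total_seconds())


exact = {value: exact_diff(stamps) for value, stamps in groups.items()}
truncated = {value: whole_second_diff(stamps) for value, stamps in groups.items()}

for value in sorted(groups):
    print(value, truncated[value], exact[value])
```

On this sample every group comes out as 1 second in the whole-second variant, while the exact differences are 0.99, 1.0, 1.01 and 0.51 seconds. For short intervals like these, the fractional variant is the one that tells the groups apart.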