Can Spark Structured Streaming do event-time sessionization correctly?



I've been playing with Spark Structured Streaming and mapGroupsWithState (specifically following the StructuredSessionization example in the Spark source code). I want to confirm some limitations I believe mapGroupsWithState has, given my use case.

In my case, a session is an uninterrupted run of activity for a user, such that no two chronologically ordered (by event time, not processing time) events are separated by more than some developer-defined duration (30 minutes is common).

Before jumping into code, an example will help:

{"event_time": "2018-01-01T00:00:00", "user_id": "mike"}
{"event_time": "2018-01-01T00:01:00", "user_id": "mike"}
{"event_time": "2018-01-01T00:05:00", "user_id": "mike"}
{"event_time": "2018-01-01T00:45:00", "user_id": "mike"}

For the stream above, sessions are defined by a 30-minute period of inactivity. In a streaming context, we should end up with one session (the second one has not yet completed):

[
  {
    "user_id": "mike",
    "startTimestamp": "2018-01-01T00:00:00",
    "endTimestamp": "2018-01-01T00:05:00"
  }
]
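The session definition above can be sketched in plain Scala, with no Spark involved (the function name and shape are illustrative, not from the original code):

```scala
// A minimal, Spark-free sketch of the session definition: split
// chronologically ordered event times (in millis) into (start, end)
// sessions, opening a new session whenever the gap between consecutive
// events exceeds gapMs.
def sessionize(timestampsMs: Seq[Long], gapMs: Long): Seq[(Long, Long)] =
  timestampsMs.sorted.foldLeft(List.empty[(Long, Long)]) {
    case ((start, end) :: rest, t) if t - end <= gapMs =>
      (start, t) :: rest // within the gap: extend the current session
    case (acc, t) =>
      (t, t) :: acc // gap exceeded (or first event): start a new session
  }.reverse

// The four events above as minutes past midnight, with a 30-minute gap:
val minute = 60 * 1000L
val sessions = sessionize(Seq(0L, 1L, 5L, 45L).map(_ * minute), 30 * minute)
// one session covering 00:00-00:05, plus the start of a second at 00:45
```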

Now consider the following Spark driver program:

import java.sql.Timestamp

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

object StructuredSessionizationV2 {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local[2]")
      .appName("StructuredSessionizationRedux")
      .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    import spark.implicits._
    implicit val ctx = spark.sqlContext

    val input = MemoryStream[String]

    val EVENT_SCHEMA = new StructType()
      .add($"event_time".string)
      .add($"user_id".string)

    val events = input.toDS()
      .select(from_json($"value", EVENT_SCHEMA).alias("json"))
      .select($"json.*")
      .withColumn("event_time", to_timestamp($"event_time"))
      .withWatermark("event_time", "1 hours")
    events.printSchema()

    val sessionized = events
      .groupByKey(row => row.getAs[String]("user_id"))
      .mapGroupsWithState[SessionState, SessionOutput](GroupStateTimeout.EventTimeTimeout) {
        case (userId: String, events: Iterator[Row], state: GroupState[SessionState]) =>
          println(s"state update for user ${userId} (current watermark: ${new Timestamp(state.getCurrentWatermarkMs())})")
          if (state.hasTimedOut) {
            println(s"User ${userId} has timed out, sending final output.")
            val finalOutput = SessionOutput(
              userId = userId,
              startTimestampMs = state.get.startTimestampMs,
              endTimestampMs = state.get.endTimestampMs,
              durationMs = state.get.durationMs,
              expired = true
            )
            // Drop this user's state
            state.remove()
            finalOutput
          } else {
            val timestamps = events.map(_.getAs[Timestamp]("event_time").getTime).toSeq
            println(s"User ${userId} has new events (min: ${new Timestamp(timestamps.min)}, max: ${new Timestamp(timestamps.max)}).")
            val newState = if (state.exists) {
              println(s"User ${userId} has existing state.")
              val oldState = state.get
              SessionState(
                startTimestampMs = math.min(oldState.startTimestampMs, timestamps.min),
                endTimestampMs = math.max(oldState.endTimestampMs, timestamps.max)
              )
            } else {
              println(s"User ${userId} has no existing state.")
              SessionState(
                startTimestampMs = timestamps.min,
                endTimestampMs = timestamps.max
              )
            }
            state.update(newState)
            state.setTimeoutTimestamp(newState.endTimestampMs, "30 minutes")
            println(s"User ${userId} state updated. Timeout now set to ${new Timestamp(newState.endTimestampMs + (30 * 60 * 1000))}")
            SessionOutput(
              userId = userId,
              startTimestampMs = state.get.startTimestampMs,
              endTimestampMs = state.get.endTimestampMs,
              durationMs = state.get.durationMs,
              expired = false
            )
          }
      }

    val eventsQuery = sessionized
      .writeStream
      .queryName("events")
      .outputMode("update")
      .format("console")
      .start()
    input.addData(
      """{"event_time": "2018-01-01T00:00:00", "user_id": "mike"}""",
      """{"event_time": "2018-01-01T00:01:00", "user_id": "mike"}""",
      """{"event_time": "2018-01-01T00:05:00", "user_id": "mike"}"""
    )
    input.addData(
      """{"event_time": "2018-01-01T00:45:00", "user_id": "mike"}"""
    )
    eventsQuery.processAllAvailable()
  }

  case class SessionState(startTimestampMs: Long, endTimestampMs: Long) {
    def durationMs: Long = endTimestampMs - startTimestampMs
  }

  case class SessionOutput(userId: String, startTimestampMs: Long, endTimestampMs: Long, durationMs: Long, expired: Boolean)
}

The output of this program is:

root
|-- event_time: timestamp (nullable = true)
|-- user_id: string (nullable = true)
state update for user mike (current watermark: 1969-12-31 19:00:00.0)
User mike has new events (min: 2018-01-01 00:00:00.0, max: 2018-01-01 00:05:00.0).
User mike has no existing state.
User mike state updated. Timeout now set to 2018-01-01 00:35:00.0
-------------------------------------------
Batch: 0
-------------------------------------------
+------+----------------+--------------+----------+-------+
|userId|startTimestampMs|endTimestampMs|durationMs|expired|
+------+----------------+--------------+----------+-------+
|  mike|   1514782800000| 1514783100000|    300000|  false|
+------+----------------+--------------+----------+-------+
state update for user mike (current watermark: 2017-12-31 23:05:00.0)
User mike has new events (min: 2018-01-01 00:45:00.0, max: 2018-01-01 00:45:00.0).
User mike has existing state.
User mike state updated. Timeout now set to 2018-01-01 01:15:00.0
-------------------------------------------
Batch: 1
-------------------------------------------
+------+----------------+--------------+----------+-------+
|userId|startTimestampMs|endTimestampMs|durationMs|expired|
+------+----------------+--------------+----------+-------+
|  mike|   1514782800000| 1514785500000|   2700000|  false|
+------+----------------+--------------+----------+-------+

Per my session definition, the single event in the second batch should trigger expiry of the session state and hence a new session. However, since the watermark (2017-12-31 23:05:00.0) has not passed the state's timeout (2018-01-01 00:35:00.0), the state is not expired, and the event is erroneously added to the existing session despite more than 30 minutes having elapsed since the latest timestamp in the previous batch.
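The watermark arithmetic behind this can be checked directly (a sketch; the 1-hour delay comes from the `withWatermark("event_time", "1 hours")` call in the driver above, and the helper name is illustrative):

```scala
import java.sql.Timestamp

// Spark advances the event-time watermark to (max event time seen so far)
// minus the declared delay -- "1 hours" in the driver above.
def watermarkMs(maxEventTimeMs: Long, delayMs: Long): Long =
  maxEventTimeMs - delayMs

val delayMs = 60 * 60 * 1000L
val maxBatch0 = Timestamp.valueOf("2018-01-01 00:05:00").getTime

// After batch 0 the watermark sits at 2017-12-31 23:05:00 -- 90 minutes
// short of the 2018-01-01 00:35:00 timeout -- so the state cannot have
// timed out by the time batch 1 arrives.
val wm = new Timestamp(watermarkMs(maxBatch0, delayMs))
```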

As far as I can tell, the only way session state expiry works as intended is if enough events from other users arrive within a batch to advance the watermark past the timeout for mike's state.

I suppose one could also mess with the stream's watermark, but I can't think of how I'd do that to accomplish my use case.

Is this accurate? Am I missing anything about how to properly do event-time-based sessionization in Spark?

The implementation you've provided does not appear to work if the watermark delay is greater than the session gap duration.

For the logic as shown to work, you'd need to set the watermark delay to < 30 minutes.

If you really want the watermark delay to be independent of (or greater than) the session gap duration, you need to wait until the watermark passes (the session's end timestamp + gap) before expiring the state. The merging logic here appears to blindly merge windows; it should take the gap duration into account before merging.
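A sketch of that corrected expiry condition (the function name is an assumption for illustration, not Spark API):

```scala
// A key's session may only be expired once the watermark has passed the
// session's end timestamp plus the gap -- independent of how the watermark
// delay compares to the gap duration.
def canExpire(watermarkMs: Long, sessionEndMs: Long, gapMs: Long): Boolean =
  watermarkMs >= sessionEndMs + gapMs

// With the question's numbers (gap = 30 min, session ending at 00:05):
// a watermark of 2017-12-31 23:05 must not expire it, while any watermark
// at or past 2018-01-01 00:35 may.
```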

Edit: I guess I should answer the specific original question rather than just provide a complete solution.

To add to Arun's answer: the state function of map/flatMapGroupsWithState is invoked first with the events, and then with the timed-out state. Given how this works, your code resets the timeout in the same batch in which the state should have timed out.

So while you can leverage the timeout feature to have the state function invoked even when no events arrive for a key, you still need to handle the current watermark manually. That's why I set the timeout to the session end timestamp of the earliest session, and handle all evictions after the invocation.
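That bookkeeping might look roughly like this (an illustrative sketch with assumed names; the real code is in the gist linked at the end of this answer):

```scala
// Keep every open session for a key in state; on each invocation, evict
// the sessions the watermark has passed (end + gap), then re-register the
// timeout at the earliest remaining session end.
case class Session(startMs: Long, endMs: Long)

// Returns (sessions to emit as final output, sessions that stay in state).
def evict(open: List[Session], watermarkMs: Long, gapMs: Long): (List[Session], List[Session]) =
  open.partition(s => s.endMs + gapMs <= watermarkMs)

// Timeout for the next invocation: the earliest session end still open.
def nextTimeoutMs(stillOpen: List[Session]): Option[Long] =
  if (stillOpen.isEmpty) None else Some(stillOpen.map(_.endMs).min)
```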

——

You can refer to the code linked below for how to implement a session window with event time and watermark via flatMapGroupsWithState.

Note: I haven't cleaned up the code, and I tried to support both output modes, so once you've decided on an output mode you can remove the irrelevant code to simplify it.

Edit 2: I had a wrong assumption about flatMapGroupsWithState — events are not guaranteed to be sorted.

Just updated the code: https://gist.github.com/HeartSaVioR/9a3aeeef0f1d8ee97516743308b14cd6#file-eventtimesessionwindowimplementationviaflatmapgroupswithstate-scala-L32-L189

As of Spark 3.2.0, Spark natively supports session windows.

https://databricks.com/blog/2021/10/12/native-support-of-session-window-in-spark-structured-streaming.html
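With native support, the hand-rolled state management above reduces to a groupBy on `session_window` (a fragment, not a runnable program; it assumes an `events` streaming DataFrame with `event_time` and `user_id` columns as in the question):

```scala
import org.apache.spark.sql.functions.session_window

// Native session windows (Spark 3.2.0+): one 30-minute-gap session window
// per user_id, finalized by the event-time watermark.
val sessionized = events
  .withWatermark("event_time", "1 hours")
  .groupBy(session_window($"event_time", "30 minutes"), $"user_id")
  .count()
```

Each output row carries a `session_window` struct with the session's `start` and `end`, which replaces the manual startTimestampMs/endTimestampMs bookkeeping.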
