Flink SQL: Outer join with GROUP BY gives unexpected output



I have two Flink dynamic tables: Event and Configuration.

Event has the structure [id, myTimestamp]. Configuration has the structure [id, myValue, myTimestamp].

I am trying to write a Flink SQL query that returns Event.id, Configuration.myValue, or Event.id, null if the Event id does not match any id in Configuration.

Example of the expected behavior (Event and Configuration both start empty):

Each line of the example reads as:

[DATA_RECEIVED] => TARGET_TABLE : EXPECTED_OUTPUT

Because the SQL query is built from a join, its result is written to an UpsertSink (the first value of each output tuple is the upsert boolean).

[myId-1, 10]            => EventTable           : [(true, myId-1, null)]
[myId-1, myValue-A, 15] => ConfigurationTable   : [(false, myId-1, null), (true, myId-1, myValue-A)]
[myId-1, myValue-A, 20] => ConfigurationTable   : [(false, myId-1, myValue-A), (true, myId-1, myValue-A)]
[myId-1, myValue-B, 25] => ConfigurationTable   : [(false, myId-1, myValue-A), (true, myId-1, myValue-B)]
[myId-1, 30]            => EventTable           : [(false, myId-1, null), (true, myId-1, myValue-B)]

So I wrote this query:

SELECT
    Event.id,
    Configuration.myValue
FROM
    (SELECT id, MAX(myTimestamp) as myTimestamp FROM Event GROUP BY id) as Event
LEFT JOIN
    (SELECT id, LATEST_VAL(myValue, myTimestamp) as myValue, MAX(myTimestamp) as myTimestamp FROM Configuration GROUP BY id, myValue) as Configuration
ON Event.id = Configuration.id
GROUP BY Event.id, Configuration.myValue

where LATEST_VAL is a UDF that returns the myValue associated with MAX(myTimestamp).

But I get behavior that I do not understand. Here is the observed result:

[myId-1, 10]            => EventTable           : [(true, myId-1, null)] // OK
[myId-1, myValue-A, 15] => ConfigurationTable   : [(false, myId-1, null), (true, myId-1, myValue-A)] // OK
[myId-1, myValue-A, 20] => ConfigurationTable   : [(false, myId-1, myValue-A), (true, myId-1, null), (false, myId-1, null), (true, myId-1, myValue-A)] // NOT OK
[myId-1, myValue-B, 25] => ConfigurationTable   : [(false, myId-1, myValue-A), (true, myId-1, null), (false, myId-1, null), (true, myId-1, myValue-B)] // NOT OK
[myId-1, 30]            => EventTable           : [(false, myId-1, null), (true, myId-1, myValue-B)] // OK

How do you explain the difference between the expected and the observed behavior? Why is there the extra output (true, myId-1, null), (false, myId-1, null)?

Is it possible to adjust the SQL query to get the desired behavior?

Note:

  • I am using Flink 1.8

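I think the point you are missing is that you are actually joining two retract streams. Even though your input streams are append-only, the subqueries aggregate them, and those aggregations produce retractions: every time a group's result changes, the old row is retracted and the new one is emitted.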

Let's first analyze the results of the subqueries:

Subquery 1:

Query: SELECT id, MAX(myTimestamp) as myTimestamp FROM Event GROUP BY id
Resulting stream:
(true, myId-1, 10L)
(false, myId-1, 10L)
(true, myId-1, 30L)
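
To make the retraction mechanism concrete, here is a minimal Python sketch of how a grouped MAX over an append-only input turns into a retract stream. It is only a simulation of the semantics, not Flink code; the function name `grouped_max` and the tuple layout are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# (is_add, id, max_timestamp): True means "add", False means "retract".
Change = Tuple[bool, str, int]


def grouped_max(rows: Iterable[Tuple[str, int]]) -> List[Change]:
    """Simulate SELECT id, MAX(ts) ... GROUP BY id as a retract changelog."""
    current: Dict[str, int] = {}  # id -> MAX(ts) seen so far
    out: List[Change] = []
    for key, ts in rows:
        old = current.get(key)
        if old is None:
            # First row of the group: nothing to retract yet.
            current[key] = ts
            out.append((True, key, ts))
        elif ts > old:
            # The aggregate changed: retract the old result, emit the new one.
            out.append((False, key, old))
            out.append((True, key, ts))
            current[key] = ts
        # If the maximum is unchanged, nothing is emitted.
    return out


# The two Event rows from the question (append-only input).
events = [("myId-1", 10), ("myId-1", 30)]
changelog = grouped_max(events)
for change in changelog:
    print(change)
```

Run on the two Event rows from the question, this yields the same three messages as the stream listed for subquery 1.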

Subquery 2:

Query: SELECT id, LATEST_VAL(myValue, myTimestamp) as myValue, MAX(myTimestamp) as myTimestamp FROM Configuration GROUP BY id, myValue
Resulting stream:
(true, "myId-1", "myValue-A", 15L)
(false, "myId-1", "myValue-A", 15L)
(true, "myId-1", "myValue-A", 20L)
(false, "myId-1", "myValue-A", 20L)
(true, "myId-1", "myValue-B", 25L)

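After that, the join and the outer GROUP BY run on top of these two retract streams. With that in mind, what is actually joined and grouped in your example is: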

[true, myId-1, 10]             : [(true, myId-1, null)]
[true, myId-1, myValue-A, 15]  : [(false, myId-1, null), (true, myId-1, myValue-A)]
[false, myId-1, myValue-A, 15] : [(false, myId-1, myValue-A), (true, myId-1, null)]
[true, myId-1, myValue-A, 20]  : [(false, myId-1, null), (true, myId-1, myValue-A)]
[false, myId-1, myValue-A, 20] : [(false, myId-1, myValue-A), (true, myId-1, null)]
[true, myId-1, myValue-B, 25]  : [(false, myId-1, null), (true, myId-1, myValue-B)]
...
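
The trace above can be reproduced with a small Python simulation of a left outer join that consumes retract messages. This is a sketch under simplifying assumptions, not how Flink implements the operator: the class `LeftJoinSketch` is hypothetical, timestamps are dropped because the outer query only selects id and myValue, and the outer GROUP BY is omitted on the assumption that it simply forwards each change in this trace.

```python
from typing import Dict, List, Optional, Set, Tuple

# (is_add, event_id, my_value); my_value is None for an unmatched Event row.
Out = Tuple[bool, str, Optional[str]]


class LeftJoinSketch:
    """Toy left outer join over retract messages (illustration only)."""

    def __init__(self) -> None:
        self.left: Set[str] = set()             # ids present on the Event side
        self.right: Dict[str, List[str]] = {}   # id -> currently matching myValues

    def on_left_add(self, key: str) -> List[Out]:
        self.left.add(key)
        matches = self.right.get(key, [])
        if not matches:
            return [(True, key, None)]          # no match yet: pad with null
        return [(True, key, v) for v in matches]

    def on_right(self, is_add: bool, key: str, value: str) -> List[Out]:
        matches = self.right.setdefault(key, [])
        out: List[Out] = []
        if is_add:
            if key in self.left:
                if not matches:
                    out.append((False, key, None))  # null-padded row is obsolete
                out.append((True, key, value))
            matches.append(value)
        else:
            matches.remove(value)
            if key in self.left:
                out.append((False, key, value))
                if not matches:
                    out.append((True, key, None))   # back to the null-padded row
        return out


join = LeftJoinSketch()
trace = [join.on_left_add("myId-1")]
# Messages of subquery 2, in the order listed above (timestamps dropped).
for msg in [
    (True, "myId-1", "myValue-A"),
    (False, "myId-1", "myValue-A"),
    (True, "myId-1", "myValue-A"),
    (False, "myId-1", "myValue-A"),
    (True, "myId-1", "myValue-B"),
]:
    trace.append(join.on_right(*msg))

for step in trace:
    print(step)
```

A single Configuration row such as [myId-1, myValue-A, 20] reaches the join as two messages, a retraction of the old aggregate followed by the new one. Concatenating the outputs of those two steps (`trace[2] + trace[3]`) gives exactly the four messages observed in the question, including the intermediate (true, myId-1, null), (false, myId-1, null) pair.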

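All in all, as far as I can tell, the query produces correct results. For each input row, the last emitted row represents the latest value for the given id; the extra messages are only intermediate retractions.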
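
One way to check this is to apply both changelogs from the question to an empty table and compare the final contents. The sketch below does that in plain Python; the helper `materialize` is hypothetical and treats every message as an add or a retraction of a full row, which is an assumption about how the sink interprets the boolean flag.

```python
from collections import Counter
from typing import Iterable, Optional, Set, Tuple

Change = Tuple[bool, str, Optional[str]]  # (is_add, id, myValue)


def materialize(changes: Iterable[Change]) -> Set[Tuple[str, Optional[str]]]:
    """Apply add/retract messages in order and return the surviving rows."""
    counts: Counter = Counter()
    for is_add, key, value in changes:
        counts[(key, value)] += 1 if is_add else -1
    return {row for row, n in counts.items() if n > 0}


ID, A, B = "myId-1", "myValue-A", "myValue-B"

# Expected output from the question, up to the [myId-1, myValue-B, 25] row.
expected = [
    (True, ID, None),
    (False, ID, None), (True, ID, A),
    (False, ID, A), (True, ID, A),
    (False, ID, A), (True, ID, B),
]

# Observed output for the same inputs, including the extra null rows.
observed = [
    (True, ID, None),
    (False, ID, None), (True, ID, A),
    (False, ID, A), (True, ID, None), (False, ID, None), (True, ID, A),
    (False, ID, A), (True, ID, None), (False, ID, None), (True, ID, B),
]

print(materialize(expected))
print(materialize(observed))
```

Both changelogs end with the single row (myId-1, myValue-B), so the extra null messages cancel out and do not change what a consumer of the materialized result sees.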
