用非空值填充set的空行



注意:我已经讨论了以下相关问题,这些问题不涉及我的查询

  • SQL:如何在一列中选择每一组具有重复值的行?
  • 用红移中的第一个非空值填充缺失值

我有一个像这样的稀疏的,不干净的数据集

| id  | operation | title    | channel_type | mode |
|-----|-----------|----------|--------------|------|
| abc | Start     |          |              |      |
| abc | Start     | recovery |              | Link |
| abc | Start     | recovery | SMS          |      |
| abc | Set       |          | Email        |      |
| abc | Verify    |          | Email        |      |
| pqr | Start     |          |              | OTP  |
| pqr | Verfiy    | sign_in  | Push         |      |
| pqr | Verify    |          |              |      |
| xyz | Start     | sign_up  |              | Link |

,我需要用其他行提供的非空数据填充每个id的空行

| id  | operation | title    | channel_type | mode |
|-----|-----------|----------|--------------|------|
| abc | Start     | recovery | SMS          | Link |
| abc | Start     | recovery | SMS          | Link |
| abc | Start     | recovery | SMS          | Link |
| abc | Set       | recovery | Email        | Link |
| abc | Verify    | recovery | Email        | Link |
| pqr | Start     | sign_in  | Push         | OTP  |
| pqr | Verfiy    | sign_in  | Push         | OTP  |
| pqr | Verify    | sign_in  | Push         | OTP  |
| xyz | Start     | sign_up  |              | Link |

指出

  • 某些id可以将某些字段在所有行
  • 中为空。
  • ,虽然大多数id将具有相同的非空值为每个字段,边缘情况可能有不同的值。对于这样的组,可以在所有行中填充任何非空值。[这在我的数据集中太少见了,可以忽略]
  • 另一个额外的模式是某些字段大多数只存在于某些operations的行,例如mode只存在于operation='Start'

我尝试通过id分组行,同时在title,channel_typemode列上执行listagg,然后是coalesce,以下内容:

WITH my_data AS (
SELECT
id,
operation,
title,
channel_type,
mode
FROM
my_db.my_table
),
list_aggregated_data AS (
SELECT
id,
listagg(title) AS titles,
listagg(channel_type) AS channel_types,
listagg(mode) AS modes
FROM
my_data
GROUP BY
id
),
coalesced_data AS (
SELECT DISTINCT
id,
coalesce(titles) AS title,
coalesce(channel_types) AS channel_type,
coalesce(modes) AS mode
FROM
list_aggregated_data
),
joined_data AS (
SELECT
md.id,
md.operation,
cd.title,
cd.channel_type,
cd.mode
FROM
my_data AS md
LEFT JOIN
coalesced_data AS cd ON cd.id = md.id
)
SELECT
*
FROM
joined_data
ORDER BY
id,
operation

但是由于某种原因,这导致了值的串联(大概来自coalesce操作),我得到

| id  | operation | title            | channel_type | mode |
|-----|-----------|------------------|--------------|------|
| abc | Start     | recoveryrecovery | SMS          | Link |
| abc | Start     | recoveryrecovery | SMS          | Link |
| abc | Start     | recoveryrecovery | SMS          | Link |
| abc | Set       | recoveryrecovery | Email        | Link |
| abc | Verify    | recoveryrecovery | Email        | Link |
| pqr | Start     | sign_in          | Push         | OTP  |
| pqr | Verfiy    | sign_in          | Push         | OTP  |
| pqr | Verify    | sign_in          | Push         | OTP  |
| xyz | Start     | sign_up          |              | Link |

解决这个问题的正确方法是什么?

我将从带有忽略null选项的first_value()窗口函数开始。您将按前2列进行分区,并且需要使用一些数据传递来解决边缘情况,可能是在窗口函数的order by子句中。

最新更新