注意:我已经讨论了以下相关问题,这些问题不涉及我的查询
- SQL:如何在一列中选择每一组具有重复值的行?
- 用红移中的第一个非空值填充缺失值
我有一个像这样的稀疏的,不干净的数据集
| id | operation | title | channel_type | mode |
|-----|-----------|----------|--------------|------|
| abc | Start | | | |
| abc | Start | recovery | | Link |
| abc | Start | recovery | SMS | |
| abc | Set | | Email | |
| abc | Verify | | Email | |
| pqr | Start | | | OTP |
| pqr | Verfiy | sign_in | Push | |
| pqr | Verify | | | |
| xyz | Start | sign_up | | Link |
,我需要用其他行提供的非空数据填充每个id
的空行
| id | operation | title | channel_type | mode |
|-----|-----------|----------|--------------|------|
| abc | Start | recovery | SMS | Link |
| abc | Start | recovery | SMS | Link |
| abc | Start | recovery | SMS | Link |
| abc | Set | recovery | Email | Link |
| abc | Verify | recovery | Email | Link |
| pqr | Start | sign_in | Push | OTP |
| pqr | Verfiy | sign_in | Push | OTP |
| pqr | Verify | sign_in | Push | OTP |
| xyz | Start | sign_up | | Link |
指出
- 某些
id
可以将某些字段在所有行 中为空。 - ,虽然大多数id将具有相同的非空值为每个字段,边缘情况可能有不同的值。对于这样的组,可以在所有行中填充任何非空值。[这在我的数据集中太少见了,可以忽略]
- 另一个额外的模式是某些字段大多数只存在于某些
operation
s的行,例如mode
只存在于operation='Start'
行
我尝试通过id
分组行,同时在title
,channel_type
和mode
列上执行listagg
,然后是coalesce
,以下内容:
WITH my_data AS (
SELECT
id,
operation,
title,
channel_type,
mode
FROM
my_db.my_table
),
list_aggregated_data AS (
SELECT
id,
listagg(title) AS titles,
listagg(channel_type) AS channel_types,
listagg(mode) AS modes
FROM
my_data
GROUP BY
id
),
coalesced_data AS (
SELECT DISTINCT
id,
coalesce(titles) AS title,
coalesce(channel_types) AS channel_type,
coalesce(modes) AS mode
FROM
list_aggregated_data
),
joined_data AS (
SELECT
md.id,
md.operation,
cd.title,
cd.channel_type,
cd.mode
FROM
my_data AS md
LEFT JOIN
coalesced_data AS cd ON cd.id = md.id
)
SELECT
*
FROM
joined_data
ORDER BY
id,
operation
但是由于某种原因,这导致了值的串联(大概来自coalesce
操作),我得到
| id | operation | title | channel_type | mode |
|-----|-----------|------------------|--------------|------|
| abc | Start | recoveryrecovery | SMS | Link |
| abc | Start | recoveryrecovery | SMS | Link |
| abc | Start | recoveryrecovery | SMS | Link |
| abc | Set | recoveryrecovery | Email | Link |
| abc | Verify | recoveryrecovery | Email | Link |
| pqr | Start | sign_in | Push | OTP |
| pqr | Verfiy | sign_in | Push | OTP |
| pqr | Verify | sign_in | Push | OTP |
| xyz | Start | sign_up | | Link |
解决这个问题的正确方法是什么?
我将从带有忽略null选项的first_value()窗口函数开始。您将按前2列进行分区,并且需要使用一些数据传递来解决边缘情况,可能是在窗口函数的order by子句中。