Grouping with the row_number function



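I have been using the row_number() function to select only the observations I need.
In my scenario, whenever a given <id, entity_id, period, element> combination has two different names, the one starting with "National" should be ignored. If there is only one name, that one should be kept.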

+----+-----------+--------+---------------+---------------------------+
| id | entity_id | period |    element    |           name            |
+----+-----------+--------+---------------+---------------------------+
| 12 | ABC123    |   2021 | Overall value | National Compatible - XYZ |
| 12 | ABC123    |   2021 | Overall value | Overall Estimation        |
+----+-----------+--------+---------------+---------------------------+

For a case like the one above, the following does the job:

SELECT *
FROM   (SELECT *,
               Row_number()
                 OVER (
                   PARTITION BY id, entity_id, period, element
                   ORDER BY name DESC) AS rn
        FROM   mydata) t  -- alias renamed from "table", which is a reserved word
WHERE  t.rn = 1

The problem is that there are now other, similar cases:

+----+-----------+--------+---------------+---------------------------+
| id | entity_id | period |    element    |           name            |
+----+-----------+--------+---------------+---------------------------+
| 12 | ABC123    |   2021 | Overall value | National Based - ZYX      |
| 12 | ABC123    |   2021 | Overall value | Base Estimation           |
+----+-----------+--------+---------------+---------------------------+

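With the current SQL this does not work: here "National Based - ZYX" sorts ahead of "Base Estimation" in descending order, so I would have to switch the ordering from descending to ascending.

Is there a way to deprioritize the "National..." names, and take the other record whenever there is more than one?

I am running the query on Hive/Impala.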

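If you add another derived-table layer (or use a CTE), you can add a CASE WHEN that checks whether name starts with 'National' and assigns a simple integer "tag" value, which can then be used to deprioritize those rows.

...like so: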

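-- Identifiers are left unquoted: Hive and Impala read double-quoted
-- text as string literals, not as column names.
WITH q AS (
    SELECT
        id,
        entity_id,
        period,
        element,
        name,
        -- 1 marks a 'National...' row, 2 marks any other row
        CASE WHEN name LIKE 'National%' THEN 1 ELSE 2 END AS tag
    FROM
        mydata
),
filtered AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY
                id, entity_id, period, element
            ORDER BY
                tag DESC,   -- non-National rows (tag 2) come first
                name DESC   -- tie-breaker within the same tag
        ) AS rn
    FROM
        q
)
SELECT
    *
FROM
    filtered
WHERE
    rn = 1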
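
The same idea can probably be written without the extra CTE layer by putting the CASE expression directly into the window's ORDER BY. This is only a sketch under the same assumptions as the query above (a table called mydata with the five columns shown in the question); the aliases m and ranked are made up for illustration, and I have not run it on Hive or Impala.

```sql
-- Sketch: one derived table instead of two CTEs.
-- Assumes mydata(id, entity_id, period, element, name) as in the question.
SELECT id, entity_id, period, element, name
FROM (
    SELECT
        m.*,
        ROW_NUMBER() OVER (
            PARTITION BY id, entity_id, period, element
            ORDER BY
                -- 0 for ordinary names, 1 for 'National...' names,
                -- so ordinary names sort first within each group
                CASE WHEN name LIKE 'National%' THEN 1 ELSE 0 END,
                name DESC  -- tie-breaker when several rows share the same flag
        ) AS rn
    FROM mydata m
) ranked
WHERE rn = 1
```

For the two sample groups in the question this should keep "Overall Estimation" and "Base Estimation" respectively, and a group whose only row is a "National..." name still returns that row, because it is the only candidate for rn = 1.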
