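I have been using the row_number() function to select only the observations I need.
In my scenario, whenever a particular <id, entity_id, period, element> combination has two different names, the one starting with National should be ignored. If there is only one name, that single row should be kept.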
+----+-----------+--------+---------------+---------------------------+
| id | entity_id | period | element       | name                      |
+----+-----------+--------+---------------+---------------------------+
| 12 | ABC123    | 2021   | Overall value | National Compatible - XYZ |
| 12 | ABC123    | 2021   | Overall value | Overall Estimation        |
+----+-----------+--------+---------------+---------------------------+
For a case like the one above, the following query does the job:
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY id, entity_id, period, element
               ORDER BY name DESC
           ) AS rn
    FROM mydata
) t  -- alias renamed from "table", which is a reserved word
WHERE t.rn = 1
The problem is that there are now other, similar cases:
+----+-----------+--------+---------------+---------------------------+
| id | entity_id | period | element       | name                      |
+----+-----------+--------+---------------+---------------------------+
| 12 | ABC123    | 2021   | Overall value | National Based - ZYX      |
| 12 | ABC123    | 2021   | Overall value | Base Estimation           |
+----+-----------+--------+---------------+---------------------------+
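With the current SQL this does not work: "National Based - ZYX" sorts after "Base Estimation", so I would have to change the ordering from descending to ascending for this case.
Is it possible to deprioritize the "National ..." names and take the other record whenever there is more than one?
I am running the query on Hive/Impala.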
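If you add another derived-table layer (or use a CTE), you can add a CASE WHEN expression that checks whether name starts with 'National' and assigns a simple integer "tag" value, which can then be used to deprioritize those rows.
Something like this: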
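-- Identifiers are quoted with backticks: by default, Hive and Impala
-- treat double-quoted text as a string literal, not a column name.
WITH q AS (
    SELECT
        `id`,
        `entity_id`,
        `period`,
        `element`,
        `name`,
        -- 1 = "National ..." row (low priority), 2 = any other row
        CASE WHEN `name` LIKE 'National%' THEN 1 ELSE 2 END AS `tag`
    FROM
        mydata
),
filtered AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY
                `id`, `entity_id`, `period`, `element`
            ORDER BY
                `tag` DESC,   -- non-National rows (tag 2) sort first
                `name` DESC   -- deterministic tie-breaker within a tag
        ) AS rn
    FROM
        q
)
SELECT
    *
FROM
    filtered
WHERE
    rn = 1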
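
As a quick sanity check of the tagging idea, here is a minimal sketch that runs the same prioritisation against the two sample groups from the question, plus a group with a single row. It uses SQLite through Python's sqlite3 module purely as a stand-in for Hive/Impala (an assumption; it needs SQLite 3.25 or newer for window functions). It also folds the CASE expression directly into ORDER BY rather than using a separate CTE, and pick_names is a hypothetical helper name introduced only for this illustration.

```python
import sqlite3

# Same prioritisation as the CTE version, with the CASE expression placed
# directly in ORDER BY. Non-"National" rows get 2 and sort first under DESC.
QUERY = """
SELECT id, entity_id, period, element, name
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY id, entity_id, period, element
               ORDER BY CASE WHEN name LIKE 'National%' THEN 1 ELSE 2 END DESC,
                        name DESC
           ) AS rn
    FROM mydata
) t
WHERE rn = 1
"""


def pick_names(rows):
    """Load rows into an in-memory table and return the name kept per group."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE mydata ("
            "id INTEGER, entity_id TEXT, period INTEGER, element TEXT, name TEXT)"
        )
        conn.executemany("INSERT INTO mydata VALUES (?, ?, ?, ?, ?)", rows)
        return [row[4] for row in conn.execute(QUERY)]
    finally:
        conn.close()


# Sample rows copied from the two tables in the question.
KEY = (12, "ABC123", 2021, "Overall value")
case_1 = [KEY + ("National Compatible - XYZ",), KEY + ("Overall Estimation",)]
case_2 = [KEY + ("National Based - ZYX",), KEY + ("Base Estimation",)]
# Hypothetical extra case: only a National row exists, so it must be kept.
single = [KEY + ("National Based - ZYX",)]

print(pick_names(case_1))
print(pick_names(case_2))
print(pick_names(single))
```

Under those assumptions the three calls print ['Overall Estimation'], ['Base Estimation'] and ['National Based - ZYX'], so the National row is dropped only when another name exists for the same group.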