Hive 查询使用 "not column=value" where 子句删除空值



table1 数据样本:

year month day utmsource
2017 03    26  NULL
2017 03    27  NULL
2017 03    27  facebook
2017 03    27  newsletter
2017 03    27  banner
2017 03    27  facebook    

预期选择:

year month day utmsource
2017 03    27  NULL
2017 03    27  newsletter
2017 03    27  banner 

我的蜂巢查询:

-- result = 0, it did not include the NULL utmsource record
SELECT SUM(CASE WHEN utmsource IS NULL THEN 1 ELSE 0 END) as amountnull
FROM table1
WHERE year=2017 AND month=03 AND day=27 AND NOT utmsource="facebook"
-- result = 1 the NULL utmsource record is included
SELECT SUM(CASE WHEN utmsource IS NULL THEN 1 ELSE 0 END) as amountnull
FROM table1
WHERE year=2017 AND month=03 AND day=27 AND (utmsource IS NULL OR NOT utmsource="facebook")
-- also returns 0, the NULL utmsource record is not included
SELECT SUM(CASE WHEN utmsource IS NULL THEN 1 ELSE 0 END) as amountnull
FROM table1
WHERE year=2017 AND month=03 AND day=27 AND NOT utmsource <=> 'facebook';

问题:

  1. 有人可以解释这种行为吗?
  2. 我可以将设置更改为 检索查询 2 的结果而不添加额外的 OR 查询中的功能?=> 不等于在结果中包含空值

你想要的是一个NULL安全的相等(或不相等)运算符。 在 ANSI SQL 中,有一个名为is distinct from的运算符。 Hive似乎使用MySQL版本,这是<=>。 因此,您可以执行以下操作:

SELECT SUM(CASE WHEN utmsource IS NULL THEN 1 ELSE 0 END) as amountnull
FROM tablename
WHERE year=2017 AND month=03 AND day=27 AND NOT utmsource <=> 'facebook';

文档中介绍了此运算符。

我还应该指出,您可能会发现这是一个更简单的SELECT公式:

SELECT (COUNT(*) - COUNT(utmsource)) as amountnull
FROM tablename
WHERE year=2017 AND month=03 AND day=27 AND NOT utmsource <=> 'facebook';

虽然,总的来说,这似乎是最简单的:

SELECT COUNT(*)as amountnull
FROM tablename
WHERE year=2017 AND month=03 AND day=27 AND utmsource IS NULL;

'Facebook'进行比较是不必要的。

最新更新