Apache Spark: how to filter out, in SQL, the records of a partitioned, ordered table where the first record of a group is NULL



Data

ROW  YEAR  PROD   KEY   DATE
1    2011  APPLE  TIME  2011-11-18 00:00:00.000
2    2011  APPLE  TIME  2011-11-19 00:00:00.000
3    2013  APPLE  NULL  2011-11-18 00:00:00.000
4    2013  APPLE  NULL  2011-11-19 00:00:00.000
5    2013  APPLE  TIME  2014-04-08 00:00:00.000
6    2013  APPLE  DIM   2014-04-09 00:00:00.000
7    2013  APPLE  TIME  2014-11-10 10:50:14.113
8    2013  APPLE  TIME  2014-11-12 10:46:04.947
9    2013  MELON  JAK   2011-10-17 11:01:19.657
10   2013  MELON  TIME  2014-11-18 11:19:35.547
11   2013  MELON  NULL  2014-11-19 11:19:35.547
12   2013  MELON  TIME  2014-11-21 10:32:36.017
13   2014  APPLE  JAK   2003-04-10 00:00:00.000
14   2014  APPLE  DIM   2003-04-11 00:00:00.000
15   2015  APPLE  TIME  2002-09-27 00:00:00.000
16   2015  APPLE  NULL  2004-09-28 00:00:00.000

ROW is not a column in the table. It is only there to show which records I want.

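Question

The data above is partitioned by YEAR and PROD and ordered by DATE.

I need to keep all rows except rows 3 and 4, based on the following logic:

  • If the first rows of a group (here a group is (YEAR, PROD)) have a NULL KEY, discard them
  • Rows 11 and 16 are NULL, but we keep them because they are not the first in their group

Every group must start with a record whose KEY is not NULL

===> otherwise discard

In other words, I can have: not null, null, not null, null

But I cannot have: null, not null, null, not null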
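
The rule above can be made concrete outside SQL. The following is a minimal Python sketch, not Spark code: it assumes the rows are available as tuples (row, year, prod, key, date) with None standing in for NULL, and the function name drop_leading_nulls is hypothetical. The ROW number is carried along only to make the result easy to read.

```python
from itertools import dropwhile, groupby

# Sample rows from the question: (row, year, prod, key, date); None stands in for NULL.
ROWS = [
    (1, 2011, "APPLE", "TIME", "2011-11-18 00:00:00.000"),
    (2, 2011, "APPLE", "TIME", "2011-11-19 00:00:00.000"),
    (3, 2013, "APPLE", None, "2011-11-18 00:00:00.000"),
    (4, 2013, "APPLE", None, "2011-11-19 00:00:00.000"),
    (5, 2013, "APPLE", "TIME", "2014-04-08 00:00:00.000"),
    (6, 2013, "APPLE", "DIM", "2014-04-09 00:00:00.000"),
    (7, 2013, "APPLE", "TIME", "2014-11-10 10:50:14.113"),
    (8, 2013, "APPLE", "TIME", "2014-11-12 10:46:04.947"),
    (9, 2013, "MELON", "JAK", "2011-10-17 11:01:19.657"),
    (10, 2013, "MELON", "TIME", "2014-11-18 11:19:35.547"),
    (11, 2013, "MELON", None, "2014-11-19 11:19:35.547"),
    (12, 2013, "MELON", "TIME", "2014-11-21 10:32:36.017"),
    (13, 2014, "APPLE", "JAK", "2003-04-10 00:00:00.000"),
    (14, 2014, "APPLE", "DIM", "2003-04-11 00:00:00.000"),
    (15, 2015, "APPLE", "TIME", "2002-09-27 00:00:00.000"),
    (16, 2015, "APPLE", None, "2004-09-28 00:00:00.000"),
]


def drop_leading_nulls(rows):
    """Per (year, prod) group ordered by date, drop rows until the first non-NULL key."""
    # ISO-style timestamps sort chronologically as plain strings.
    ordered = sorted(rows, key=lambda r: (r[1], r[2], r[4]))
    kept = []
    for _, group in groupby(ordered, key=lambda r: (r[1], r[2])):
        # dropwhile skips only the leading NULL keys; later NULLs are kept.
        kept.extend(dropwhile(lambda r: r[3] is None, group))
    return kept


result = drop_leading_nulls(ROWS)
print([r[0] for r in result])
```

Rows 3 and 4 are dropped because nothing non-NULL precedes them in the 2013 APPLE group, while rows 11 and 16 survive because their groups already started with a non-NULL key.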

Expected result

ROW  YEAR  PROD   KEY   DATE
1    2011  APPLE  TIME  2011-11-18 00:00:00.000
2    2011  APPLE  TIME  2011-11-19 00:00:00.000
5    2013  APPLE  TIME  2014-04-08 00:00:00.000
6    2013  APPLE  DIM   2014-04-09 00:00:00.000
7    2013  APPLE  TIME  2014-11-10 10:50:14.113
8    2013  APPLE  TIME  2014-11-12 10:46:04.947
9    2013  MELON  JAK   2011-10-17 11:01:19.657
10   2013  MELON  TIME  2014-11-18 11:19:35.547
11   2013  MELON  TIME  2014-11-19 11:19:35.547
12   2013  MELON  TIME  2014-11-21 10:32:36.017
13   2014  APPLE  JAK   2003-04-10 00:00:00.000
14   2014  APPLE  DIM   2003-04-11 00:00:00.000
15   2015  APPLE  TIME  2002-09-27 00:00:00.000
16   2015  APPLE  TIME  2004-09-28 00:00:00.000

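I want to do this so that every group always starts with a non-NULL key. That way I can later always use the previous row to fill in subsequent records whose KEY is NULL (rows 11 and 16 in this example).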
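
The follow-up step described above (filling a NULL key from the previous row) might look like the following once every group starts with a non-NULL key. This is a minimal Python sketch rather than Spark code; forward_fill_keys is a hypothetical helper, and the rows are assumed to be (year, prod, key, date) tuples already sorted by group and date, with None for NULL.

```python
def forward_fill_keys(rows):
    """Replace each NULL key with the last non-NULL key seen in the same (year, prod) group."""
    filled = []
    last_group, last_key = None, None
    for year, prod, key, date in rows:
        if (year, prod) != last_group:
            # New group: forget the key carried over from the previous group.
            last_group, last_key = (year, prod), None
        if key is None:
            key = last_key
        else:
            last_key = key
        filled.append((year, prod, key, date))
    return filled


# The 2013 MELON group (rows 9 to 12 in the question).
melon = [
    (2013, "MELON", "JAK", "2011-10-17 11:01:19.657"),
    (2013, "MELON", "TIME", "2014-11-18 11:19:35.547"),
    (2013, "MELON", None, "2014-11-19 11:19:35.547"),
    (2013, "MELON", "TIME", "2014-11-21 10:32:36.017"),
]
filled = forward_fill_keys(melon)
print([key for _, _, key, _ in filled])
```

A group that still began with NULL would keep that NULL, since there is no earlier key to copy, which is why the leading NULL rows have to be removed first.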

Any comments or suggestions would be greatly appreciated!

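The query below produces the output you want. It takes a running MIN of the key column over the rows of each group, from the first row up to the current one. Aggregate functions ignore NULLs, so min_val is NULL only while no non-NULL key has been seen yet in the group. As soon as a non-NULL key appears, min_val is non-NULL for that row and for every later row, and the outer filter keeps them.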

select *
from (
    select year, prod, key1, date1,
           min(key1) over (partition by year, prod order by date1 asc) as min_val
    from t
) x
where x.min_val is not null

+------+-------+------+-------------------------+---------+
| year | prod  | key1 |          date1          | min_val |
+------+-------+------+-------------------------+---------+
| 2011 | APPLE | TIME | 2011-11-18 00:00:00.000 | TIME    |
| 2011 | APPLE | TIME | 2011-11-19 00:00:00.000 | TIME    |
| 2013 | APPLE | TIME | 2014-04-08 00:00:00.000 | TIME    |
| 2013 | APPLE | DIM  | 2014-04-09 00:00:00.000 | DIM     |
| 2013 | APPLE | TIME | 2014-11-10 10:50:14.113 | DIM     |
| 2013 | APPLE | TIME | 2014-11-12 10:46:04.947 | DIM     |
| 2013 | MELON | JAK  | 2011-10-17 11:01:19.657 | JAK     |
| 2013 | MELON | TIME | 2014-11-18 11:19:35.547 | JAK     |
| 2013 | MELON |      | 2014-11-19 11:19:35.547 | JAK     |
| 2013 | MELON | TIME | 2014-11-21 10:32:36.017 | JAK     |
| 2014 | APPLE | JAK  | 2003-04-10 00:00:00.000 | JAK     |
| 2014 | APPLE | DIM  | 2003-04-11 00:00:00.000 | DIM     |
| 2015 | APPLE | TIME | 2002-09-27 00:00:00.000 | TIME    |
| 2015 | APPLE |      | 2004-09-28 00:00:00.000 | TIME    |
+------+-------+------+-------------------------+---------+

Link: https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=ae82f64802674aa60005b8e9f534a150
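
To try the running-MIN idea without a SQL Server or Spark instance, the same query can be run against a few of the sample rows in an in-memory SQLite database. This is only a sketch under assumptions: SQLite (3.25 or later, needed for window functions) stands in for the real engine, and the table name t and the columns key1 and date1 follow the query above.

```python
import sqlite3

# (year, prod, key1, date1) for the 2013 APPLE and 2015 APPLE groups; None is NULL.
rows = [
    (2013, "APPLE", None, "2011-11-18 00:00:00.000"),
    (2013, "APPLE", None, "2011-11-19 00:00:00.000"),
    (2013, "APPLE", "TIME", "2014-04-08 00:00:00.000"),
    (2013, "APPLE", "DIM", "2014-04-09 00:00:00.000"),
    (2015, "APPLE", "TIME", "2002-09-27 00:00:00.000"),
    (2015, "APPLE", None, "2004-09-28 00:00:00.000"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (year INTEGER, prod TEXT, key1 TEXT, date1 TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)

query = """
SELECT year, prod, key1, date1, min_val
FROM (
    SELECT year, prod, key1, date1,
           -- MIN skips NULLs, so min_val is NULL until a non-NULL key1 appears.
           MIN(key1) OVER (PARTITION BY year, prod ORDER BY date1) AS min_val
    FROM t
) AS x
WHERE min_val IS NOT NULL
ORDER BY year, prod, date1
"""
kept = conn.execute(query).fetchall()
for row in kept:
    print(row)
```

The two leading NULL rows of 2013 APPLE are filtered out, while the trailing NULL row of 2015 APPLE is kept because its running minimum is already TIME.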

There may be fancier solutions, but essentially (you can drop the square brackets if KEY, DATE, etc. are not reserved words in your product; I used T-SQL):

select *
from Tbl T1
where
    /* Do not include if... */
    NOT (
        T1.[KEY] is null
        /* This is part of the first KEY=NULL rows for this group
           (no preceding record with KEY<>NULL) */
        and not exists
            (select 1
             from Tbl T3
             where T3.[YEAR] = T1.[YEAR]
               and T3.PROD = T1.PROD
               and T3.[DATE] < T1.[DATE]
               and T3.[KEY] is not null
            )
        /* There are KEY<>NULL values further down */
        and exists
            (select 1
             from Tbl T2
             where T2.[YEAR] = T1.[YEAR]
               and T2.PROD = T1.PROD
               and T2.[DATE] > T1.[DATE]
               and T2.[KEY] is not null
            )
    )
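
The correlated-subquery version can also be checked on a few of the sample rows. The sketch below runs it in an in-memory SQLite database; this is an assumption made for illustration, not the T-SQL setup used above, and the identifiers are quoted with standard double quotes instead of square brackets.

```python
import sqlite3

# Rows 3, 4, 5 (2013 APPLE) and 9, 11 (2013 MELON) from the question; None is NULL.
rows = [
    (2013, "APPLE", None, "2011-11-18 00:00:00.000"),
    (2013, "APPLE", None, "2011-11-19 00:00:00.000"),
    (2013, "APPLE", "TIME", "2014-04-08 00:00:00.000"),
    (2013, "MELON", "JAK", "2011-10-17 11:01:19.657"),
    (2013, "MELON", None, "2014-11-19 11:19:35.547"),
]

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE Tbl ("YEAR" INTEGER, PROD TEXT, "KEY" TEXT, "DATE" TEXT)')
conn.executemany("INSERT INTO Tbl VALUES (?, ?, ?, ?)", rows)

query = """
SELECT T1."YEAR", T1.PROD, T1."KEY", T1."DATE"
FROM Tbl T1
WHERE NOT (
    T1."KEY" IS NULL
    -- no earlier row in the group has a non-NULL KEY
    AND NOT EXISTS (
        SELECT 1 FROM Tbl T3
        WHERE T3."YEAR" = T1."YEAR" AND T3.PROD = T1.PROD
          AND T3."DATE" < T1."DATE" AND T3."KEY" IS NOT NULL
    )
    -- a later row in the group has a non-NULL KEY
    AND EXISTS (
        SELECT 1 FROM Tbl T2
        WHERE T2."YEAR" = T1."YEAR" AND T2.PROD = T1.PROD
          AND T2."DATE" > T1."DATE" AND T2."KEY" IS NOT NULL
    )
)
ORDER BY T1."YEAR", T1.PROD, T1."DATE"
"""
kept = conn.execute(query).fetchall()
for row in kept:
    print(row)
```

Because of the final EXISTS condition, a group whose keys are all NULL is returned unchanged by this query; only leading NULL rows that are followed by a non-NULL key are removed.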

A query like this may help:

select YEAR, PROD, KEY, DATE
from (
    select YEAR, PROD, KEY, DATE,
           MIN(CASE WHEN KEY IS NULL THEN DATE ELSE NULL END)
               OVER(PARTITION BY YEAR, PROD) AS MIN_NULL_KEY_DATE,
           ROW_NUMBER() OVER(PARTITION BY YEAR, PROD ORDER BY DATE ASC) RN
    from your_table yt
) rpr
where 1 = 1
  and CASE WHEN RN = 1 AND DATE = MIN_NULL_KEY_DATE THEN 0 ELSE 1 END = 1

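What I tried to do here: for each (YEAR, PROD) group, find the minimum DATE among the rows whose KEY is NULL, and use ROW_NUMBER to check whether a row is the first row of its group. If RN = 1 and the row's DATE equals that minimum NULL-key date, the row is filtered out. Note that only the row with RN = 1 is removed, so if a group starts with several NULL rows (rows 3 and 4 in the sample data), the second one is still returned.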
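
If a group can start with several consecutive NULL rows, one possible variation is to count the non-NULL keys seen so far instead of testing only RN = 1. The sketch below is an illustration under assumptions, not part of the answer above: SQLite (3.25 or later) stands in for the target engine, your_table is the placeholder name from the query, the columns are renamed key1 and date1 to sidestep reserved words, and the alias seen is hypothetical.

```python
import sqlite3

# Rows 3, 4, 5 (2013 APPLE) and 9, 11 (2013 MELON) from the question; None is NULL.
rows = [
    (2013, "APPLE", None, "2011-11-18 00:00:00.000"),
    (2013, "APPLE", None, "2011-11-19 00:00:00.000"),
    (2013, "APPLE", "TIME", "2014-04-08 00:00:00.000"),
    (2013, "MELON", "JAK", "2011-10-17 11:01:19.657"),
    (2013, "MELON", None, "2014-11-19 11:19:35.547"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_table (year INTEGER, prod TEXT, key1 TEXT, date1 TEXT)"
)
conn.executemany("INSERT INTO your_table VALUES (?, ?, ?, ?)", rows)

query = """
SELECT year, prod, key1, date1
FROM (
    SELECT year, prod, key1, date1,
           -- COUNT(key1) ignores NULLs: it stays 0 until the first non-NULL key.
           COUNT(key1) OVER (PARTITION BY year, prod ORDER BY date1) AS seen
    FROM your_table
) AS rpr
WHERE seen > 0
ORDER BY year, prod, date1
"""
kept = conn.execute(query).fetchall()
for row in kept:
    print(row)
```

Both leading NULL rows of 2013 APPLE are dropped here, and the NULL row of 2013 MELON is kept because a non-NULL key precedes it.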
