选择超过总值百分比的行的子集



我有一个关于客户、用户和收入的表格,类似于下面(实际上有数千条记录):

Customer   User    Revenue
001        James   500
002        James   750
003        James   450
004        Sarah   100
005        Sarah   500
006        Sarah   150
007        Sarah   600
008        James   150
009        James   100

我想做的是只返回支出最高的客户,这些客户占用户总收入的80%。

要手动做到这一点,我会根据James的客户的收入对他们进行排序,计算出总额的百分比和运行总额的百分比,然后只返回记录,直到运行总额达到80%:

Customer    User    Revenue     % of total  Running Total %
002         James   750         0.38        0.38 
001         James   500         0.26        0.64 
003         James   450         0.23        0.87  <- Greater than 80%, last record
008         James   150         0.08        0.95 
009         James   100         0.05        1.00 

我试过使用CTE,但到目前为止都是空白的。有没有任何方法可以通过单个查询而不是在Excel工作表中手动完成?

仅限SQL Server 2012+

您可以使用带窗口的SUM:

WITH cte AS
(
   SELECT *,
          1.0 * Revenue/SUM(Revenue) OVER(PARTITION BY [User]) AS percentile,
          1.0 * SUM(Revenue) OVER(PARTITION BY [User] ORDER BY [Revenue] DESC)
                /SUM(Revenue) OVER(PARTITION BY [User]) AS running_percentile
   FROM tab
)
SELECT *
FROM cte 
WHERE running_percentile <= 0.8;

LiveDemo


SQL Server 2008:

WITH cte AS
(
    SELECT *, ROW_NUMBER() OVER(PARTITION BY [User] ORDER BY Revenue DESC) AS rn
    FROM t    
), cte2 AS
(
    SELECT c.Customer, c.[User], c.[Revenue]
           ,percentile         = 1.0 * Revenue / NULLIF(c3.s,0)
           ,running_percentile = 1.0 * c2.s    / NULLIF(c3.s,0)
    FROM cte c
    CROSS APPLY
         (SELECT SUM(Revenue) AS s
          FROM cte c2
          WHERE c.[User] = c2.[User]
            AND c2.rn <= c.rn) c2
    CROSS APPLY
         (SELECT SUM(Revenue) AS s
          FROM cte c2
          WHERE c.[User] = c2.[User]) AS c3
) 
SELECT *
FROM cte2
WHERE running_percentile <= 0.8;

LiveDemo2

输出:

╔══════════╦═══════╦═════════╦════════════════╦════════════════════╗
║ Customer ║ User  ║ Revenue ║   percentile   ║ running_percentile ║
╠══════════╬═══════╬═════════╬════════════════╬════════════════════╣
║        2 ║ James ║     750 ║ 0,384615384615 ║ 0,384615384615     ║
║        1 ║ James ║     500 ║ 0,256410256410 ║ 0,641025641025     ║
║        7 ║ Sarah ║     600 ║ 0,444444444444 ║ 0,444444444444     ║
╚══════════╩═══════╩═════════╩════════════════╩════════════════════╝

编辑2:

看起来差不多了,唯一的问题是它缺少最后一行,詹姆斯的第三排超过了0.80,但需要包括在内

WITH cte AS
(
    SELECT *, ROW_NUMBER() OVER(PARTITION BY [User] ORDER BY Revenue DESC) AS rn
    FROM t    
), cte2 AS
(
    SELECT c.Customer, c.[User], c.[Revenue]
           ,percentile         = 1.0 * Revenue / NULLIF(c3.s,0)
           ,running_percentile = 1.0 * c2.s    / NULLIF(c3.s,0)
    FROM cte c
    CROSS APPLY
         (SELECT SUM(Revenue) AS s
          FROM cte c2
          WHERE c.[User] = c2.[User]
            AND c2.rn <= c.rn) c2
    CROSS APPLY
         (SELECT SUM(Revenue) AS s
          FROM cte c2
          WHERE c.[User] = c2.[User]) AS c3
) 
SELECT a.*
FROM cte2 a
CROSS APPLY (SELECT MIN(running_percentile) AS rp
             FROM cte2
             WHERE running_percentile >= 0.8
               AND cte2.[User] = a.[User]) AS s
WHERE a.running_percentile <= s.rp;

LiveDemo3

输出:

╔══════════╦═══════╦═════════╦════════════════╦════════════════════╗
║ Customer ║ User  ║ Revenue ║   percentile   ║ running_percentile ║
╠══════════╬═══════╬═════════╬════════════════╬════════════════════╣
║        2 ║ James ║     750 ║ 0,384615384615 ║ 0,384615384615     ║
║        1 ║ James ║     500 ║ 0,256410256410 ║ 0,641025641025     ║
║        3 ║ James ║     450 ║ 0,230769230769 ║ 0,871794871794     ║
║        7 ║ Sarah ║     600 ║ 0,444444444444 ║ 0,444444444444     ║
║        5 ║ Sarah ║     500 ║ 0,370370370370 ║ 0,814814814814     ║
╚══════════╩═══════╩═════════╩════════════════╩════════════════════╝

看起来很完美,翻译到我的大桌子上并返回我需要的东西,花了整整5分钟的时间来完成它,仍然无法遵循你所做的

SQL Server 2008不支持OVER()子句中的所有内容,但ROW_NUMBER支持。

首先cte只计算组内的位置:

╔═══════════╦════════╦══════════╦════╗
║ Customer  ║ User   ║ Revenue  ║ rn ║
╠═══════════╬════════╬══════════╬════╣
║        2  ║ James  ║     750  ║  1 ║
║        1  ║ James  ║     500  ║  2 ║
║        3  ║ James  ║     450  ║  3 ║
║        8  ║ James  ║     150  ║  4 ║
║        9  ║ James  ║     100  ║  5 ║
║        7  ║ Sarah  ║     600  ║  1 ║
║        5  ║ Sarah  ║     500  ║  2 ║
║        6  ║ Sarah  ║     150  ║  3 ║
║        4  ║ Sarah  ║     100  ║  4 ║
╚═══════════╩════════╩══════════╩════╝

第二次cte:

  • c2子查询根据ROW_NUMBER的排名计算运行总数
  • c3计算每个用户的全额

在最终查询中,s子查询发现最低的running总数超过80%。

编辑3:

使用ROW_NUMBER实际上是多余的。

WITH cte AS
(
    SELECT c.Customer, c.[User], c.[Revenue]
           ,percentile         = 1.0 * Revenue / NULLIF(c3.s,0)
           ,running_percentile = 1.0 * c2.s    / NULLIF(c3.s,0)
    FROM t c
    CROSS APPLY
         (SELECT SUM(Revenue) AS s
          FROM t c2
          WHERE c.[User] = c2.[User]
            AND c2.Revenue >= c.Revenue) c2
    CROSS APPLY
         (SELECT SUM(Revenue) AS s
          FROM t c2
          WHERE c.[User] = c2.[User]) AS c3
) 
SELECT a.*
FROM cte a
CROSS APPLY (SELECT MIN(running_percentile) AS rp
             FROM cte c2
             WHERE running_percentile >= 0.8
               AND c2.[User] = a.[User]) AS s
WHERE a.running_percentile <= s.rp
ORDER BY [User], Revenue DESC;

LiveDemo4

在SQL Server 2012+中,您可以使用累积和,这样效率会高得多。在SQLServer2008中,可以使用相关的子查询或cross apply:来执行此操作

select t.*,
       sum(t.Revenue*1.0) / sum(t.Revenue) over (partition by user) as [% of Total],
       sum(RunningRevenue*1.0) / sum(t.Revenue) over (partition by user) as [Running Total %]
from t cross apply
     (select sum(Revenue) as RunningRevenue
      from t t2
      where t2.Revenue >= t.Revenue and t2.user = t.user
     ) t2;

注意:*1.0只是为了防止Revenue被存储为整数。SQL Server执行整数除法,这将为几乎所有行的两列返回0

编辑:

如果只想得到James的结果,请添加where user = 'James'

最新更新