如何使用标准偏差删除SQL中的异常值



我目前有一个数据集,其中包含交易信息,包括日期、金额字段和描述性字段。我想做简单的统计分析(平均值、中位数等(,但需要先去除异常值。我遇到的问题是在按地区、城市和日期对数据进行分组时删除异常值。我确信使用健壮的编程语言(R、Python等(可以更容易地完成这项工作,但我必须使用SQL(SSMS(来完成这项任务。

简单数据集示例:

| Region | City    | Date       | Amount |
| ------ | ------- | ---------- | ------ |
| SW     | Phoenix | 2021-10-01 | 400    |
| NE     | Boston  | 2021-10-03 | 20     |
| SW     | Phoenix | 2021-10-03 | 800    |
| SW     | Phoenix | 2021-10-02 | 425    |
| NE     | Boston  | 2021-10-01 | 500    |
| SW     | Phoenix | 2021-10-02 | 15     |
| SW     | Phoenix | 2021-10-04 | 100    |
| NE     | Boston  | 2021-10-04 | 35     |
| SE     | Orlando | 2021-10-02 | 300    |

最初我想使用IQR方法,但我认为使用标准偏差(STDEV(会更容易。然而,一旦包含STDEV代码,我就会遇到分组问题。

这是我目前的代码:

WITH CTE_data AS (
SELECT 
Region
,City
,Date
,Amount
FROM OrderTable
)
SELECT 
Region
,City
,MAX(Date) AS MaxDate
,MIN(Date) AS MinDate
,AVG(Amount) AS AvgAmt
,STDEV(Amount) AS StedvAmt
FROM CTE_data
GROUP BY Region, City 

给定示例数据集,我希望忽略(Amount>(AvgAmt-StedvAmt(AND Amount<(平均金额+现金金额((。我的最终目标是获得每个地区/城市的平均金额,以及平均金额(不包括异常值(。我想要的输出是:

| Region | City    | MinDate    | MaxDate    | AvgAmt | AvgAmt_rem |
| ------ | ------- | ---------- | ---------- | ------ | ---------- |
| SW     | Phoenix | 2021-10-01 | 2021-10-04 | 348    | 308.33     |
| NE     | Boston  | 2021-10-01 | 2021-10-04 | 185    | 27.5       |
| SE     | Orlando | 2021-10-02 | 2021-10-02 | 300    | 300        |

请注意:为了简单起见,我没有为我想要的输出进行实际的标准差计算,我只是根据数值去除了异常值(菲尼克斯为800&15,波士顿为500(。

没有必要使用自联接。您可以使用窗口函数对基表进行一次扫描来完成此操作

WITH cteStats as ( --Calculate the Avg & Std of the raw data
SELECT 
*
,AVG(Amount) OVER (PARTITION BY Region, City) AS AvgRaw
,STDEV(Amount) OVER (PARTITION BY Region, City) AS StedRaw
FROM CTE_data
)
SELECT 
S.Region
,S.City
,MIN(S.ReadingDate) AS MinDate
,MAX(S.ReadingDate) AS MaxDate
,AVG(S.Amount) AS AvgFiltered
,STDEV(S.Amount) AS StedvFiltered
FROM cteStats as S
WHERE s.Amount BETWEEN S.AvgRaw - S.StedRaw AND S.AvgRaw + S.StedRaw 
GROUP BY s.Region, S.City;

db<gt;小提琴

通过使用条件聚合,您还可以显示有过滤行和无过滤行

WITH cteStats as ( --Calculate the Avg & Std of the raw data
SELECT 
*
,AVG(Amount) OVER (PARTITION BY Region, City) AS AvgRaw
,STDEV(Amount) OVER (PARTITION BY Region, City) AS StedRaw
FROM CTE_data
)
SELECT 
S.Region
,S.City
,MIN(CASE WHEN s.Amount BETWEEN S.AvgRaw - S.StedRaw AND S.AvgRaw + S.StedRaw 
THEN S.ReadingDate END) AS MinDate
,MAX(CASE WHEN s.Amount BETWEEN S.AvgRaw - S.StedRaw AND S.AvgRaw + S.StedRaw 
THEN S.ReadingDate END) AS MaxDate
,AVG(CASE WHEN s.Amount BETWEEN S.AvgRaw - S.StedRaw AND S.AvgRaw + S.StedRaw 
THEN S.Amount END) AS AvgFiltered
,AVG(S.Amount) AS AvgAll
FROM cteStats as S
GROUP BY s.Region, S.City;

db<gt;小提琴

您可以进行两次计算,第一次仅用作过滤器,第二次排除异常值并重新计算新的平均

with CTE_data as ( --Your original data
SELECT * FROM ( VALUES 
('SW', 'Phoenix', CONVERT(DATE,'2021-10-01'), CONVERT(DECIMAL(18,6),'400'))
, ('NE', 'Boston', '2021-10-03', '20')
, ('SW', 'Phoenix', '2021-10-03', '800')
, ('SW', 'Phoenix', '2021-10-02', '425')
, ('NE', 'Boston', '2021-10-01', '500')
, ('SW', 'Phoenix', '2021-10-02', '15')
, ('SW', 'Phoenix', '2021-10-04', '100')
, ('NE', 'Boston', '2021-10-04', '35')
, ('SE', 'Orlando', '2021-10-02', '300')
) as TabA(Region, City, ReadingDate, Amount) 
), cteStats as ( --Calculate the Avg & Std of the raw data
SELECT 
Region
, City
, AVG(Amount) AS AvgRaw --ADD BELOW CASE to keep singleton from filtering itself
, CASE WHEN STDEV(Amount) > 0.01 THEN STDEV(Amount) ELSE 0.01 END AS StedRaw
FROM CTE_data
GROUP BY Region, City 
), ctFiltered as ( --do it again but this time filter out the outliers
SELECT 
D.Region
,D.City
,MIN(ReadingDate) AS MinDate
,MAX(ReadingDate) AS MaxDate
,AVG(Amount) AS AvgFiltered
,STDEV(Amount) AS StedvFiltered
FROM CTE_data as D --Your original data again, but tied to the stats and filtered on the stats
INNER JOIN cteStats as S on D.Region = S.Region AND D.City = S.City 
WHERE D.Amount BETWEEN S.AvgRaw - S.StedRaw AND S.AvgRaw + S.StedRaw 
GROUP BY D.Region, D.City 
)
SELECT D.Region, D.City , D.MinDate , D.MaxDate
, CONVERT(decimal(18,2),S.AvgRaw) as AvgAmt
, CONVERT(decimal(18,2),D.AvgFiltered) as AvgAmt_rem
--, S.StedRaw as SDevPreFilter, D.StedvFiltered
FROM ctFiltered as D --Tie the filtered back to the stats so you can see the filter criteria
INNER JOIN cteStats as S on D.Region = S.Region AND D.City = S.City 

编辑:添加一个CASE语句来测试标准偏差是否为零(该城市的单个值(,并将其替换为一个小的StdDev,这无关紧要,因为如果有单个值,就不会有异常值!

相关内容

  • 没有找到相关文章