
SQL: perform undersampling to select a subset of majority class

Tags: sql, google-bigquery, sampling
Updated: 2023-09-22

I have a table that looks like this:

user_id  target
1278     1
9809     0
3345     0
9800     0
1298     1
1223     0

One approach uses window functions:

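-- Keep every target = 1 row plus an equal number of randomly chosen target = 0 rows.
select t.* except (seqnum, cnt1)
from (select t.*,
             -- random rank of each row within its class
             row_number() over (partition by target order by rand()) as seqnum,
             -- total number of target = 1 rows, repeated on every row
             countif(target = 1) over () as cnt1
      from t
     ) t
where seqnum <= cnt1;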
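
To check the result, one can count the rows kept for each class. This is a minimal sketch, assuming the same table `t` whose `target` column holds only 0 and 1, with target = 1 as the minority class. The CTE name `sampled` and the column alias `n` are illustrative, not part of the original answer.

```sql
-- Sketch: wrap the window-function query in a CTE (hypothetical name `sampled`)
-- and count how many rows of each class survive the sampling.
with sampled as (
  select t.* except (seqnum, cnt1)
  from (select t.*,
               row_number() over (partition by target order by rand()) as seqnum,
               countif(target = 1) over () as cnt1
        from t
       ) t
  where seqnum <= cnt1
)
select target, count(*) as n
from sampled
group by target
order by target;
```

Under those assumptions both classes should come back with the same count, because exactly `cnt1` rows are kept from the target = 0 partition.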

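The query above may have performance problems, and may even exceed resource limits, because of the amount of data that has to be sorted. An approximate approach might also work for your purposes: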

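-- Keep each row with probability cnt1 / cnt: 1 for target = 1 rows,
-- and (number of 1s) / (number of 0s) for target = 0 rows.
select t.* except (cnt, cnt1)
from (select t.*,
             count(*) over (partition by target) as cnt,   -- rows in this row's class
             countif(target = 1) over () as cnt1           -- rows with target = 1
      from t
     ) t
where rand() < cnt1 * 1.0 / cnt;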

This does not guarantee exactly the same number of 0s and 1s, but the two counts will be very close.
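
As a rough check on how close the two counts should be, assume each target = 0 row is kept independently with probability $n_1/n_0$, where $n_0$ and $n_1$ are the numbers of target = 0 and target = 1 rows (symbols introduced here for illustration). The number $K$ of kept target = 0 rows is then binomial:

```latex
K \sim \mathrm{Binomial}\!\left(n_0, \tfrac{n_1}{n_0}\right), \qquad
\mathbb{E}[K] = n_0 \cdot \frac{n_1}{n_0} = n_1, \qquad
\operatorname{sd}(K) = \sqrt{n_1\left(1 - \frac{n_1}{n_0}\right)} \le \sqrt{n_1}
```

For example, with hypothetical counts $n_0 = 1{,}000{,}000$ and $n_1 = 10{,}000$, about 10,000 target = 0 rows are expected, with a standard deviation of roughly 99.5 rows, which is about 1% of $n_1$.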

Consider the approach below: it keeps all target = 1 rows and roughly 50% of the target = 0 rows.

select * 
from `dataset.mytable`
where if(target = 1, true, rand() < 0.5)
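
The fixed 0.5 gives balanced classes only when target = 0 rows are about twice as common as target = 1 rows. A possible variant derives the fraction from the data instead. This is a sketch, assuming the same `dataset.mytable` with target = 0 as the majority class. The CTE name `ratio` and the column `keep_fraction` are illustrative names.

```sql
-- Sketch: compute the keep fraction for target = 0 rows from the class counts
-- rather than hard-coding 0.5. `ratio` and `keep_fraction` are hypothetical names.
with ratio as (
  select safe_divide(countif(target = 1), countif(target = 0)) as keep_fraction
  from `dataset.mytable`
)
select m.*
from `dataset.mytable` as m
cross join ratio
-- target = 1 rows are always kept; target = 0 rows are kept with probability keep_fraction
where if(m.target = 1, true, rand() < ratio.keep_fraction);
```

As with the other random-filter approach, the class counts come out approximately equal, not exactly equal.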
