Quantitative dependency between queries

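I have the following problem: my table1 holds N positive samples and grows slowly over time. I want to select 10N negative samples from another, huge table. So it would look something like this: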

WITH positive_samples AS (
SELECT * FROM table1
), negative_samples AS (
SELECT * FROM table2 LIMIT 100 
)

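This query has a couple of problems: it does not guarantee that I get roughly 10 times as many negative_samples as positive_samples, and it does not pick the negative samples at random.

What is the correct query to select these two sets in Hive or Presto?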

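One algorithm that gets the output you want in Hive:

R1 = randomize the negative dataset.
R2 = assign row numbers to R1.
CP = create a table with one row and one column holding the row count of POSITIVE. Call the column positive_cnt.
J = take the Cartesian product of R2 and CP.
FINAL = select the rows from J where row_number <= (positive_cnt * 10).

The actual query (tested on some datasets):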

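with
-- pcount (CP): one-row table holding the number of positive samples
pcount as (select count(*) as positive_cnt from POSITIVE),
-- nrandom (R1): the negative samples in random order
nrandom as (select * from NEGATIVE order by rand()),
-- nrandom_row_num (R2): number the shuffled rows
nrandom_row_num as (select *, row_number() over() as row_number from nrandom),
-- jnd (J): Cartesian product, so every negative row carries positive_cnt
jnd as (select * from nrandom_row_num, pcount)
-- FINAL: keep the first 10 * positive_cnt rows
select * from jnd
where row_number <= (positive_cnt * 10);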
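
A more compact variant, offered as a sketch and not as the query tested above. A window function with an empty over() makes no promise to number rows in the order produced by an earlier order by rand(), so this version puts the random ordering inside the window itself. It assumes the same POSITIVE and NEGATIVE tables; the names nranked and rn are made up for illustration.

```sql
-- Sketch only: nranked and rn are hypothetical names, not from the original answer.
WITH pcount AS (
  -- one row holding the number of positive samples
  SELECT COUNT(*) AS positive_cnt FROM POSITIVE
),
nranked AS (
  -- number the negative rows in random order within the window
  SELECT n.*, ROW_NUMBER() OVER (ORDER BY rand()) AS rn
  FROM NEGATIVE n
)
SELECT r.*
FROM nranked r
CROSS JOIN pcount p               -- attach positive_cnt to every negative row
WHERE r.rn <= p.positive_cnt * 10;  -- keep 10 negatives per positive
```

The same statement should also be accepted by Presto, where rand() is an alias for random(). Either way the whole negative table is sorted, which can be costly when it is huge; the sort is what makes an exact 10N cutoff possible.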
