Query to select samples from A and B

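I have two populations, A and B. I need to first select 10 unique random samples from A, and then 10 unique random samples from B that are not among the samples selected from A. Uniqueness is based on the ID only. There will be 10 unique IDs, but the total number of rows can be higher.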

I followed these steps. First, I take 10 distinct IDs from A and use them to fetch the corresponding rows.

select * from A t1 inner join (select distinct id from A
tablesample(10 rows)) t2 where t1.id = t2.id

I stored this as A_records.

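Next, I created a temporary view to hold the pool available for B. It removes from B any ID that already appears in the first sample (this is not strictly required, but I did it for my own sanity).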

create or replace view B_pool as (select distinct id from B where B.Id
not in (select distinct ID from A_records))

Now I select the sample from B.

select * from B t1 inner join (select distinct ID from B_pool
tablesample(10 rows)) t2 on t1.id = t2.id

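I think this logic should work. However, I still seem to get duplicates across the overall sample (the sample from B contains IDs that are in the sample from A).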

How can I avoid these duplicates?

Some sample data for populations A and B, and the desired results for A and B:

(Images: sample data, desired result)

The queries look fine to me. The row IDs that each one produces are not contained in the other set.
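One way to check this is to run the three steps against toy data. The following is only a minimal sketch under several assumptions that are not in the question: it uses an in-memory SQLite database driven from Python, hypothetical toy tables, and it stores A_records as a table rather than a view. SQLite has no tablesample, so order by random() limit 10 stands in for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical toy data: ids 1..30 in A, ids 11..40 in B, two rows per id.
cur.execute("create table A (id integer, val text)")
cur.execute("create table B (id integer, val text)")
cur.executemany("insert into A values (?, ?)",
                [(i, f"a{i}-{k}") for i in range(1, 31) for k in range(2)])
cur.executemany("insert into B values (?, ?)",
                [(i, f"b{i}-{k}") for i in range(11, 41) for k in range(2)])

# Step 1: draw 10 ids from A once and store the matching rows as a table.
cur.execute("""
    create table A_records as
    select t1.* from A t1
    inner join (select id from (select distinct id from A)
                order by random() limit 10) t2
    on t1.id = t2.id
""")

# Step 2: pool of B ids that were not drawn from A.
cur.execute("""
    create view B_pool as
    select distinct id from B
    where id not in (select distinct id from A_records)
""")

# Step 3: draw 10 ids from the pool and fetch the matching rows from B.
cur.execute("""
    create table B_records as
    select t1.* from B t1
    inner join (select id from B_pool order by random() limit 10) t2
    on t1.id = t2.id
""")

a_ids = {row[0] for row in cur.execute("select distinct id from A_records")}
b_ids = {row[0] for row in cur.execute("select distinct id from B_records")}
overlap = a_ids & b_ids
print(len(a_ids), len(b_ids), overlap)
```

In this sketch the overlap is empty because the sample from A is drawn exactly once. If A_records were instead a view over a random sample, an engine might re-evaluate it on each reference and return different IDs each time; whether that applies to the setup in the question is an assumption worth checking.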

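A simple way to get different IDs is to use the modulo function. For example, use where mod(id,2) = 0 for one data set and where mod(id,2) = 1 for the other. Of course, as long as the tables have enough rows, you can divide by any number to make the split look more random than even IDs in one set and odd IDs in the other, for example: where mod(id,123) = 45 (with a different remainder for the other data set).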

Complete queries:

select *
from A 
where id in (select distinct id 
from A
where mod(id,2) = 0 
limit 10);
select * 
from B
where id in (select distinct id 
from B
where mod(id,2) = 1 
limit 10);
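The two queries above can be tried on toy data. This is a minimal sketch, assuming an in-memory SQLite database driven from Python and hypothetical tables in which A and B share every ID; SQLite's % operator stands in for mod(id,2).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table A (id integer, val text)")
conn.execute("create table B (id integer, val text)")

# Hypothetical toy data: both tables hold ids 1..100, two rows per id.
rows = [(i, f"row{k}") for i in range(1, 101) for k in range(2)]
conn.executemany("insert into A values (?, ?)", rows)
conn.executemany("insert into B values (?, ?)", rows)

# Even ids only for A (id % 2 replaces mod(id,2) in SQLite).
sample_a = conn.execute("""
    select * from A
    where id in (select distinct id from A where id % 2 = 0 limit 10)
""").fetchall()

# Odd ids only for B, so no id can appear in both samples.
sample_b = conn.execute("""
    select * from B
    where id in (select distinct id from B where id % 2 = 1 limit 10)
""").fetchall()

ids_a = {r[0] for r in sample_a}
ids_b = {r[0] for r in sample_b}
print(len(ids_a), len(ids_b), ids_a & ids_b)
```

Each sample has 10 distinct IDs and more than 10 rows, and the two ID sets cannot intersect because one contains only even IDs and the other only odd IDs.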

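If you need randomness, you can add an ORDER BY clause to the subquery, ordering by a random expression (for example order by rand(), if your engine provides such a function) before the limit is applied.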
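A randomized variant of the modulo queries might look like the sketch below. It again assumes an in-memory SQLite database driven from Python and hypothetical toy tables; random() is SQLite's function, and other engines may name it differently. The distinct IDs are computed in an inner derived table so that the random ordering is applied to one row per ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table A (id integer, val text)")
conn.execute("create table B (id integer, val text)")

# Hypothetical toy data: both tables hold ids 1..200, two rows per id.
data = [(i, f"row{k}") for i in range(1, 201) for k in range(2)]
conn.executemany("insert into A values (?, ?)", data)
conn.executemany("insert into B values (?, ?)", data)

# Random 10 even ids from A; % replaces mod() in SQLite.
rand_a = conn.execute("""
    select * from A
    where id in (select id
                 from (select distinct id from A where id % 2 = 0)
                 order by random()
                 limit 10)
""").fetchall()

# Random 10 odd ids from B.
rand_b = conn.execute("""
    select * from B
    where id in (select id
                 from (select distinct id from B where id % 2 = 1)
                 order by random()
                 limit 10)
""").fetchall()

rand_ids_a = {r[0] for r in rand_a}
rand_ids_b = {r[0] for r in rand_b}
print(sorted(rand_ids_a), sorted(rand_ids_b))
```

The chosen IDs change from run to run, but the even/odd split still guarantees that the two samples share no ID.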
