Clustering: does sorting data before ingestion improve a table's pruning performance?



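My goal is to ingest data sorted by a specific column, so that the micro-partitions are also laid out in that order and pruning on that column is efficient.

I want to minimize the cost of sorting, and I would like some guidance on how often I should recluster.

For example: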

CREATE TABLE test_order(n NUMBER, s STRING);
INSERT INTO test_order 
VALUES 
(12, 'a'), 
(11, 'b'), 
(10, 'c'), 
(9, 'd'), 
(8, 'e'), 
(7, 'f'), 
(6, 'g'), 
(5, 'h'), 
(6, 'i'), 
(5, 'j'), 
(4, 'k'), 
(3, 'l'), 
(2, 'm'), 
(1, 'n');
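-- Read the rows back in sorted order (this does not change how they are stored).
SELECT * FROM test_order
ORDER BY n ASC;
-- Define a clustering key on (n, s), then recluster manually.
ALTER TABLE test_order CLUSTER BY (n, s);
ALTER TABLE test_order RECLUSTER;
SELECT n, s FROM test_order;
-- Report clustering metrics for the key (n, s).
SELECT SYSTEM$CLUSTERING_INFORMATION('test_order', '(n,s)');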

This is the clustering information after the first insert:

{
"cluster_by_keys" : "LINEAR(N, S)",
"total_partition_count" : 1,
"total_constant_partition_count" : 0,
"average_overlaps" : 0.0,
"average_depth" : 1.0,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 1,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0
}
}

This is the second insert, followed by the clustering information after it:

INSERT INTO test_order 
VALUES 
(12, 'p'), 
(11, 'f'), 
(10, 'z'), 
(9, 'y'), 
(8, 'x'), 
(7, 'w'), 
(6, 'v'), 
(5, 'u'), 
(6, 't'), 
(5, 's'), 
(4, 'r'), 
(3, 'q'), 
(2, 'p'), 
(1, 'o');

{
"cluster_by_keys" : "LINEAR(N, S)",
"total_partition_count" : 2,
"total_constant_partition_count" : 0,
"average_overlaps" : 1.0,
"average_depth" : 2.0,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 0,
"00002" : 2,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0
}
}

And then after the second recluster:

{
"cluster_by_keys" : "LINEAR(N, S)",
"total_partition_count" : 2,
"total_constant_partition_count" : 0,
"average_overlaps" : 1.0,
"average_depth" : 2.0,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 0,
"00002" : 2,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0
}
}

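Sorry, I'm new to formatting. After inserting the rows in a specific order, the clustering metrics did not change much. Is this because my example dataset is too small, or does insert order not matter for clustering performance?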

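If you are ingesting sorted data, I don't think you need to cluster the table. Your data will be naturally clustered, and you will get the pruning you want.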
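To illustrate, here is a minimal sketch of a sorted load. It assumes the raw rows first land in a staging table; the name test_order_staging and its sample rows are hypothetical and not part of the question.

```sql
-- Hypothetical staging table holding one unsorted batch (name is an assumption).
CREATE TABLE test_order_staging (n NUMBER, s STRING);
INSERT INTO test_order_staging
VALUES
(3, 'c'),
(1, 'a'),
(2, 'b');

-- Write the batch in key order so that rows with neighboring values of n
-- tend to be stored together in the newly written micro-partitions.
INSERT INTO test_order (n, s)
SELECT n, s
FROM test_order_staging
ORDER BY n, s;

-- Check how well the table is organized on the column used for pruning.
SELECT SYSTEM$CLUSTERING_INFORMATION('test_order', '(n)');
```

Note that the sort only orders rows within one batch. Each insert writes new micro-partitions, so two batches that both span n = 1 through 12, as in the question, still overlap each other, which is consistent with the average_depth of 2.0 reported there. With only 14 rows per insert, each batch fits in a single micro-partition, so the example is probably too small to show any pruning benefit.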
