用递归或函数代替SELECT中的迭代INSERT来遍历Postgres中的路径



在我的模式中,数据集有许多cfilescfile有一个数据集。每个cfile也有许多存储在jsonb中的动态属性值。

这里小提琴

SELECT * FROM datasets;
id |   name   
----+----------
1 | Dataset1
2 | Dataset2

SELECT * FROM cfiles WHERE dataset_id=1;
id | dataset_id |            path             |                     property_values                      
----+------------+-----------------------------+----------------------------------------------------------
1 |          1 | dir_i/file1.txt             | {"Project": "ProjW", "Sample Names": ["sampA", "sampB"]}
2 |          1 | dir_i/dir_j/file2.txt       | {"Project": "ProjX", "Sample Names": ["sampA", "sampC"]}
3 |          1 | dir_i/dir_j/dir_k/file3.txt | {"Project": "ProjY", "Sample Names": ["sampD"]}
4 |          1 | dir_m/file4.txt             | {"Project": "ProjZ", "Sample Names": ["sampE"]}

从这个SO问题和奇妙的答案,我有这个问题:

INSERT into agg_prop_vals(dataset_id, path, sample_names, projects)
SELECT DISTINCT
cfiles.dataset_id,
'.' as path,
-- ** path specific:
-- 'dir_i/dir_j/' as path,
h."Sample Names", h."Project"
FROM (
SELECT
dataset_id,
string_agg(DISTINCT "Sample Names", '; ' ORDER BY "Sample Names") as "Sample Names",
string_agg(DISTINCT "Project", '; ' ORDER BY "Project") as "Project"
FROM (
SELECT
cfiles.dataset_id as dataset_id,
property_values ->> 'Project' as "Project",
jsonb_array_elements_text(property_values -> 'Sample Names') as "Sample Names"
FROM cfiles
WHERE cfiles.dataset_id=1
-- ** path specific:
-- AND cfiles.path LIKE 'dir_i/dir_j/%'
) g GROUP BY dataset_id
) h
JOIN cfiles ON (cfiles.dataset_id=h.dataset_id)
WHERE cfiles.dataset_id=1
ON CONFLICT (dataset_id, path)
DO UPDATE SET
sample_names = excluded.sample_names,
projects = excluded.projects

生成特定数据集的聚合cfile属性值表:

SELECT * FROM agg_prop_vals;
dataset_id |        path        |           sample_names            |          projects          
------------+--------------------+-----------------------------------+----------------------------
1 | .                  | sampA; sampB; sampC; sampD; sampE | ProjW; ProjX; ProjY; ProjZ

现在这是伟大的获得我的聚合值每数据集,但我现在也想得到他们每数据集+路径,所以像这样:

SELECT * FROM agg_prop_vals;
dataset_id |        path        |           sample_names            |          projects          
------------+--------------------+-----------------------------------+----------------------------
1 | .                  | sampA; sampB; sampC; sampD; sampE | ProjW; ProjX; ProjY; ProjZ
1 | dir_i/             | sampA; sampB; sampC; sampD        | ProjW; ProjX; ProjY
1 | dir_i/dir_j/       | sampA; sampC; sampD               | ProjX; ProjY
1 | dir_i/dir_j/dir_k/ | sampD                             | ProjY
1 | dir_m/             | sampE                             | ProjZ

所有的处理都是一次完成一个数据集,所以我很高兴迭代数据集(所以WHERE cfiles.dataset_id=1可以被忽略/作为这个例子中的常量)。我遇到的问题是遍历路径。

我可以对数据集中的每个路径运行相同的查询(例如取消注释** path specific:),但是当单个数据集中有数千个子路径时,这可能需要长达一个小时。如:

("SELECT DISTINCT SUBSTRING(path, '(.*/).*') FROM cfiles WHERE dataset_id=1").each do |sub_path|
aggregate_query(sub_path)
end

但是这也是低效的,因为它不是在每一级使用已经计算过的子目录的聚合,而是在每一级对所有子文件再次执行查询。

ie to calculate:

dataset_id |        path        |           sample_names            |          projects          
------------+--------------------+-----------------------------------+----------------------------
1 | .                  | sampA; sampB; sampC; sampD; sampE | ProjW; ProjX; ProjY; ProjZ

它应该添加预先计算的顶级子目录的聚合:

dataset_id |        path        |           sample_names            |          projects          
------------+--------------------+-----------------------------------+----------------------------
1 | dir_i/             | sampA; sampB; sampC; sampD        | ProjW; ProjX; ProjY

+:

dataset_id |        path        |           sample_names            |          projects          
------------+--------------------+-----------------------------------+----------------------------
1 | dir_m/             | sampE                             | ProjZ

而不是遍历所有子文件。

是否有任何方法我可以用某种查询或PL/SQL来替换迭代,使用递归或函数来遍历目录路径并相应地填充agg_prop_vals表?

其他点:

  • 我不能重组/规范化现有的datasetscfiles表,但我可以改变agg_prop_vals表并添加额外的表
  • 我不一定需要修改ON CONFLICT块-我可以在应用程序
  • 中拆分它

我将加载"find/lib/bin/etc"在一个表中大约有27k行…

BEGIN;
CREATE TABLE _files( path TEXT NOT NULL );
copy _files (path) from 'files.txt';
CREATE TABLE files( 
id SERIAL PRIMARY KEY, 
path TEXT NOT NULL,
dataset_id INTEGER NOT NULL,
attrib TEXT[] NOT NULL
);
INSERT INTO files (path,dataset_id,attrib) SELECT path,n,ARRAY[RIGHT(path,1),RIGHT(path,2)]
FROM _files CROSS JOIN (SELECT generate_series(1,10) n) n;
COMMIT;
VACUUM ANALYZE files;
CREATE INDEX files_dataset ON files(dataset_id);

我已经添加了一个generate_series来将文件数量乘以10。

专栏"attrib"包含两个文本值,它们将成为您的"样本"。

我将假设路径中没有双斜杠,并且所有路径都不是以斜杠结束的。如果不是这种情况,您必须在查询中的适当位置放置它:

regexp_replace( regexp_replace( path, '(//+)', '/', 'g' ), '/$', '')

然后我们添加一个parent_path列。Postgres regexp很慢,所以这需要一些时间。

CREATE TEMPORARY TABLE fp AS
SELECT *, regexp_replace( path, '/[^/]+$', '' ) AS parent_path 
FROM files WHERE dataset_id=1;

旁注:要在SQL中建模路径/树,您可以使用parent_id,或者只是将路径粘贴在列中,但在这种情况下,数组比字符串工作得更好,因为它很容易访问元素。

我添加了一个"属性"列,类型为TEXT[],它对上面聚合sample_names和项目的查询的结果进行建模。它是一个数组,因为我们稍后要把它拆开。

。现在我们必须构建一个目录树,其中包括没有文件的目录,这些目录不在上面生成的parent_path中,这意味着它们必须通过递归查询生成。因为SQL就是SQL,所以它不是从根目录开始,而是从完整路径开始,然后反向运行。

CREATE TEMPORARY TABLE dirs (
path TEXT UNIQUE NOT NULL,
parent_path TEXT NOT NULL,
attrib1 TEXT[] NULL,
attrib2 TEXT[] NULL );
INSERT INTO dirs (path, parent_path)
WITH RECURSIVE pdirs AS (SELECT * FROM
(SELECT parent_path AS path,
regexp_replace( parent_path, '/[^/]+$', '' ) AS parent_path FROM fp
) x1
UNION  SELECT * FROM
(SELECT parent_path AS path,
regexp_replace( parent_path, '/[^/]+$', '' ) AS parent_path FROM pdirs
) x2 WHERE path != '' OR parent_path != ''
) SELECT * FROM pdirs ORDER BY path;

Not,对于表fp中的每一行,拆分每行中的属性数组,删除重复项,并将其重新组装成一个数组。有两种方法可以做到这一点……第一个更快,但需要在临时表上建立索引。那么,让我们使用第二个。

SELECT dirs.path, (SELECT array_agg(a) FROM (SELECT DISTINCT unnest(attrib) a FROM fp WHERE fp.parent_path=dirs.path) x) FROM dirs;
SELECT parent_path, array_agg(DISTINCT att) FROM (SELECT parent_path, unnest(attrib) att FROM fp) x GROUP BY parent_path;

现在,"just"对表dirs执行相同的递归操作,沿路径传播属性…两次,一次用于示例,一次用于项目,因为递归CTE在查询中不能被引用多次…

WITH RECURSIVE rdirs AS (
SELECT dirs.*, attrib FROM
(SELECT parent_path, array_agg(DISTINCT att) attrib FROM (SELECT parent_path, unnest(attrib) att FROM fp) x GROUP BY parent_path) AS x
JOIN dirs ON (dirs.path=x.parent_path)
UNION ALL
SELECT dirs.*, attrib FROM
(SELECT parent_path, array_agg(DISTINCT att) attrib FROM (SELECT parent_path, unnest(attrib) att FROM rdirs) x GROUP BY parent_path) AS x
JOIN dirs ON (dirs.path=x.parent_path)
WHERE dirs.path != '' OR dirs.parent_path != ''
)
UPDATE dirs 
SET attrib1=rdirs.attrib
FROM rdirs
WHERE dirs.path=rdirs.path;

因此,您对projects列再次执行此操作(相应地更改列名),临时表dirs应该包含所需的结果!

如果你喜欢挑战的话,很可能只需要一个查询,没有临时表就可以完成所有这些!

相关内容

  • 没有找到相关文章

最新更新