作为一个更长、更复杂的查询的一部分,我试图只保留一个重叠间隔的条目,以及所有不重叠的条目。这里有一个最小的例子:
create table protein (
seqid varchar(100),
start SMALLINT(5),
`end` SMALLINT(5),
cutoff FLOAT(5,4),
seq_region TEXT
);
insert into protein (seqid, start, `end`, cutoff, seq_region) values ("A0MZ66", 280, 290, 0.75, "RIQHQQKVKEL");
insert into protein (seqid, start, `end`, cutoff, seq_region) values ("A0MZ66", 314, 556, 0.75, "EEDKKELELKYQNSEEKARNLKHSVDELQKRVNQSENSVPPPPPPPPPLPPPPPNPIRSLMSMIRKRSHPSGSGAKKEKATQPETTEEVTDLKRQAVEEMMDRIKKGVHLRPVNQTARPKTKPESSKGCESAVDELKGILGTLNKSTSSRSLKSLDPENSETELERILRRRKVTAEADSSSPTGILATSESKSMPVLGSVSSVTKTALNKKTLEAEFNSPSPPTPEPGEGPRKLEGCTSSKVT");
insert into protein (seqid, start, `end`, cutoff, seq_region) values ("A0MZ66", 356, 406, 1.0, "PPPPPPLPPPPPNPIRSLMSMIRKRSHPSGSGAKKEKATQPETTEEVTDLK");
SELECT * from protein;
A0MZ66|280|290|0.75|CCCCCC
A0MZ66|314|556|0.75|ABCDEFG
A0MZ66|356|406|1.0|ABCD
条目2和3具有相同的id和重叠的范围(从一个开始和结束包含在另一个中(,但cutoff
和seq_region
不同。条目#3实际上是条目#2的子字符串。我不能放入sql的是条件:
- 如果同一seqid的两个范围重叠,请选择得分==0.75的一个(或最长的seq_region,因为这些属性绑定在一起(
所需输出应为条目#1和#2:
A0MZ66|280|290|0.75|RIQHQQKVKEL
A0MZ66|314|556|0.75|EEDKKELELKYQNSEEKARNLKHSVDELQKRVNQSENSVPPPPPPPPPLPPPPPNPIRSLMSMIRKRSHPSGSGAKKEKATQPETTEEVTDLKRQAVEEMMDRIKKGVHLRPVNQTARPKTKPESSKGCESAVDELKGILGTLNKSTSSRSLKSLDPENSETELERILRRRKVTAEADSSSPTGILATSESKSMPVLGSVSSVTKTALNKKTLEAEFNSPSPPTPEPGEGPRKLEGCTSSKVT
如何将其作为SQL查询?重叠条件可以假设一个区间总是包含在另一个区间中(开始或结束可以相同(。如果重要的话,它是一个SQLite3数据库。
我想我需要做一些自我内心的加入,或者通过操作进行分组,但我做得不太好。我非常感谢你的意见。
您可以使用NOT EXISTS
:
select p.* from protein p
where not exists (
select 1 from protein
where seqid = p.seqid and cutoff <> p.cutoff and seq_region <> p.seq_region
and seq_region like '%' || p.seq_region || '%'
)
请参阅演示
或者,如果您想使用列start
和end
来获得重叠间隔:
select p.* from protein p
where not exists (
select 1 from protein
where seqid = p.seqid and cutoff <> p.cutoff and seq_region <> p.seq_region
and start <= p.start and end >= p.end and (end - start) > (p.end - p.start)
)
请参阅演示
结果:
| seqid | start | end | cutoff | seq_region |
| ------ | ----- | --- | ------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| A0MZ66 | 280 | 290 | 0.75 | RIQHQQKVKEL |
| A0MZ66 | 314 | 556 | 0.75 | EEDKKELELKYQNSEEKARNLKHSVDELQKRVNQSENSVPPPPPPPPPLPPPPPNPIRSLMSMIRKRSHPSGSGAKKEKATQPETTEEVTDLKRQAVEEMMDRIKKGVHLRPVNQTARPKTKPESSKGCESAVDELKGILGTLNKSTSSRSLKSLDPENSETELERILRRRKVTAEADSSSPTGILATSESKSMPVLGSVSSVTKTALNKKTLEAEFNSPSPPTPEPGEGPRKLEGCTSSKVT |
这是一个"间隙&Islands";问题首先,您需要从同一组中识别行,然后根据您的标准从每一行中选择一行。例如,您可以如下所示解析此查询:
with
y as (
select *,
sum(st) over(partition by seqid order by start, end) as grp
from (
select *,
case when start >
max(`end`)
over(partition by seqid
order by start, end
rows between unbounded preceding and 1 preceding)
then 1 else 0 end as st
from protein
) x
),
z as (
select *,
row_number() over(partition by seqid, grp
order by case when cutoff = 0.75 then 1 else 2 end,
length(seq_region) desc) as rn
from y
)
select * from z where rn = 1
结果:
seqid start end cutoff seq_region st grp rn
------- ------ ---- ------- ------------- --- ---- --
A0MZ66 280 290 0.75 RIQHQQKVKEL 0 0 1
A0MZ66 314 556 0.75 EEDKKELELK... 1 1 1
请参阅DB Fiddle上的运行示例。