Output the lines of a text file that appear 3 times, based on specific columns

I have a text file and want to output the lines whose first 4 columns occur exactly three times in the file.

chr1    1   A   T   sample1
chr1    3   G   C   sample1
chr2    1   G   C   sample1
chr2    2   T   A   sample1
chr3    4   T   A   sample1
chr1    1   A   T   sample2
chr2    3   T   A   sample2
chr3    4   T   A   sample2
chr1    1   A   T   sample3
chr2    1   G   C   sample3
chr3    4   T   A   sample3
chr1    1   A   T   sample4
chr2    1   G   C   sample4
chr5    1   A   T   sample4
chr5    2   G   C   sample4

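If a line occurs three times, I want to add two columns for the other two samples it occurs in, so that the output for the data above looks like this: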

chr2    1   G   C   sample1 sample3 sample4
chr3    4   T   A   sample1 sample2 sample3

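I would do this in R, but the file is too large to read in, so I'm looking for a solution that works on Linux. I've been looking into awk but couldn't find anything that fits this exact situation.

The file is currently not sorted.

Thanks in advance!

Edit: Thanks for all these informative answers. I went with the approach I'm most comfortable working with, but the other answers look great too and I will learn from them.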

Using GNU datamash, tr, and awk, assuming the input and output are tab-separated:

$ datamash -s -g1,2,3,4 collapse 5 < file | tr ',' '\t' | awk 'NF==7'
chr3    4       T       A       sample1 sample2 sample3

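First, datamash sorts the input file, groups it by the first four fields, and collapses the values of the 5th field into a comma-separated list. The output looks like this: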

$ datamash -s -g1,2,3,4  collapse 5 < file
chr1    1       A       T       sample1,sample2,sample3,sample4
chr1    3       G       C       sample1
chr2    1       G       C       sample1
chr2    2       G       C       sample3,sample4
chr2    2       T       A       sample1
chr2    3       T       A       sample2
chr3    4       T       A       sample1,sample2,sample3
chr5    1       A       T       sample4
chr5    2       G       C       sample4

The output is then piped to tr to convert the commas to tabs, and finally awk prints only the lines that have seven fields.

Using awk:

awk '
BEGIN { FS = OFS = "\t" }
{
    # key: the first four fields
    idx = $1 FS $2 FS $3 FS $4
    cnt[idx]++
    # append the 5th field, tab-separated
    data[idx] = (cnt[idx] == 1 ? "" : data[idx] OFS) $5
}
END {
    for (i in cnt)
        if (cnt[i] == 3) print i, data[i]
}
' file

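This maintains two arrays, both indexed by the first four fields.
For every record, the first array (cnt) increments the counter for that index, and the second (data) appends the 5th field, using a tab as the separator.

In the END block, the script iterates over the cnt array and, where the count is 3, prints the index and the corresponding value of the data array.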
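As a quick check, the approach described above can be run against a handful of lines. This is only a sketch: the function name `three_times` is made up for illustration, the sample rows are a subset of the question's data, and the trailing `sort` is an addition of this example, there because `for (i in cnt)` visits the keys in an unspecified order.

```shell
#!/bin/sh
# Hypothetical helper (name is illustrative): print every group of the
# first four columns that occurs exactly three times, samples appended.
three_times() {
    awk '
    BEGIN { FS = OFS = "\t" }
    {
        idx = $1 FS $2 FS $3 FS $4
        cnt[idx]++
        data[idx] = (cnt[idx] == 1 ? "" : data[idx] OFS) $5
    }
    END {
        for (i in cnt)
            if (cnt[i] == 3) print i, data[i]
    }
    ' "$@" | sort    # sort only to make the output order deterministic
}

# printf reuses the format for every five arguments, so each row below
# becomes one tab-separated line.
printf '%s\t%s\t%s\t%s\t%s\n' \
    chr1 1 A T sample1 \
    chr2 1 G C sample1 \
    chr3 4 T A sample1 \
    chr1 1 A T sample2 \
    chr3 4 T A sample2 \
    chr1 1 A T sample3 \
    chr2 1 G C sample3 \
    chr3 4 T A sample3 \
    chr1 1 A T sample4 \
    chr2 1 G C sample4 | three_times
```

Here `chr1 1 A T` occurs four times and is dropped, while the `chr2 1 G C` and `chr3 4 T A` groups occur three times each and are printed with their three samples. Note that both arrays hold one entry per distinct key, so memory use grows with the number of distinct four-column combinations in the file.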

For fun, a solution using sqlite (wrapped in a shell script that takes the data file as its only argument):

#!/bin/sh
file="$1"
# Consider loading your data into a persistent db if doing a lot of work
# on it, instead of a temporary one like this.
sqlite3 -batch -noheader <<EOF
.mode tabs
CREATE TEMP TABLE data(c1, c2 INTEGER, c3, c4, c5);
.import "$file" data
-- Not worth making an index for a one-off run, but for
-- repeated use would come in handy.
-- CREATE INDEX data_idx ON data(c1, c2, c3, c4);
SELECT c1, c2, c3, c4, group_concat(c5, char(9)/*tab*/)
FROM data
GROUP BY c1, c2, c3, c4
HAVING count(*) = 3
ORDER BY c1, c2, c3, c4;
EOF

Then:

$ ./demo.sh input.tsv
chr2    1   G   C   sample1 sample3 sample4
chr3    4   T   A   sample1 sample2 sample3

This might be what you're looking for:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ curr = $1 FS $2 FS $3 FS $4 }
curr != prev {
    # a new group starts: report the previous one and reset
    prt()
    cnt = samples = ""
    prev = curr
}
{ samples = (cnt++ ? samples " " : "") $5 }
END { prt() }
function prt() { if ( cnt == 3 ) print prev, samples }

$ sort -k1,4 file | awk -f tst.awk
chr2    1   G   C   sample1 sample3 sample4
chr3    4   T   A   sample1 sample2 sample3

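sort uses paging, etc. to handle input that is too large to fit in memory, so it will successfully handle larger input than the other tools can, and the awk script stores almost nothing in memory: only the current group's key and its samples.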
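The sort-then-stream idea can be packaged as a small function. This is a sketch under a few assumptions rather than the answer's exact script: the function name `groups_of_three` is hypothetical, the field separator is passed to sort explicitly with `-t`, `LC_ALL=C` is set for plain byte-wise comparison, and the samples are joined with tabs so that each lands in its own column.

```shell
#!/bin/sh
# Hypothetical wrapper (name is illustrative): sort on the first four
# tab-separated fields, then stream the result through awk, which keeps
# only the current group in memory.
groups_of_three() {
    tab=$(printf '\t')
    LC_ALL=C sort -t "$tab" -k1,4 "$@" |
    awk '
    BEGIN { FS = OFS = "\t" }
    { curr = $1 FS $2 FS $3 FS $4 }
    curr != prev {              # first line of a new group
        emit()
        cnt = 0
        samples = ""
        prev = curr
    }
    { samples = samples OFS $5; cnt++ }
    END { emit() }
    # Print the finished group only if it had exactly three lines.
    function emit() { if (cnt == 3) print prev samples }
    '
}

# Usage on a few unsorted rows taken from the question's data.
printf '%s\t%s\t%s\t%s\t%s\n' \
    chr3 4 T A sample2 \
    chr1 1 A T sample1 \
    chr2 1 G C sample4 \
    chr3 4 T A sample1 \
    chr2 1 G C sample1 \
    chr3 4 T A sample3 \
    chr2 1 G C sample3 \
    chr1 1 A T sample2 | groups_of_three
```

Because equal keys end up on adjacent lines after sorting, the awk part only needs to compare each line with the previous one. With GNU sort, the `-S` (buffer size) and `-T` (temporary directory) options can be added if the defaults do not suit a very large file; they are left out here to keep the sketch portable.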
