R - Selecting specific rows of tab-delimited files using bash (Linux)



I have a directory with many tab-delimited txt files, each with several rows and columns, e.g.

File1
Id    Sample   Time ...  Variant[Column16] ...
1     s1       t0        c.B481A:p.G861S
2     s2       t2        c.C221C:p.D461W
3     s5       t1        c.G31T:p.G61R
File2
Id    Sample   Time ...  Variant[Column16] ...
1     s1       t0        c.B481A:p.G861S
2     s2       t2        c.C21C:p.D61W
3     s5       t1        c.G1T:p.G1R

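What I am looking for is to create a new file with:

  • All the distinct (unique) variants
  • The number of times each variant is repeated
  • And the files in which it occurs

i.e.: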

NewFile
Variant             Nº of repeated       Location
c.B481A:p.G861S     2                    File1,File2
c.C221C:p.D461W     1                    File1
c.G31T:p.G61R       1                    File1
c.C21C:p.D61W       1                    File2
c.G1T:p.G1R         1                    File2

I think a basic bash script using awk, sort and uniq would work, but I don't know where to start. Or, if it is easier with RStudio or Python 3, I could try that.

Thanks!!

Pure bash. Requires version 4.0+ (for associative arrays).

# two associative arrays
declare -A files
declare -A count
# use a glob pattern that matches your files
for f in File{1,2}; do
    {
        read -r header              # skip the header line
        while read -ra fields; do
            variant=${fields[3]}    # use index "15" for 16th column
            (( count[$variant] += 1 ))
            files[$variant]+=",$f"
        done
    } < "$f"
done
# print variant, count and comma-separated file list (leading comma stripped)
for variant in "${!count[@]}"; do
    printf "%s\t%d\t%s\n" "$variant" "${count[$variant]}" "${files[$variant]#,}"
done

Output

c.B481A:p.G861S 2   File1,File2
c.G1T:p.G1R 1   File2
c.C221C:p.D461W 1   File1
c.G31T:p.G61R   1   File1
c.C21C:p.D61W   1   File2

The order of the output lines is not deterministic: associative arrays have no particular order.
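If a stable order is wanted, the tab-separated report can be piped through sort. The following is only a minimal sketch, assuming the three-column layout (variant, count, files) printed by the script; the rows stored in the hypothetical `report` variable are illustrative stand-ins for the script's real output.

```shell
# Hypothetical sample of the unsorted report (tab-separated: variant, count, files)
report=$(printf '%s\t%d\t%s\n' \
    'c.G1T:p.G1R' 1 'File2' \
    'c.B481A:p.G861S' 2 'File1,File2' \
    'c.C221C:p.D461W' 1 'File1')

# Sort by count (field 2, numeric, descending), then by variant name (field 1).
# LC_ALL=C makes the name comparison byte-wise and therefore reproducible.
sorted=$(printf '%s\n' "$report" | LC_ALL=C sort -t "$(printf '\t')" -k2,2nr -k1,1)
printf '%s\n' "$sorted"
```

In practice the final `for` loop of the script could be piped straight into the same `sort` command instead of going through a variable.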

I thought pure bash would be hard, but everyone has some awk :D

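awk 'FNR==1{next}                   # skip the header line of each file
{
    ++n[$16]                        # count occurrences of the variant in column 16
    # FILENAME is the file currently being read (portable; ARGV[ARGIND] is gawk-only)
    if ($16 in a) {
        a[$16]=a[$16]","FILENAME
    }else{
        a[$16]=FILENAME
    }
}
END{
    printf("%-24s %6s    %s\n","Variant","Nº","Location");
    for (v in n) printf("%-24s %6d    %s\n",v,n[v],a[v])}' *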
