I have a directory with many tab-separated .txt files, each with a few rows and columns, for example:
File1
Id Sample Time ... Variant[Column16] ...
1 s1 t0 c.B481A:p.G861S
2 s2 t2 c.C221C:p.D461W
3 s5 t1 c.G31T:p.G61R
File2
Id Sample Time ... Variant[Column16] ...
1 s1 t0 c.B481A:p.G861S
2 s2 t2 c.C21C:p.D61W
3 s5 t1 c.G1T:p.G1R
What I'm looking for is to create a new file with:
- all the distinct (uniq) variants
- the number of times each variant is repeated
- and the file(s) where it occurs
i.e.:
NewFile
Variant Nº of repeats Location
c.B481A:p.G861S 2 File1,File2
c.C221C:p.D461W 1 File1
c.G31T:p.G61R 1 File1
c.C21C:p.D61W 1 File2
c.G1T:p.G1R 1 File2
I think a basic script in bash with awk, sort and uniq would do it, but I don't know where to start. Or, if it's easier in RStudio or Python (3), I can try that.
Thanks!!
Pure bash. Requires version 4.0+ (for associative arrays).
# two associative arrays
declare -A files
declare -A count
# use a glob pattern that matches your files
for f in File{1,2}; do
{
read -r header    # discard the header line
while read -ra fields; do
variant=${fields[3]} # use index "15" for 16th column
(( count[$variant] += 1 ))
files[$variant]+=",$f"
done
} < "$f"
done
for variant in "${!count[@]}"; do
printf "%s\t%d\t%s\n" "$variant" "${count[$variant]}" "${files[$variant]#,}"
done
Output
c.B481A:p.G861S 2 File1,File2
c.G1T:p.G1R 1 File2
c.C221C:p.D461W 1 File1
c.G31T:p.G61R 1 File1
c.C21C:p.D61W 1 File2
The order of the output lines is nondeterministic: associative arrays have no particular order.
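Since the order is nondeterministic, you can pipe the script's output through sort. A minimal sketch, assuming the tab-separated output was saved to a hypothetical file variants.tsv:

```shell
# Hypothetical sample of the script's tab-separated output.
printf 'c.G1T:p.G1R\t1\tFile2\nc.B481A:p.G861S\t2\tFile1,File2\n' > variants.tsv
# Sort numerically on the 2nd (count) column, highest count first.
sort -t$'\t' -k2,2nr variants.tsv
# The c.B481A:p.G861S line (count 2) comes first.
```

`-t$'\t'` sets the field separator to a literal tab (a bash-ism, fine since the script already needs bash 4+), and `-k2,2nr` sorts on exactly the second field, numerically, in reverse.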
I thought pure bash would be hard, but everyone has a bit of awk :D
awk 'FNR==1{next}                       # skip each file's header line
{
  ++n[$16]                              # count occurrences of the 16th column
  if ($16 in a) {
    a[$16] = a[$16] "," FILENAME        # FILENAME is portable; ARGV[ARGIND] is gawk-only
  } else {
    a[$16] = FILENAME
  }
}
END{
  printf("%-24s %6s %s\n", "Variant", "Nº", "Location")
  for (v in n) printf("%-24s %6d %s\n", v, n[v], a[v])
}' *