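I have a 50 GB text file like the one shown below. I want to remove chr from the first and third columns, skipping any line that starts with #. I know I can add chr this way, but I'm not sure how to remove it:
cat ${file}.txt | awk -F"\t" '{if ($0 !~ /^#/) {print "chr"$0} else{print $0}}' > ${file}_moreCHR.txt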
file:
##contig=<ID=HLA-DRB1*>
##reference=file:////Homo_sapiens_assembly38.fasta
##source=ApplyVQSR
##source=SelectVariants
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 69511 chr1:69511:A:G A G 11157600 PASS
chr1 69536 chr1:69536:C:A C A 581.98 PASS
chr1 69536 chr1:69536:C:T C T 581.98 PASS
Result I want:
##contig=<ID=HLA-DRB1*>
##reference=file:////Homo_sapiens_assembly38.fasta
##source=ApplyVQSR
##source=SelectVariants
#CHROM POS ID REF ALT QUAL FILTER INFO
1 69511 1:69511:A:G A G 11157600 PASS
1 69536 1:69536:C:A C A 581.98 PASS
1 69536 1:69536:C:T C T 581.98 PASS
Please try the following:
awk 'BEGIN {FS = OFS = "\t"} # set the input and output delimiters to a tab
!/^#/ {sub("^chr", "", $1); sub("^chr", "", $3)} # if the line does not start with "#", modify the 1st and 3rd columns
1 # print the line
' "${file}.txt" > "${file}_lessCHR.txt"
Result:
##contig=<ID=HLA-DRB1*>
##reference=file:////Homo_sapiens_assembly38.fasta
##source=ApplyVQSR
##source=SelectVariants
#CHROM POS ID REF ALT QUAL FILTER INFO
1 69511 1:69511:A:G A G 11157600 PASS
1 69536 1:69536:C:A C A 581.98 PASS
1 69536 1:69536:C:T C T 581.98 PASS
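Before running this on the full 50 GB file, it may be worth checking the behaviour on a few lines first. The following is only a sketch: the file names sample.txt and sample_lessCHR.txt are made up for illustration, and the sample is assumed to be tab-separated like a VCF file.

```shell
# Build a tiny tab-separated sample (sample.txt is a hypothetical name).
printf '%s\n' '##source=SelectVariants' > sample.txt
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\n' >> sample.txt
printf 'chr1\t69511\tchr1:69511:A:G\tA\tG\t11157600\tPASS\n' >> sample.txt

# Strip a leading "chr" from columns 1 and 3 on every line that does not start with "#".
awk 'BEGIN {FS = OFS = "\t"}
!/^#/ {sub(/^chr/, "", $1); sub(/^chr/, "", $3)}
1' sample.txt > sample_lessCHR.txt

# Inspect the result; header lines should be unchanged.
cat sample_lessCHR.txt
```

On the real file, something like head -n 1000 piped into the same awk program gives a similar quick check without reading all 50 GB.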
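Here is a more general solution: list all the column numbers in the variable cols, so the substitution does not have to be written out once per column. Try the following: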
awk -v cols="1,3" '
BEGIN{
FS=OFS="\t"
num=split(cols,arr1,",")
for(i=1;i<=num;i++){
columns[arr1[i]]
}
}
!/^#/{
for(j in columns){
sub(/^chr/,"",$j)
}
}
1
' Input_file
Explanation: a detailed, commented version of the code above.
awk -v cols="1,3" ' ##Start the awk program and pass in cols, the comma-separated list of column numbers (1 and 3 here).
BEGIN{ ##Start the BEGIN section of the program.
FS=OFS="\t" ##Set the input and output field separators to a tab.
num=split(cols,arr1,",") ##Split the cols variable on commas into the array arr1; num is the number of entries.
for(i=1;i<=num;i++){ ##Loop from 1 to num.
columns[arr1[i]] ##Create an entry in the columns array whose index is the column number stored in arr1[i].
}
}
!/^#/{ ##If the line does not start with #, do the following.
for(j in columns){ ##Loop over the column numbers stored in columns.
sub(/^chr/,"",$j) ##Replace a leading chr in column $j with an empty string.
}
}
1 ##Print the current line.
' Input_file ##Name of the input file.
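For repeated use, the same idea can be wrapped in a small shell function. This is a minimal sketch, not part of the answer above: the function name strip_chr and the file demo.txt are hypothetical, and the input is assumed to be tab-separated.

```shell
# strip_chr is a hypothetical helper; usage: strip_chr COLS FILE
# COLS is a comma-separated list of column numbers, e.g. "1,3".
strip_chr() {
    awk -v cols="$1" '
    BEGIN {
        FS = OFS = "\t"
        n = split(cols, list, ",")                 # list[1..n] holds the column numbers
        for (i = 1; i <= n; i++) wanted[list[i]]   # referencing an element creates it
    }
    !/^#/ { for (c in wanted) sub(/^chr/, "", $c) }  # strip a leading chr from each listed column
    1                                                # print every line, headers included
    ' "$2"
}

# Made-up one-line sample: column 4 is not listed, so it keeps its chr prefix.
printf 'chr1\t69511\tchr1:69511:A:G\tchrX\n' > demo.txt
strip_chr "1,3" demo.txt
```

Every column number in COLS should exist in the data: calling sub() on a field beyond the last one makes awk create empty fields, which changes the line.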
If you want to remove 'chr' wherever it appears in a line, the following deletes every 'chr' string and then drops the lines that start with a hash mark:
sed -e 's/chr//g' ${file}.txt | grep -v '^#' > ${file}_noCHR.txt
If you don't want 'chr' removed from the other columns of a line, you will need to adjust the sed regular expression slightly.
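Note that the pipeline above drops the header lines, whereas the desired output in the question keeps them. A possible variant, sketched here with made-up file names and sample data, uses a negated sed address so the substitution runs only on lines that do not start with #, and the header lines pass through unchanged:

```shell
# Made-up sample: one header line that itself contains "chr", and one data line.
printf '%s\n' '##contig=<ID=chrUn>' > hdr_demo.txt
printf 'chr1\t69511\tchr1:69511:A:G\n' >> hdr_demo.txt

# /^#/! restricts the s command to lines that do NOT start with "#".
sed '/^#/!s/chr//g' hdr_demo.txt > hdr_demo_noCHR.txt

cat hdr_demo_noCHR.txt
```

This still removes every 'chr' in a data line, not just those in the first and third columns.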
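This might work for you (GNU sed):
sed -E 's/^chr(\S+\s+\S+\s+)chr/\1/' file
This only changes lines that begin with chr, so lines starting with # are left untouched.
It can also be used as part of a pipeline:
cat oldFile | sed -E 's/^chr(\S+\s+\S+\s+)chr/\1/' > newFile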