How to remove a string from specific column values in Unix



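I have a 50 GB text file like the one shown below. I want to remove chr from the first and third columns, leaving any line that starts with # untouched. I know I can add chr this way, but I am not sure how to remove it:

cat ${file}.txt | awk -F"\t" '{if ($0 !~ /^#/) {print "chr"$0} else {print $0}}' > ${file}_moreCHR.txt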

file:

##contig=<ID=HLA-DRB1*>
##reference=file:////Homo_sapiens_assembly38.fasta
##source=ApplyVQSR
##source=SelectVariants
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    
chr1    69511   chr1:69511:A:G  A       G       11157600        PASS   
chr1    69536   chr1:69536:C:A  C       A       581.98  PASS    
chr1    69536   chr1:69536:C:T  C       T       581.98  PASS

Result I want:

##contig=<ID=HLA-DRB1*>
##reference=file:////Homo_sapiens_assembly38.fasta
##source=ApplyVQSR
##source=SelectVariants
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    
1    69511   1:69511:A:G  A       G       11157600        PASS   
1    69536   1:69536:C:A  C       A       581.98  PASS    
1    69536   1:69536:C:T  C       T       581.98  PASS

Try the following:

awk 'BEGIN {FS = OFS = "\t"}                        # set the input and output delimiters to a tab
!/^#/ {sub(/^chr/, "", $1); sub(/^chr/, "", $3)}    # if the line does not start with "#", modify the 1st and 3rd columns
1                                                   # print the line
' "${file}.txt" > "${file}_lessCHR.txt"

Result:

##contig=<ID=HLA-DRB1*>
##reference=file:////Homo_sapiens_assembly38.fasta
##source=ApplyVQSR
##source=SelectVariants
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1   69511   1:69511:A:G A   G   11157600    PASS
1   69536   1:69536:C:A C   A   581.98  PASS
1   69536   1:69536:C:T C   T   581.98  PASS
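Before running this over a 50 GB file, it may be worth checking the command on a few lines first. The following is a minimal sketch, not part of the original answer: the sample rows are shortened from the question, and the temporary file created with mktemp is only an illustration.

```shell
# Build a tiny tab-separated sample in a temporary file (illustrative data only).
sample=$(mktemp)
printf '##source=ApplyVQSR\n' > "$sample"
printf '#CHROM\tPOS\tID\tREF\tALT\n' >> "$sample"
printf 'chr1\t69511\tchr1:69511:A:G\tA\tG\n' >> "$sample"

# Same awk program as above: strip a leading "chr" from columns 1 and 3
# on every line that does not start with "#".
result=$(awk 'BEGIN {FS = OFS = "\t"}
!/^#/ {sub(/^chr/, "", $1); sub(/^chr/, "", $3)}
1' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

The two header lines should come back unchanged, and only the data line should lose its chr prefixes.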

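Adding a more generic solution here: pass all the column numbers in the variable cols, so the substitution does not have to be written out once per column. Try the following:

awk -v cols="1,3" '
BEGIN{
  FS=OFS="\t"
  num=split(cols,arr1,",")
  for(i=1;i<=num;i++){
    columns[arr1[i]]
  }
}
!/^#/{
  for(j in columns){
    sub(/^chr/,"",$j)
  }
}
1
' Input_file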

Explanation: a detailed, line-by-line explanation of the above.

awk -v cols="1,3" '           ##Start the awk program and set cols to the comma-separated column numbers, here 1 and 3.
BEGIN{                        ##Start the BEGIN section, which runs once before any input is read.
  FS=OFS="\t"                 ##Set the input and output field separators to a tab.
  num=split(cols,arr1,",")    ##Split cols on commas into arr1; num holds the number of elements.
  for(i=1;i<=num;i++){        ##Loop over every element of arr1.
    columns[arr1[i]]          ##Create an entry in columns whose index is the column number.
  }
}
!/^#/{                        ##If the line does not start with #, do the following.
  for(j in columns){          ##Go through each column number stored in columns.
    sub(/^chr/,"",$j)         ##Replace a leading chr in field $j with an empty string.
  }
}
1                             ##Print the current line.
' Input_file                  ##Name of the input file.
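To illustrate how the cols variable changes the behavior, here is a small sketch that wraps the program in a shell function. The function name strip_chr and the sample data are hypothetical and only serve the demonstration.

```shell
# strip_chr is a hypothetical helper name wrapping the awk program above.
# $1 = comma-separated column numbers, $2 = input file
strip_chr() {
  awk -v cols="$1" '
  BEGIN {
    FS = OFS = "\t"
    num = split(cols, arr1, ",")
    for (i = 1; i <= num; i++) columns[arr1[i]]   # remember each requested column
  }
  !/^#/ { for (j in columns) sub(/^chr/, "", $j) }  # data lines only
  1
  ' "$2"
}

# Illustrative two-line sample: one header, one data row.
sample=$(mktemp)
printf '#CHROM\tPOS\tID\n' > "$sample"
printf 'chr1\t69511\tchr1:69511:A:G\n' >> "$sample"

only_first=$(strip_chr 1 "$sample")     # touch column 1 only
both=$(strip_chr 1,3 "$sample")         # touch columns 1 and 3

printf '%s\n%s\n' "$only_first" "$both"
rm -f "$sample"
```

With cols set to 1, the ID in the third column keeps its chr prefix; with 1,3 both columns are stripped. The header line is left alone in both cases.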

If you want to remove 'chr' wherever it appears in a line, the following removes every 'chr' string and then drops the lines that start with a hash mark:

sed -e 's/chr//g' ${file}.txt | grep -v '^#' > ${file}_noCHR.txt

If you do not want 'chr' removed from the other columns of a line, the sed regular expression needs to be modified slightly.
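Note that the desired output in the question keeps the # header lines. A possible variant, sketched here as an assumption rather than as part of the original answer, uses a negated sed address so that header lines are passed through unchanged instead of being dropped. It still removes every 'chr' on data lines, including any in later columns. The sample data is illustrative.

```shell
# Illustrative sample: a header that itself contains "chr", plus one data row.
sample=$(mktemp)
printf '##contig=<ID=chr1>\n' > "$sample"
printf '#CHROM\tPOS\tID\n' >> "$sample"
printf 'chr1\t69511\tchr1:69511:A:G\n' >> "$sample"

# /^#/! applies the substitution only to lines that do NOT start with "#",
# so header lines are printed as they are.
kept=$(sed '/^#/!s/chr//g' "$sample")

printf '%s\n' "$kept"
rm -f "$sample"
```

Here the ##contig header keeps its chr1 identifier, while the data row loses chr in every position.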

This might work for you (GNU sed):

sed -E 's/^chr(\S+\s+\S+\s+)chr/\1/' file

This only changes lines that begin with chr.

It can also be used as part of a pipeline:

cat oldFile | sed -E 's/^chr(\S+\s+\S+\s+)chr/\1/' > newFile
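As a quick check of that behavior, the sketch below runs the substitution on a three-line sample. It assumes GNU sed, since \S and \s are GNU extensions, and the sample rows are illustrative.

```shell
# Illustrative sample: two header lines and one data row.
sample=$(mktemp)
printf '##contig=<ID=chr1>\n' > "$sample"
printf '#CHROM\tPOS\tID\tREF\n' >> "$sample"
printf 'chr1\t69511\tchr1:69511:A:G\tA\n' >> "$sample"

# Assumes GNU sed: remove the leading chr and the chr that starts the third
# whitespace-separated column, keeping everything in between via \1.
out=$(sed -E 's/^chr(\S+\s+\S+\s+)chr/\1/' "$sample")

printf '%s\n' "$out"
rm -f "$sample"
```

Because the pattern is anchored with ^chr, the header lines do not match and pass through unchanged, even when they contain chr further along the line.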
