Preserving the input file's format after using an awk command



I have a pdb file that looks like this:

ATOM   1737 HG13 VAL X 121      21.938  -9.234  -0.977  0.00  0.00      SYST  
ATOM   1738  CG2 VAL X 121      21.679  -7.988   1.521  0.00  0.00      SYST  
ATOM   1739 HG21 VAL X 121      22.611  -7.674   1.050  0.00  0.00      SYST  
ATOM   1740 HG22 VAL X 121      21.340  -7.213   2.207  0.00  0.00      SYST  
ATOM   1741 HG23 VAL X 121      21.863  -8.892   2.102  0.00  0.00      SYST  
ATOM   1742  C   VAL X 121      19.373  -7.193  -1.494  1.00  0.00      SYST  
ATOM   1743  O   VAL X 121      19.712  -7.180  -2.665  1.00  0.00      SYST  
ATOM   1744  OXT VAL X 121      18.180  -7.240  -1.203  0.00  0.00      SYST  
ATOM   1745  N   CYS X 122       3.096  -0.678 -19.522  0.00  0.00      SYST  
ATOM   1746  H1  CYS X 122       2.977   0.322 -19.592  0.00  0.00      SYST  
ATOM   1747  H2  CYS X 122       2.198  -1.101 -19.340  0.00  0.00      SYST  
ATOM   1748  H3  CYS X 122       3.654  -0.993 -20.303  0.00  0.00      SYST  
ATOM   1749  CZ  CYS X 122       3.913  -0.961 -18.319  0.00  0.00      SYST  
ATOM   1750  HA  CYS X 122       3.361  -1.596 -17.626  0.00  0.00      SYST  

Whenever "OXT" appears in the 3rd field, I am trying to change the "X" in the 5th field to "Y". I wrote the following awk command:

awk '$3 == "OXT" {check=!check}check{sub(/X/,"Y",$5)}1' 1vwetest.pdb > 1vwetestoutput.pdb

However, this changes the format of my input file, like this:

ATOM   1737 HG13 VAL X 121      21.938  -9.234  -0.977  0.00  0.00      SYST  
ATOM   1738  CG2 VAL X 121      21.679  -7.988   1.521  0.00  0.00      SYST  
ATOM   1739 HG21 VAL X 121      22.611  -7.674   1.050  0.00  0.00      SYST  
ATOM   1740 HG22 VAL X 121      21.340  -7.213   2.207  0.00  0.00      SYST  
ATOM   1741 HG23 VAL X 121      21.863  -8.892   2.102  0.00  0.00      SYST  
ATOM   1742  C   VAL X 121      19.373  -7.193  -1.494  1.00  0.00      SYST  
ATOM   1743  O   VAL X 121      19.712  -7.180  -2.665  1.00  0.00      SYST  
ATOM 1744 OXT VAL Y 121 18.180 -7.240 -1.203 0.00 0.00 SYST
ATOM 1745 N CYS Y 122 3.096 -0.678 -19.522 0.00 0.00 SYST
ATOM 1746 H1 CYS Y 122 2.977 0.322 -19.592 0.00 0.00 SYST
ATOM 1747 H2 CYS Y 122 2.198 -1.101 -19.340 0.00 0.00 SYST
ATOM 1748 H3 CYS Y 122 3.654 -0.993 -20.303 0.00 0.00 SYST
ATOM 1749 CZ CYS Y 122 3.913 -0.961 -18.319 0.00 0.00 SYST
ATOM 1750 HA CYS Y 122 3.361 -1.596 -17.626 0.00 0.00 SYST

How can I preserve the column widths after replacing the value? Or is there another way to do this?

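awk doesn't really care about the amount of whitespace, and neither should you. Rather than trying to match the input exactly, set the output field separator to a tab. For example: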

awk '$3 == "OXT" {c=!c} {sub(/X/, c ? "Y" : "X",$5)}1' OFS='\t' input

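The catch is that every line has to be modified for the new separator to take effect on it (hence the sub() on every line), but that is not much of a problem.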
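
As background for why the spacing is lost in the first place: awk rebuilds the whole record, joined with OFS, whenever a field is assigned, whereas editing $0 directly leaves the original spacing alone. A minimal sketch of that difference, using one line shortened from the question's sample (the variable names `line`, `rebuilt` and `kept` are only for illustration):

```shell
# One record from the question, shortened to the first coordinate.
line='ATOM   1744  OXT VAL X 121      18.180'

# Assigning to $5 makes awk rebuild $0, joining the fields with OFS
# (a single space by default), so the column alignment is lost.
rebuilt=$(printf '%s\n' "$line" | awk '{ $5 = "Y" } 1')

# sub() with no third argument edits $0 in place, so nothing is rebuilt.
kept=$(printf '%s\n' "$line" | awk '{ sub(/ X /, " Y ") } 1')

printf '%s\n%s\n' "$rebuilt" "$kept"
```

The first printed line has single spaces between the fields, and the second keeps the original column layout.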

But in your case, it is also easy to treat each individual character as a field and preserve the whitespace exactly with:

awk '$14$15$16 == "OXT" {c=!c} {sub(/X/, c ? "Y" : "X",$22)}1' FS= OFS= input

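This will not work if the alignment of column 3 moves OXT out of character columns 14-16, but it may work for you. Note that splitting on an empty FS is not specified by POSIX, although GNU awk and several other implementations support it.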
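
A variant of the same fixed-column idea can be sketched with substr(), which does not rely on an empty FS. It assumes, as above, that the atom name sits in character columns 14-16 and the chain identifier in column 22. The three-line sample is shortened from the question's data, and the variable names are only for illustration:

```shell
# Sample in the same fixed-column layout as the question (lines shortened).
sample='ATOM   1743  O   VAL X 121      19.712
ATOM   1744  OXT VAL X 121      18.180
ATOM   1745  N   CYS X 122       3.096'

out=$(printf '%s\n' "$sample" | awk '
  substr($0, 14, 3) == "OXT" { c = !c }   # toggle at each OXT, as in the question
  c && substr($0, 22, 1) == "X" {         # only character column 22 is touched
    $0 = substr($0, 1, 21) "Y" substr($0, 23)
  }
  1')

printf '%s\n' "$out"
```

Because only $0 is reassigned and no individual field is, the remaining columns keep their original widths.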

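With GNU awk's match function you can make the substitution and also keep the spacing exactly as it was (written and tested with the shown samples only).

The original answer linked to an online demo of the regex used in this solution.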

awk '
match($0,/^([^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+)([^[:space:]]+)([[:space:]]+[^[:space:]]+[[:space:]]+)([^[:space:]]+)(.*)$/,arr){
if(arr[2]=="OXT"){ arr[4]="Y" }
print arr[1] arr[2] arr[3] arr[4] arr[5]
}
' Input_file

Explanation: a detailed breakdown of the regex used:

^([^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+) ##1st capturing group: from the start of the line, 1 or more non-spaces followed by
##spaces, then 1 or more non-spaces followed by spaces.
##It captures e.g. "ATOM   1737 " here.
([^[:space:]]+)                                       ##2nd capturing group: 1 or more non-spaces; this is the part that
##is checked for OXT as per the requirement.
([[:space:]]+[^[:space:]]+[[:space:]]+)               ##3rd capturing group: spaces followed by non-spaces followed by spaces.
([^[:space:]]+)                                       ##4th capturing group: 1 or more non-spaces; this is the value that
##is changed depending on the 2nd capturing group.
(.*)$                                                 ##5th capturing group: everything else up to the end of the line.

Using GNU awk FIELDWIDTHS (assuming your input is fixed-width, as in the sample):

awk -v FIELDWIDTHS='12 5 4 1 *' '$2==" OXT "{f=1} f{$4="Y"} {print $1 $2 $3 $4 $5}'

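The first field is 12 characters wide, the second is 5 characters, and so on. The * means that all remaining characters are assigned to that field.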

Note that I used f=1 rather than f=!f, because you appear to want every X changed to Y once OXT has been found.
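
The difference between the two flag styles only shows up when OXT occurs more than once. A small sketch, using a made-up five-record sample with two OXT lines (not taken from the question), prints the serial numbers of the records each style would select:

```shell
# Hypothetical sample: OXT appears on records 2 and 4.
sample='ATOM 1 C   VAL X
ATOM 2 OXT VAL X
ATOM 3 N   CYS X
ATOM 4 OXT CYS X
ATOM 5 N   GLY X'

# f = 1 latches: every record from the first OXT onward is selected.
latch=$(printf '%s\n' "$sample" | awk '$3 == "OXT" { f = 1 } f { print $2 }')

# f = !f toggles: the second OXT switches the selection off again.
toggle=$(printf '%s\n' "$sample" | awk '$3 == "OXT" { f = !f } f { print $2 }')

printf 'latch:  %s\n' "$(printf '%s' "$latch" | tr '\n' ' ')"
printf 'toggle: %s\n' "$(printf '%s' "$toggle" | tr '\n' ' ')"
```

With the latch, records 2 through 5 are selected. With the toggle, only records 2 and 3 are, since the OXT on record 4 turns the flag off before that record is tested.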

Using sed, if applicable:

$ sed -E 's/(([^ ]*( +|\t)){2}OXT ([^ ]*( +|\t)))X/\1Y/' input_file

To get the shown output from the shown input, just use any awk:

$ awk '$3=="OXT" { sub(/ X /," Y ") } 1' file
ATOM   1737 HG13 VAL X 121      21.938  -9.234  -0.977  0.00  0.00      SYST
ATOM   1738  CG2 VAL X 121      21.679  -7.988   1.521  0.00  0.00      SYST
ATOM   1739 HG21 VAL X 121      22.611  -7.674   1.050  0.00  0.00      SYST
ATOM   1740 HG22 VAL X 121      21.340  -7.213   2.207  0.00  0.00      SYST
ATOM   1741 HG23 VAL X 121      21.863  -8.892   2.102  0.00  0.00      SYST
ATOM   1742  C   VAL X 121      19.373  -7.193  -1.494  1.00  0.00      SYST
ATOM   1743  O   VAL X 121      19.712  -7.180  -2.665  1.00  0.00      SYST
ATOM   1744  OXT VAL Y 121      18.180  -7.240  -1.203  0.00  0.00      SYST
ATOM   1745  N   CYS X 122       3.096  -0.678 -19.522  0.00  0.00      SYST
ATOM   1746  H1  CYS X 122       2.977   0.322 -19.592  0.00  0.00      SYST
ATOM   1747  H2  CYS X 122       2.198  -1.101 -19.340  0.00  0.00      SYST
ATOM   1748  H3  CYS X 122       3.654  -0.993 -20.303  0.00  0.00      SYST
ATOM   1749  CZ  CYS X 122       3.913  -0.961 -18.319  0.00  0.00      SYST
ATOM   1750  HA  CYS X 122       3.361  -1.596 -17.626  0.00  0.00      SYST

Or, if you need to handle cases other than the one shown in the example, then this may be what you need, using GNU awk for the 3rd arg to match() and for \S/\s:

$ awk '($3=="OXT") && match($0,/((\S+\s+){4}).(.*)/,a) { $0=a[1] "Y" a[3] } 1' file
ATOM   1737 HG13 VAL X 121      21.938  -9.234  -0.977  0.00  0.00      SYST
ATOM   1738  CG2 VAL X 121      21.679  -7.988   1.521  0.00  0.00      SYST
ATOM   1739 HG21 VAL X 121      22.611  -7.674   1.050  0.00  0.00      SYST
ATOM   1740 HG22 VAL X 121      21.340  -7.213   2.207  0.00  0.00      SYST
ATOM   1741 HG23 VAL X 121      21.863  -8.892   2.102  0.00  0.00      SYST
ATOM   1742  C   VAL X 121      19.373  -7.193  -1.494  1.00  0.00      SYST
ATOM   1743  O   VAL X 121      19.712  -7.180  -2.665  1.00  0.00      SYST
ATOM   1744  OXT VAL Y 121      18.180  -7.240  -1.203  0.00  0.00      SYST
ATOM   1745  N   CYS X 122       3.096  -0.678 -19.522  0.00  0.00      SYST
ATOM   1746  H1  CYS X 122       2.977   0.322 -19.592  0.00  0.00      SYST
ATOM   1747  H2  CYS X 122       2.198  -1.101 -19.340  0.00  0.00      SYST
ATOM   1748  H3  CYS X 122       3.654  -0.993 -20.303  0.00  0.00      SYST
ATOM   1749  CZ  CYS X 122       3.913  -0.961 -18.319  0.00  0.00      SYST
ATOM   1750  HA  CYS X 122       3.361  -1.596 -17.626  0.00  0.00      SYST

Or, using any POSIX awk, and assuming the whitespace between fields is blanks, since if it were tabs we would not need to do any of this:

$ awk '($3=="OXT") && match($0,/([^ ]+ +){4}/) { $0=substr($0,1,RLENGTH) "Y" substr($0,RLENGTH+2) } 1' file
ATOM   1737 HG13 VAL X 121      21.938  -9.234  -0.977  0.00  0.00      SYST
ATOM   1738  CG2 VAL X 121      21.679  -7.988   1.521  0.00  0.00      SYST
ATOM   1739 HG21 VAL X 121      22.611  -7.674   1.050  0.00  0.00      SYST
ATOM   1740 HG22 VAL X 121      21.340  -7.213   2.207  0.00  0.00      SYST
ATOM   1741 HG23 VAL X 121      21.863  -8.892   2.102  0.00  0.00      SYST
ATOM   1742  C   VAL X 121      19.373  -7.193  -1.494  1.00  0.00      SYST
ATOM   1743  O   VAL X 121      19.712  -7.180  -2.665  1.00  0.00      SYST
ATOM   1744  OXT VAL Y 121      18.180  -7.240  -1.203  0.00  0.00      SYST
ATOM   1745  N   CYS X 122       3.096  -0.678 -19.522  0.00  0.00      SYST
ATOM   1746  H1  CYS X 122       2.977   0.322 -19.592  0.00  0.00      SYST
ATOM   1747  H2  CYS X 122       2.198  -1.101 -19.340  0.00  0.00      SYST
ATOM   1748  H3  CYS X 122       3.654  -0.993 -20.303  0.00  0.00      SYST
ATOM   1749  CZ  CYS X 122       3.913  -0.961 -18.319  0.00  0.00      SYST
ATOM   1750  HA  CYS X 122       3.361  -1.596 -17.626  0.00  0.00      SYST

If tabs may be present, just change [^ ]+ + to [^[:space:]]+[[:space:]]+ in the POSIX version; the gawk version already handles them.
