I have a pdb file that looks like this:
ATOM 1737 HG13 VAL X 121 21.938 -9.234 -0.977 0.00 0.00 SYST
ATOM 1738 CG2 VAL X 121 21.679 -7.988 1.521 0.00 0.00 SYST
ATOM 1739 HG21 VAL X 121 22.611 -7.674 1.050 0.00 0.00 SYST
ATOM 1740 HG22 VAL X 121 21.340 -7.213 2.207 0.00 0.00 SYST
ATOM 1741 HG23 VAL X 121 21.863 -8.892 2.102 0.00 0.00 SYST
ATOM 1742 C VAL X 121 19.373 -7.193 -1.494 1.00 0.00 SYST
ATOM 1743 O VAL X 121 19.712 -7.180 -2.665 1.00 0.00 SYST
ATOM 1744 OXT VAL X 121 18.180 -7.240 -1.203 0.00 0.00 SYST
ATOM 1745 N CYS X 122 3.096 -0.678 -19.522 0.00 0.00 SYST
ATOM 1746 H1 CYS X 122 2.977 0.322 -19.592 0.00 0.00 SYST
ATOM 1747 H2 CYS X 122 2.198 -1.101 -19.340 0.00 0.00 SYST
ATOM 1748 H3 CYS X 122 3.654 -0.993 -20.303 0.00 0.00 SYST
ATOM 1749 CZ CYS X 122 3.913 -0.961 -18.319 0.00 0.00 SYST
ATOM 1750 HA CYS X 122 3.361 -1.596 -17.626 0.00 0.00 SYST
Whenever "OXT" is found in the 3rd field, I am trying to change the "X" in the 5th field to "Y". I wrote the following awk command:
awk '$3 == "OXT" {check=!check}check{sub(/X/,"Y",$5)}1' 1vwetest.pdb > 1vwetestoutput.pdb
However, this changes the formatting of my input file, like this:
ATOM 1737 HG13 VAL X 121 21.938 -9.234 -0.977 0.00 0.00 SYST
ATOM 1738 CG2 VAL X 121 21.679 -7.988 1.521 0.00 0.00 SYST
ATOM 1739 HG21 VAL X 121 22.611 -7.674 1.050 0.00 0.00 SYST
ATOM 1740 HG22 VAL X 121 21.340 -7.213 2.207 0.00 0.00 SYST
ATOM 1741 HG23 VAL X 121 21.863 -8.892 2.102 0.00 0.00 SYST
ATOM 1742 C VAL X 121 19.373 -7.193 -1.494 1.00 0.00 SYST
ATOM 1743 O VAL X 121 19.712 -7.180 -2.665 1.00 0.00 SYST
ATOM 1744 OXT VAL Y 121 18.180 -7.240 -1.203 0.00 0.00 SYST
ATOM 1745 N CYS Y 122 3.096 -0.678 -19.522 0.00 0.00 SYST
ATOM 1746 H1 CYS Y 122 2.977 0.322 -19.592 0.00 0.00 SYST
ATOM 1747 H2 CYS Y 122 2.198 -1.101 -19.340 0.00 0.00 SYST
ATOM 1748 H3 CYS Y 122 3.654 -0.993 -20.303 0.00 0.00 SYST
ATOM 1749 CZ CYS Y 122 3.913 -0.961 -18.319 0.00 0.00 SYST
ATOM 1750 HA CYS Y 122 3.361 -1.596 -17.626 0.00 0.00 SYST
How can I preserve the column widths after replacing the value? Or is there another way to do this?
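awk doesn't really care about the amount of whitespace, and neither should you. Rather than trying to match the input exactly, set the output field separator to a tab. For example:
awk '$3 == "OXT" {c=!c} {sub(/X/, c ? "Y" : "X",$5)}1' OFS='\t' input
The catch is that OFS only takes effect on records whose fields are modified, so the substitution has to touch every line (which is why X is replaced with X while the flag is off), but that is not much of a problem.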
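To see why the spacing changes in the first place, here is a minimal sketch with a made-up one-line input (the variable names are only for illustration). As soon as a field is assigned, awk rebuilds $0 by joining the fields with OFS, which is a single blank by default.

```shell
# Hypothetical one-line input; printf pads the fields with runs of blanks.
line=$(printf '%-6s%5s  %-4s%3s %s%4s' ATOM 1744 OXT VAL X 121)

# No field is assigned: $0 is printed exactly as it was read.
untouched=$(printf '%s\n' "$line" | awk '1')

# Assigning to $5 makes awk rebuild $0, joining the fields with OFS,
# so the original runs of blanks collapse to single blanks.
rebuilt=$(printf '%s\n' "$line" | awk '{ $5 = "Y" } 1')

printf '%s\n%s\n' "$untouched" "$rebuilt"
```

The first printed line keeps its padding; the second comes out as single-blank-separated fields.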
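But in your case, it is also easy to treat each individual character as a field and preserve the whitespace exactly:
awk '$14$15$16 == "OXT" {c=!c} {sub(/X/, c ? "Y" : "X",$22)}1' FS= OFS= input
This will not work if the alignment of the third column moves OXT out of character columns 14-16, but it may work for you.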
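The same fixed-position idea can also be written with substr(), which avoids relying on an empty FS splitting the record into characters (POSIX does not specify that behaviour). This is only a sketch: it assumes the atom name sits in character columns 14-16 and the chain identifier is the single character in column 22, as in the command above, and relabel_chain is a hypothetical helper name.

```shell
# Sketch only: assumes the atom name is in character columns 14-16 and
# the chain identifier is the single character in column 22.
# relabel_chain is a hypothetical helper name.
relabel_chain() {
    awk '
        substr($0, 14, 3) == "OXT" { c = !c }   # flip the flag at each OXT
        c && substr($0, 22, 1) == "X" {
            # splice Y into column 22; every other character is kept
            $0 = substr($0, 1, 21) "Y" substr($0, 23)
        }
        1
    ' "$@"
}

# Usage: relabel_chain input > output
```

Because $0 is edited as a string and no field is ever assigned, awk never rebuilds the record, so the spacing is left alone.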
With GNU awk's match function you can do the substitution and keep the same spacing as before (written and tested with the shown samples only).
awk '
match($0,/^([^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+)([^[:space:]]+)([[:space:]]+[^[:space:]]+[[:space:]]+)([^[:space:]]+)(.*)$/,arr){
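if(arr[2]=="OXT"){ arr[4]="Y" }   ##OXT in the 3rd field: set the 5th field to Y
print arr[1] arr[2] arr[3] arr[4] arr[5]
next
}
1   ##print lines that do not match the regex (fewer than 5 fields) unchanged
' Input_file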
Explanation: a detailed breakdown of the regex used:
^([^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+) ##1st capturing group: from the start of the line, 1 or more non-spaces followed by
##spaces, then 1 or more non-spaces followed by spaces.
##Basically it captures e.g. (ATOM 1737 ) here.
([^[:space:]]+) ##2nd capturing group: 1 or more non-spaces; this is the part that
##needs to be checked for OXT as per the requirement.
([[:space:]]+[^[:space:]]+[[:space:]]+) ##3rd capturing group: spaces followed by non-spaces followed by spaces.
([^[:space:]]+) ##4th capturing group: 1 or more non-spaces; this is the value
##that is changed depending on the 2nd capturing group.
(.*)$ ##5th capturing group: everything else up to the end of the line.
Using GNU awk and FIELDWIDTHS (assuming your input is fixed-width, as in the sample):
awk -v FIELDWIDTHS='12 5 4 1 *' '$2==" OXT "{f=1} f{$4="Y"} {print $1 $2 $3 $4 $5}'
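The first field is 12 characters wide, the second is 5 characters wide, and so on. * means the remaining characters are assigned to that field.
Note that I used f=1 instead of f=!f, because you seem to want to change every X to Y once OXT has been found.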
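The difference between the two flag styles only shows up when OXT occurs more than once. Here is a small sketch on made-up three-column input (sample, latch and toggle are hypothetical names, and plain awk is enough for this part):

```shell
# Hypothetical input: OXT appears on the 2nd and 4th lines.
sample() { printf '%s\n' 'a N X' 'b OXT X' 'c N X' 'd OXT X' 'e N X'; }

# Latch: once set, the flag stays on for the rest of the file.
latch()  { awk '$2 == "OXT" { f = 1 }  { print $1, (f ? "Y" : "X") }'; }

# Toggle: every OXT flips the flag, so the second OXT switches it off.
toggle() { awk '$2 == "OXT" { f = !f } { print $1, (f ? "Y" : "X") }'; }

sample | latch     # a X, then Y on every line from b onwards
sample | toggle    # a X, b Y, c Y, then X again from d onwards
```

Which one you want depends on whether a second OXT should start a new relabelled block or end the current one.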
Using sed, if applicable:
$ sed -E 's/(([^ ]*( +|\t)){2}OXT ([^ ]*( +|\t)))X/\1Y/' input_file
To get the shown output from the shown input, just use any awk:
$ awk '$3=="OXT" { sub(/ X /," Y ") } 1' file
ATOM 1737 HG13 VAL X 121 21.938 -9.234 -0.977 0.00 0.00 SYST
ATOM 1738 CG2 VAL X 121 21.679 -7.988 1.521 0.00 0.00 SYST
ATOM 1739 HG21 VAL X 121 22.611 -7.674 1.050 0.00 0.00 SYST
ATOM 1740 HG22 VAL X 121 21.340 -7.213 2.207 0.00 0.00 SYST
ATOM 1741 HG23 VAL X 121 21.863 -8.892 2.102 0.00 0.00 SYST
ATOM 1742 C VAL X 121 19.373 -7.193 -1.494 1.00 0.00 SYST
ATOM 1743 O VAL X 121 19.712 -7.180 -2.665 1.00 0.00 SYST
ATOM 1744 OXT VAL Y 121 18.180 -7.240 -1.203 0.00 0.00 SYST
ATOM 1745 N CYS X 122 3.096 -0.678 -19.522 0.00 0.00 SYST
ATOM 1746 H1 CYS X 122 2.977 0.322 -19.592 0.00 0.00 SYST
ATOM 1747 H2 CYS X 122 2.198 -1.101 -19.340 0.00 0.00 SYST
ATOM 1748 H3 CYS X 122 3.654 -0.993 -20.303 0.00 0.00 SYST
ATOM 1749 CZ CYS X 122 3.913 -0.961 -18.319 0.00 0.00 SYST
ATOM 1750 HA CYS X 122 3.361 -1.596 -17.626 0.00 0.00 SYST
Or, if you need to handle cases other than the ones shown in the sample, this may be what you need, using GNU awk for the third arg to match() and for \S/\s:
$ awk '($3=="OXT") && match($0,/((\S+\s+){4}).(.*)/,a) { $0=a[1] "Y" a[3] } 1' file
ATOM 1737 HG13 VAL X 121 21.938 -9.234 -0.977 0.00 0.00 SYST
ATOM 1738 CG2 VAL X 121 21.679 -7.988 1.521 0.00 0.00 SYST
ATOM 1739 HG21 VAL X 121 22.611 -7.674 1.050 0.00 0.00 SYST
ATOM 1740 HG22 VAL X 121 21.340 -7.213 2.207 0.00 0.00 SYST
ATOM 1741 HG23 VAL X 121 21.863 -8.892 2.102 0.00 0.00 SYST
ATOM 1742 C VAL X 121 19.373 -7.193 -1.494 1.00 0.00 SYST
ATOM 1743 O VAL X 121 19.712 -7.180 -2.665 1.00 0.00 SYST
ATOM 1744 OXT VAL Y 121 18.180 -7.240 -1.203 0.00 0.00 SYST
ATOM 1745 N CYS X 122 3.096 -0.678 -19.522 0.00 0.00 SYST
ATOM 1746 H1 CYS X 122 2.977 0.322 -19.592 0.00 0.00 SYST
ATOM 1747 H2 CYS X 122 2.198 -1.101 -19.340 0.00 0.00 SYST
ATOM 1748 H3 CYS X 122 3.654 -0.993 -20.303 0.00 0.00 SYST
ATOM 1749 CZ CYS X 122 3.913 -0.961 -18.319 0.00 0.00 SYST
ATOM 1750 HA CYS X 122 3.361 -1.596 -17.626 0.00 0.00 SYST
Or using any POSIX awk, assuming the whitespace between fields is blanks, since if it were tabs we would not need to do any of this:
$ awk '($3=="OXT") && match($0,/([^ ]+ +){4}/) { $0=substr($0,1,RLENGTH) "Y" substr($0,RLENGTH+2) } 1' file
ATOM 1737 HG13 VAL X 121 21.938 -9.234 -0.977 0.00 0.00 SYST
ATOM 1738 CG2 VAL X 121 21.679 -7.988 1.521 0.00 0.00 SYST
ATOM 1739 HG21 VAL X 121 22.611 -7.674 1.050 0.00 0.00 SYST
ATOM 1740 HG22 VAL X 121 21.340 -7.213 2.207 0.00 0.00 SYST
ATOM 1741 HG23 VAL X 121 21.863 -8.892 2.102 0.00 0.00 SYST
ATOM 1742 C VAL X 121 19.373 -7.193 -1.494 1.00 0.00 SYST
ATOM 1743 O VAL X 121 19.712 -7.180 -2.665 1.00 0.00 SYST
ATOM 1744 OXT VAL Y 121 18.180 -7.240 -1.203 0.00 0.00 SYST
ATOM 1745 N CYS X 122 3.096 -0.678 -19.522 0.00 0.00 SYST
ATOM 1746 H1 CYS X 122 2.977 0.322 -19.592 0.00 0.00 SYST
ATOM 1747 H2 CYS X 122 2.198 -1.101 -19.340 0.00 0.00 SYST
ATOM 1748 H3 CYS X 122 3.654 -0.993 -20.303 0.00 0.00 SYST
ATOM 1749 CZ CYS X 122 3.913 -0.961 -18.319 0.00 0.00 SYST
ATOM 1750 HA CYS X 122 3.361 -1.596 -17.626 0.00 0.00 SYST
If tabs may be present, just change [^ ]+ + to [^[:space:]]+[[:space:]]+ in the POSIX version; the gawk version already handles them.
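For reference, here is a sketch of what a tab-tolerant variant could look like. It uses the [[:space:]] bracket expressions mentioned above. As a further assumption on my part, it steps over the four leading fields in a loop rather than with {4}, since some older awk builds do not implement interval expressions. oxt_line_to_y is a hypothetical helper name.

```shell
# Sketch: walk past the first four fields one at a time instead of
# using {4}. oxt_line_to_y is a hypothetical helper name.
oxt_line_to_y() {
    awk '
        $3 == "OXT" && NF >= 5 {
            head = ""; rest = $0
            for (i = 1; i <= 4; i++) {
                # one field plus the whitespace around it
                match(rest, /^[[:space:]]*[^[:space:]]+[[:space:]]+/)
                head = head substr(rest, 1, RLENGTH)
                rest = substr(rest, RLENGTH + 1)
            }
            # rest now starts at field 5; swap its first character for Y
            $0 = head "Y" substr(rest, 2)
        }
        1
    ' "$@"
}

# Usage: oxt_line_to_y file
```

As in the commands above, only the OXT line itself is changed, and the separators (blanks or tabs) are copied through untouched.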