How can I use a string as RS in AWK?



I want to use AWK, but I can't seem to get the first record right. I hope someone can help me get this working.

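I have a file in which each record is 3 lines long, but sometimes 4 (so there is a $3 and sometimes a $4). My goal is to print all three lines of every record and, if there is a fourth line, to also print the first 2 lines together with the fourth line (without the 3rd line).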

My strategy is to use a string ("Sequence: ") as RS and a newline ("\n") as FS.

My file looks like this:

Sequence: X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
Start     End  Strand Pattern                 Mismatch Sequence
184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Sequence: X92273_IGHV4-31*09_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
Start     End  Strand Pattern                 Mismatch Sequence
184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Sequence: Z14235_IGHV4-31*10_Homosapiens_F_V-REGION_140..438_299nt_1_____299+0=299___     from: 1   to: 299
Start     End  Strand Pattern                 Mismatch Sequence
184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Sequence: AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
150     158       + pattern:AA[CT]NNN[AT]CN        . aatcaatca
178     186       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Sequence: M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
150     158       + pattern:AA[CT]NNN[AT]CN        . aatcaatca
178     186       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc

With the following code I get a messed-up first record, because the string also appears at the very beginning of the file.

awk '{ RS="Sequence: "; FS="\n" }
{
if ($4 != "" )
print $1,"\n",$2,"\n",$3,"\n",$1,"\n",$2,"\n",$4
else
print $1,"\n",$2,"\n",$3 ;
}' short.txt > test

This gives the output:

Sequence:
X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__
from:
Sequence:
X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__
1
Start     End  Strand Pattern                 Mismatch Sequence
184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
X92273_IGHV4-31*09_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
Start     End  Strand Pattern                 Mismatch Sequence
184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Z14235_IGHV4-31*10_Homosapiens_F_V-REGION_140..438_299nt_1_____299+0=299___     from: 1   to: 299
Start     End  Strand Pattern                 Mismatch Sequence
184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
150     158       + pattern:AA[CT]NNN[AT]CN        . aatcaatca
AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
178     186       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
150     158       + pattern:AA[CT]NNN[AT]CN        . aatcaatca
M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
178     186       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc

So I thought I should remove the first "Sequence: " string from the input file, but that gives:

X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__
from:
1
X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__
from:
to:
Start     End  Strand Pattern                 Mismatch Sequence
184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
X92273_IGHV4-31*09_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
Start     End  Strand Pattern                 Mismatch Sequence
184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Z14235_IGHV4-31*10_Homosapiens_F_V-REGION_140..438_299nt_1_____299+0=299___     from: 1   to: 299
Start     End  Strand Pattern                 Mismatch Sequence
184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
150     158       + pattern:AA[CT]NNN[AT]CN        . aatcaatca
AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
178     186       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
150     158       + pattern:AA[CT]NNN[AT]CN        . aatcaatca
M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
178     186       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc

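So the first record is messed up again. Is there a solution to this problem? My expected output is the last output above (with or without the string "Sequence: "), but with the first record correct.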

It sounds like this is what you're trying to do:

$ cat tst.awk
/^Sequence/ { if (NR>1) prt() }
{ rec[++cnt] = $0 }
END { prt() }
function prt() {
    print rec[1] ORS rec[2] ORS rec[3]
    if (cnt == 4) {
        print rec[1] ORS rec[2] ORS rec[4]
    }
    cnt=0
}
$ awk -f tst.awk file
Sequence: X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
Start     End  Strand Pattern                 Mismatch Sequence
184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Sequence: X92273_IGHV4-31*09_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
Start     End  Strand Pattern                 Mismatch Sequence
184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Sequence: Z14235_IGHV4-31*10_Homosapiens_F_V-REGION_140..438_299nt_1_____299+0=299___     from: 1   to: 299
Start     End  Strand Pattern                 Mismatch Sequence
184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Sequence: AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
150     158       + pattern:AA[CT]NNN[AT]CN        . aatcaatca
Sequence: AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
178     186       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Sequence: M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
150     158       + pattern:AA[CT]NNN[AT]CN        . aatcaatca
Sequence: M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
178     186       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc

Trying to use RS for this just makes your life harder, and the resulting code is not portable (gawk only).

Your code can easily be fixed as:

BEGIN{ RS="Sequence: "; FS="\n" }
(NR==1){next}
{
if ($4 != "" )
print $1,"\n",$2,"\n",$3,"\n",$1,"\n",$2,"\n",$4
else
print $1,"\n",$2,"\n",$3 ;
}

The first record will be empty, which is why it is skipped with next.

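The reason you had trouble with the first record is that you defined RS and FS after the first record had already been read (i.e. not in the BEGIN block, which runs before any input is processed).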
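To see the timing issue in isolation, here is a minimal sketch. The two-line sample and the ";" separator are made up for illustration (they are not from the question); a single-character RS is used so that it should behave the same in any POSIX awk.

```shell
# Hypothetical sample input: two lines, with ';' standing in for the real separator.
sample='a;b
c;d'

# RS assigned in a main rule: record 1 has already been split on the
# default newline by the time the assignment runs.
late=$(printf '%s\n' "$sample" | awk '{ RS = ";" } NR == 1 { print; exit }')

# RS assigned in BEGIN: the new separator is in effect before any input is read.
early=$(printf '%s\n' "$sample" | awk 'BEGIN { RS = ";" } NR == 1 { print; exit }')

echo "late:  $late"    # late:  a;b
echo "early: $early"   # early: a
```

In the "late" case the first record is still the whole first line, which is the same effect that mangled the first record in the question.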

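But what you really want, just to be safe, is RS="(^|\n)Sequence: ", which makes sure the separator only matches at the start of a line or at the start of the file.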
