我想使用AWK,但我似乎没有得到正确的第一条记录。我希望任何人都可以帮助把它做好。
我有这个文件,每条记录是 3 行,但有时它有 4 行(所以有 3 美元和 4 美元(。我的目标是打印每条记录的所有三行,如果有第四行,我还想打印带有第四行的前 2 行(没有第 3 行(。
我的策略是使用字符串("序列:"(作为 RS,并使用新行 (""( 作为 FS。
我的文件如下所示:
Sequence: X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__ from: 1 to: 290
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Sequence: X92273_IGHV4-31*09_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__ from: 1 to: 290
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Sequence: Z14235_IGHV4-31*10_Homosapiens_F_V-REGION_140..438_299nt_1_____299+0=299___ from: 1 to: 299
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Sequence: AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
150 158 + pattern:AA[CT]NNN[AT]CN . aatcaatca
178 186 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Sequence: M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
150 158 + pattern:AA[CT]NNN[AT]CN . aatcaatca
178 186 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
使用以下代码,我得到了一条混乱的第一条记录,因为字符串也在文件的开头。
awk '{ RS="Sequence: "; FS="n" }
{
if ($4 != "" )
print $1,"n",$2,"n",$3,"n",$1,"n",$2,"n",$4
else
print $1,"n",$2,"n",$3 ;
}' short.txt > test
带输出:
Sequence:
X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__
from:
Sequence:
X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__
1
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
X92273_IGHV4-31*09_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__ from: 1 to: 290
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Z14235_IGHV4-31*10_Homosapiens_F_V-REGION_140..438_299nt_1_____299+0=299___ from: 1 to: 299
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
150 158 + pattern:AA[CT]NNN[AT]CN . aatcaatca
AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
178 186 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
150 158 + pattern:AA[CT]NNN[AT]CN . aatcaatca
M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
178 186 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
所以我想我应该从输入文件中删除第一个"Sequence:"字符串,但这给出了:
X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__
from:
1
X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__
from:
to:
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
X92273_IGHV4-31*09_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__ from: 1 to: 290
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Z14235_IGHV4-31*10_Homosapiens_F_V-REGION_140..438_299nt_1_____299+0=299___ from: 1 to: 299
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
150 158 + pattern:AA[CT]NNN[AT]CN . aatcaatca
AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
178 186 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
150 158 + pattern:AA[CT]NNN[AT]CN . aatcaatca
M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
178 186 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
所以第一张唱片又搞砸了。这个问题有解决方案吗?我的预期输出是最后一个输出(有或没有字符串"序列:"(,但第一条记录正确。
听起来这就是您要做的:
$ cat tst.awk
/^Sequence/ { if (NR>1) prt() }
{ rec[++cnt] = $0 }
END { prt() }
function prt() {
print rec[1] ORS rec[2] ORS rec[3]
if (cnt == 4) {
print rec[1] ORS rec[2] ORS rec[4]
}
cnt=0
}
$ awk -f tst.awk file
Sequence: X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__ from: 1 to: 290
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Sequence: X92273_IGHV4-31*09_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__ from: 1 to: 290
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Sequence: Z14235_IGHV4-31*10_Homosapiens_F_V-REGION_140..438_299nt_1_____299+0=299___ from: 1 to: 299
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Sequence: AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
150 158 + pattern:AA[CT]NNN[AT]CN . aatcaatca
Sequence: AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
178 186 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Sequence: M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
150 158 + pattern:AA[CT]NNN[AT]CN . aatcaatca
Sequence: M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
178 186 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
尝试为此使用 RS 只会让您的生活更加艰难,并且生成的代码不可移植(仅限 gawk(
您的代码可以轻松修复为:
BEGIN{ RS="Sequence: "; FS="n" }
(NR==1){next}
{
if ($4 != "" )
print $1,"n",$2,"n",$3,"n",$1,"n",$2,"n",$4
else
print $1,"n",$2,"n",$3 ;
}
第一条记录将为空,这就是为什么它被跳过的原因next
.
您在第一条记录中遇到问题的原因是您在读取第一条记录后定义了RS
和FS
(即不在BEGIN
块中,该块在完成任何操作之前发生(
但你真正想要的,只是为了确定,RS="(^|n)Sequence: "
这只是为了确保它从行或文件的开头开始。