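I have two files, 1 and 2. File 1 contains all the details of the metabolic pathways, in lines that start with C and D, and it has a large number of C and D lines. File 2 contains only specific ID lines that start with C, each of them unique (a shortlist of C entries, far fewer in number). The files look like this:
File 1: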
C 00010 Glycolysis / Gluconeogenesis [PATH:smup00010]
D SMPSPU_277 pfkA; 6-phosphofructokinase K00850 pfkA; 6-phosphofructokinase 1 [EC:2.7.1.11]
D SMPSPU_278 gapA; glyceraldehyde 3-phosphate dehydrogenase K00134 GAPDH; glyceraldehyde 3-phosphate dehydrogenase [EC:1.2.1.12]
D SMPSPU_274 acoA; pyruvate dehydrogenase E1 component subunit alpha K00161 PDHA; pyruvate dehydrogenase E1 component alpha subunit [EC:1.2.4.1]
D SMPSPU_172 korA; 2-oxoglutarate ferredoxin oxidoreductase subunit alpha K00174 korA; 2-oxoglutarate/2-oxoacid ferredoxin oxidoreductase subunit alpha [EC:1.2.7.3 1.2.7.11]
D SMPSPU_061 korB; 2-oxoglutarate ferredoxin oxidoreductase subunit beta K00175 korB; 2-oxoglutarate/2-oxoacid ferredoxin oxidoreductase subunit beta [EC:1.2.7.3 1.2.7.11]
C 00020 Citrate cycle (TCA cycle) [PATH:smup00020]
D SMPSPU_201 sucA; 2-oxoglutarate dehydrogenase, E1 component K00164 OGDH; 2-oxoglutarate dehydrogenase E1 component [EC:1.2.4.2]
D SMPSPU_120 lpdA; dihydrolipoamide dehydrogenase K00382 DLD; dihydrolipoamide dehydrogenase [EC:1.8.1.4]
D SMPSPU_172 korA; 2-oxoglutarate ferredoxin oxidoreductase subunit alpha K00174 korA; 2-oxoglutarate/2-oxoacid ferredoxin oxidoreductase subunit alpha [EC:1.2.7.3 1.2.7.11]
D SMPSPU_169 sucD; succinyl-CoA synthetase subunit alpha K01902 sucD; succinyl-CoA synthetase alpha subunit [EC:6.2.1.5]
D SMPSPU_229 pdhB; pyruvate dehydrogenase E1 component subunit beta K00162 PDHB; pyruvate dehydrogenase E1 component beta subunit [EC:1.2.4.1]
D SMPSPU_275 pdhC; dihydrolipoamide acyltransferase E2 component K00627 DLAT; pyruvate dehydrogenase E2 component (dihydrolipoamide acetyltransferase) [EC:2.3.1.12]
C 00030 Pentose phosphate pathway [PATH:smup00030]
D SMPSPU_057 tktB; transketolase, N-terminal subunit K00615 E2.2.1.1; transketolase [EC:2.2.1.1]
D SMPSPU_058 tktA; transketolase, C-terminal subunit K00615 E2.2.1.1; transketolase [EC:2.2.1.1]
C 00051 Fructose and mannose metabolism [PATH:smup00051]
D SMPSPU_277 pfkA; 6-phosphofructokinase K00850 pfkA; 6-phosphofructokinase 1 [EC:2.7.1.11]
D SMPSPU_230 fbaA; fructose-bisphosphate aldolase K01624 FBA; fructose-bisphosphate aldolase, class II [EC:4.1.2.13]
File 2:
C 00261 Monobactam biosynthesis [PATH:smup00261]
C 00300 Lysine biosynthesis [PATH:smup00300]
C 00660 C5-Branched dibasic acid metabolism [PATH:smup00660]
C 00680 Methane metabolism [PATH:smup00680]
C 02020 Two-component system [PATH:smup02020]
C 02024 Quorum sensing [PATH:smup02024]
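Now I want to extract only those C entries that are present in file 2, together with their respective D lines.
I tried this script:
fgrep -f name-C-non-homowba00001 wba00001.keg |grep -E '^C.*PATH|^D' | less
but it only gave me back the C IDs and names from the file.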
Try this:
cat input | grep -E '^[CD]' | sed -n '/^C.*PATH/,/^C/p' | uniq -f2 | grep -E '^C.*PATH|^D'
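where:
- input is your file
- the first grep prints all lines that start with C or D
- sed prints all lines from a line that starts with C and contains PATH up to and including the next line that starts with C
- uniq suppresses adjacent lines that are identical apart from their first 2 fields
- the last grep prints all lines that start with C and contain PATH, or that start with D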
awk '$1!~/^D$/ { select=0; } $1=="C" && $NF~/PATH/ { select=1; } {if(select) print; }' inputfile
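Explanation:
$1!~/^D$/ { select=0; }
Any line other than a D line stops the output.
$1=="C" && $NF~/PATH/ { select=1; }
A C line whose last field contains PATH starts the output.
{if(select) print; }
If output is selected, print the current line.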
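To see the three rules acting together, here is a small sketch run against made-up input. The temporary file, the `A09100 Metabolism` header line and the `C 99980` / `SMPSPU_001` rows are hypothetical and exist only to show a block without PATH being dropped; the other rows are shortened copies of lines from the question.

```shell
# Hypothetical sample input, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
A09100 Metabolism
C 00010 Glycolysis / Gluconeogenesis [PATH:smup00010]
D SMPSPU_277 pfkA; 6-phosphofructokinase
C 99980 Enzymes with EC numbers
D SMPSPU_001 hypothetical entry
EOF

# Same logic as the one-liner above, spread over several lines.
out=$(awk '
  $1 !~ /^D$/                { select = 0 }   # any non-D line switches output off
  $1 == "C" && $NF ~ /PATH/  { select = 1 }   # a C line with PATH in its last field switches it on
  { if (select) print }                       # print only while output is switched on
' "$tmp")

printf '%s\n' "$out"
rm -f "$tmp"
```

With this input, only the `C 00010` line and the `D SMPSPU_277` line under it are printed. Note that this command reads a single file, so it keeps every C block that has a PATH tag, whether or not that block is listed in file 2.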
This is a safe way to do it:
awk '(NR==FNR){a[$0];next}/^C/{p=($0 in a)}p' file2 file1
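A minimal sketch of how that command behaves, using made-up data. The temporary directory and the array name `wanted` are assumptions for illustration, and the one-line file2 is hypothetical (the file 2 shown in the question lists different pathways). The file1 rows are shortened copies of lines from the question.

```shell
# Hypothetical sample files in a temporary directory.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
C 00010 Glycolysis / Gluconeogenesis [PATH:smup00010]
D SMPSPU_277 pfkA; 6-phosphofructokinase
C 00020 Citrate cycle (TCA cycle) [PATH:smup00020]
D SMPSPU_201 sucA; 2-oxoglutarate dehydrogenase, E1 component
D SMPSPU_120 lpdA; dihydrolipoamide dehydrogenase
C 00030 Pentose phosphate pathway [PATH:smup00030]
D SMPSPU_057 tktB; transketolase, N-terminal subunit
EOF
cat > "$tmp/file2" <<'EOF'
C 00020 Citrate cycle (TCA cycle) [PATH:smup00020]
EOF

# While reading file2 (NR==FNR): store each wanted C line as an array key.
# While reading file1: on every C line, set p to 1 if that exact line was in file2, else 0.
# The bare pattern "p" prints the current line whenever p is non-zero.
out=$(awk 'NR==FNR { wanted[$0]; next } /^C/ { p = ($0 in wanted) } p' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$out"
rm -rf "$tmp"
```

This prints the `C 00020` line followed by its two D lines and nothing else. Because the lookup compares whole lines, a C line in file2 has to match its counterpart in file1 character for character, including spacing.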