AWK: add information from file2 to a new column on the same line of file1 by matching a specific column



My two files look like this:

file1

NC_000001.11:g.100007038C>T
NC_000001.11:g.100007039C>A

file2

NC_000001.11:g.100007038C>T     NM_001271684.2:c.347C>T     NP_001258613.1:p.Thr116Met
NC_000001.11:g.100007038C>T     NM_001271685.2:c.473C>T     NP_001258614.1:p.Thr158Met
NC_000001.11:g.100007038C>T     NM_012243.3:c.347C>T        NP_036375.1:p.Thr116Met
NC_000001.11:g.100007039G>A     NM_001271684.2:c.348G>A     NP_001258613.1:p.Thr116%3D
NC_000001.11:g.100007039G>A     NM_001271685.2:c.474G>A     NP_001258614.1:p.Thr158%3D

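Desired output:

I want to match the first column of file2 against the first column of file1. If they match, I want to append the third column of file2 to file1 as a new column, to get:

file1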

NC_000001.11:g.100007038C>T     NP_001258613.1:p.Thr116Met, NP_001258614.1:p.Thr158Met, NP_036375.1:p.Thr116Met

This is my attempt:

awk 'BEGIN{ FS=OFS="\t" }
NR==FNR {a[$0]; next;}
{
    for (k in a) {
        if ($1 == k) {
            print $0 "\t" a[$3]
        }
    }
}' file1.txt file2.txt

But it does not produce the output I want:

NC_000001.11:g.100007038C>T     NM_001271684.2:c.347C>T     NP_001258613.1:p.Thr116Met
NC_000001.11:g.100007038C>T     NM_001271685.2:c.473C>T     NP_001258614.1:p.Thr158Met
NC_000001.11:g.100007038C>T     NM_012243.3:c.347C>T        NP_036375.1:p.Thr116Met
NC_000001.11:g.100007039G>A     NM_001271684.2:c.348G>A     NP_001258613.1:p.Thr116%3D
NC_000001.11:g.100007039G>A     NM_001271685.2:c.474G>A     NP_001258614.1:p.Thr158%3D

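Thanks in advance.

P.S. file1 contains unique entries. file2 is sorted, tab-delimited, and contains more than 3 million entries.

Edit:

By tab-delimited I mean that the new column is appended with a tab separator, while the values inside that column are comma-separated.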

Could you please try the following, written and tested with the samples shown.

awk '
BEGIN{
  OFS=", "
}
FNR==NR{
  array[$1]=(array[$1]?array[$1] OFS:"")$NF
  next
}
($1 in array){
  print $1"\t"array[$1]
}
'  Input_file2   Input_file1

If you want to save the output into Input_file1 itself, try the following:

awk '
BEGIN{
  OFS=", "
}
FNR==NR{
  array[$1]=(array[$1]?array[$1] OFS:"")$NF
  next
}
($1 in array){
  print $1"\t"array[$1]
}
'  Input_file2   Input_file1 > temp && mv temp Input_file1

Explanation: a detailed explanation of the code above.

awk '                                                ##Starting awk program from here.
BEGIN{                                               ##Starting BEGIN section of this awk program from here.
  OFS=", "                                           ##Setting OFS to comma plus space here.
}
FNR==NR{                                             ##Checking condition FNR==NR, which is true while Input_file2 is being read.
  array[$1]=(array[$1]?array[$1] OFS:"")$NF          ##Building array indexed by $1; each last field is appended to the existing value, separated by OFS.
  next                                               ##next skips all further statements for this line.
}
($1 in array){                                       ##If the 1st field of the current line (from Input_file1) is present in array, do the following.
  print $1"\t"array[$1]                              ##Printing first column, a TAB, and then the value of array with index $1.
}
'  Input_file2  Input_file1                          ##Mentioning Input_file names here.

Another awk. Your sample code says FS=OFS="\t", but your expected output uses ", ", so I used the latter between the appended values. Also, since only one entry in file1 has a match in file2, your expected output cannot be fulfilled.

$ awk '
BEGIN {
    FS=OFS="\t"                         # delims
}
NR==FNR {                               # process file1
    a[$0]                               # store the key only, to test file2 against it
    next
}
($1 in a) {                             # if found in file1
    a[$1]=a[$1] (a[$1]==""?"":", ") $3  # append to the corresponding item
}
END {                                   # in the end
    for(i in a)
        if(a[i]!="")                    # print all non-empty ones
            print i,a[i]
}' file1 file2

Output with your data:

NC_000001.11:g.100007038C>T     NP_001258613.1:p.Thr116Met, NP_001258614.1:p.Thr158Met, NP_036375.1:p.Thr116Met
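
Because `for (i in a)` visits keys in an unspecified order, the lines may not come out in file1 order when more than one key matches. If the order matters, one possible variant records the keys in a second array while file1 is read. This is a sketch, not part of the answer above. The array names `order` and `vals`, the scratch directory, and the tiny `keyA`/`keyB`/`keyC` sample data are made up for illustration.

```shell
# Scratch directory and sample data are hypothetical, for this demo only.
workdir=$(mktemp -d)
printf '%s\n' 'keyB' 'keyA' 'keyC' > "$workdir/file1"
printf '%s\t%s\t%s\n' \
  'keyA' 'tx1' 'prot1' \
  'keyA' 'tx2' 'prot2' \
  'keyB' 'tx3' 'prot3' > "$workdir/file2"

result=$(awk '
BEGIN { FS = OFS = "\t" }
NR == FNR {                      # file1: remember each key and its position
    order[++n] = $1
    vals[$1] = ""
    next
}
($1 in vals) {                   # file2: append column 3 for known keys only
    vals[$1] = vals[$1] (vals[$1] == "" ? "" : ", ") $3
}
END {                            # print in file1 order, skipping unmatched keys
    for (i = 1; i <= n; i++)
        if (vals[order[i]] != "")
            print order[i], vals[order[i]]
}' "$workdir/file1" "$workdir/file2")

printf '%s\n' "$result"
rm -rf "$workdir"
```

Here `keyB` is printed before `keyA`, matching file1, and `keyC` is skipped because it has no match in file2. Only the keys of file1 are held in memory, which may help when file2 has millions of lines.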
