AWK: add information from file2 to a new column on the same line of file1 by matching a specific column

My two files look like this:

file1

NC_000001.11:g.100007038C>T
NC_000001.11:g.100007039C>A

file2

NC_000001.11:g.100007038C>T     NM_001271684.2:c.347C>T     NP_001258613.1:p.Thr116Met
NC_000001.11:g.100007038C>T     NM_001271685.2:c.473C>T     NP_001258614.1:p.Thr158Met
NC_000001.11:g.100007038C>T     NM_012243.3:c.347C>T        NP_036375.1:p.Thr116Met
NC_000001.11:g.100007039G>A     NM_001271684.2:c.348G>A     NP_001258613.1:p.Thr116%3D
NC_000001.11:g.100007039G>A     NM_001271685.2:c.474G>A     NP_001258614.1:p.Thr158%3D

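Desired output:

I want to match the first column of file2 against the first column of file1. If they match, I want to append the third column of file2 to file1 as a new column, to get:

file1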

NC_000001.11:g.100007038C>T     NP_001258613.1:p.Thr116Met, NP_001258614.1:p.Thr158Met, NP_036375.1:p.Thr116Met

Here is my attempt:

awk 'BEGIN{ FS=OFS="\t" }
NR==FNR {a[$0]; next;}
{
    for (k in a) {
        if ($1 == k) {
            print $0 "\t" a[$3]
        }
    }
}' file1.txt file2.txt

But it does not produce the output I want:

NC_000001.11:g.100007038C>T     NM_001271684.2:c.347C>T     NP_001258613.1:p.Thr116Met
NC_000001.11:g.100007038C>T     NM_001271685.2:c.473C>T     NP_001258614.1:p.Thr158Met
NC_000001.11:g.100007038C>T     NM_012243.3:c.347C>T        NP_036375.1:p.Thr116Met
NC_000001.11:g.100007039G>A     NM_001271684.2:c.348G>A     NP_001258613.1:p.Thr116%3D
NC_000001.11:g.100007039G>A     NM_001271685.2:c.474G>A     NP_001258614.1:p.Thr158%3D

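Thanks in advance.

PS: file1 contains unique entries. file2 is sorted, tab-delimited, and contains more than 3 million entries.

Edit:

By tab-delimited I mean that the new column is appended with a tab as the separator, while the values inside that column are comma-separated.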

Could you please try the following, written and tested with the shown samples.

awk '
BEGIN{
  OFS=", "
}
FNR==NR{
  array[$1]=(array[$1]?array[$1] OFS:"")$NF
  next
}
($1 in array){
  print $1"\t"array[$1]
}
'  Input_file2   Input_file1
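To check the command above against the question's sample, here is a minimal, self-contained sketch. It assumes the sample files really are tab-delimited; the scratch directory and the `result` variable are my own additions for illustration and are not part of the answer.

```shell
# Recreate the question's sample as tab-delimited files in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' \
  'NC_000001.11:g.100007038C>T' \
  'NC_000001.11:g.100007039C>A' > "$workdir/Input_file1"
printf '%s\t%s\t%s\n' \
  'NC_000001.11:g.100007038C>T' 'NM_001271684.2:c.347C>T' 'NP_001258613.1:p.Thr116Met' \
  'NC_000001.11:g.100007038C>T' 'NM_001271685.2:c.473C>T' 'NP_001258614.1:p.Thr158Met' \
  'NC_000001.11:g.100007038C>T' 'NM_012243.3:c.347C>T'    'NP_036375.1:p.Thr116Met' \
  'NC_000001.11:g.100007039G>A' 'NM_001271684.2:c.348G>A' 'NP_001258613.1:p.Thr116%3D' \
  'NC_000001.11:g.100007039G>A' 'NM_001271685.2:c.474G>A' 'NP_001258614.1:p.Thr158%3D' \
  > "$workdir/Input_file2"

# Same logic as the answer: collect the last field per key from Input_file2,
# then print the Input_file1 keys that were seen, with a tab before the list.
result=$(awk '
BEGIN { OFS = ", " }
FNR == NR {
    array[$1] = (array[$1] ? array[$1] OFS : "") $NF
    next
}
($1 in array) { print $1 "\t" array[$1] }
' "$workdir/Input_file2" "$workdir/Input_file1")

printf '%s\n' "$result"
rm -r "$workdir"
```

Only the first line of Input_file1 is printed: its second line ends in C>A, while the corresponding rows in Input_file2 end in G>A, so that key never matches.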

If you want to save the output back into Input_file1 itself, try the following:

awk '
BEGIN{
  OFS=", "
}
FNR==NR{
  array[$1]=(array[$1]?array[$1] OFS:"")$NF
  next
}
($1 in array){
  print $1"\t"array[$1]
}
'  Input_file2   Input_file1 > temp && mv temp Input_file1
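The `> temp && mv temp Input_file1` step can also be sketched with a uniquely named temporary file, so that an existing file called `temp` is not clobbered. This is only an illustration: the use of `mktemp`, the scratch directory, the shortened sample data and the `merged` variable are my assumptions, not part of the answer.

```shell
# Scratch copies of the inputs (a shortened, tab-delimited version of the sample).
workdir=$(mktemp -d)
printf '%s\n' \
  'NC_000001.11:g.100007038C>T' \
  'NC_000001.11:g.100007039C>A' > "$workdir/Input_file1"
printf '%s\t%s\t%s\n' \
  'NC_000001.11:g.100007038C>T' 'NM_001271684.2:c.347C>T' 'NP_001258613.1:p.Thr116Met' \
  'NC_000001.11:g.100007038C>T' 'NM_001271685.2:c.473C>T' 'NP_001258614.1:p.Thr158Met' \
  'NC_000001.11:g.100007039G>A' 'NM_001271684.2:c.348G>A' 'NP_001258613.1:p.Thr116%3D' \
  > "$workdir/Input_file2"

# Write to a unique temporary file first; replace Input_file1 only if awk succeeded.
tmp=$(mktemp "$workdir/merged.XXXXXX")
awk '
BEGIN { OFS = ", " }
FNR == NR {
    array[$1] = (array[$1] ? array[$1] OFS : "") $NF
    next
}
($1 in array) { print $1 "\t" array[$1] }
' "$workdir/Input_file2" "$workdir/Input_file1" > "$tmp" &&
    mv "$tmp" "$workdir/Input_file1"

merged=$(cat "$workdir/Input_file1")
printf '%s\n' "$merged"
rm -r "$workdir"
```

Note that after the move, Input_file1 holds only the rows that matched; the unmatched C>A row is no longer in the file.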

Explanation: a detailed explanation of the above code.

awk '                                                ##Start of the awk program.
BEGIN{                                               ##Start of the BEGIN section, run before any input is read.
  OFS=", "                                           ##Set OFS to comma plus space.
}
FNR==NR{                                             ##FNR==NR is true only while the first file, Input_file2, is being read.
  array[$1]=(array[$1]?array[$1] OFS:"")$NF          ##Append the last field of the line to array[$1], joined with OFS.
  next                                               ##Skip all further statements for this line.
}
($1 in array){                                       ##For Input_file1: if the 1st field is an index of array, do the following.
  print $1"\t"array[$1]                              ##Print the first column, a TAB, and the value stored in array[$1].
}
'  Input_file2  Input_file1                          ##Input file names; Input_file2 is read first.

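Another awk. Your sample code says FS=OFS="\t", but your expected output uses ", ", so I used the latter to join the values. Also, since only one item in file1 has a match in file2, the output cannot contain more than that one line.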

$ awk '
BEGIN {
    FS=OFS="\t"                         # delims
}
NR==FNR {                               # process file1
    a[$0]                               # hash key only to see if values in file2
    next
}
($1 in a) {                             # if found in file1
    a[$1]=a[$1] (a[$1]==""?"":", ") $3  # append to the corresponding item
}
END {                                   # in the end
    for(i in a)
        if(a[i]!="")                    # print all non-empty ones
            print i,a[i]
}' file1 file2

Output with your data:

NC_000001.11:g.100007038C>T     NP_001258613.1:p.Thr116Met, NP_001258614.1:p.Thr158Met, NP_036375.1:p.Thr116Met
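Since the question says file2 is sorted and has more than 3 million entries, a streaming variant may be worth considering. The sketch below is my own illustration, not taken from either answer. It assumes rows with the same first column are adjacent in file2, and the names `want`, `key`, `vals` and `result` are hypothetical. It keeps only the keys of file1 in memory and prints each group as soon as the key changes, so the output follows file2's order rather than the unspecified order of `for (i in a)`.

```shell
# Tab-delimited copies of the question's sample in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' \
  'NC_000001.11:g.100007038C>T' \
  'NC_000001.11:g.100007039C>A' > "$workdir/file1"
printf '%s\t%s\t%s\n' \
  'NC_000001.11:g.100007038C>T' 'NM_001271684.2:c.347C>T' 'NP_001258613.1:p.Thr116Met' \
  'NC_000001.11:g.100007038C>T' 'NM_001271685.2:c.473C>T' 'NP_001258614.1:p.Thr158Met' \
  'NC_000001.11:g.100007038C>T' 'NM_012243.3:c.347C>T'    'NP_036375.1:p.Thr116Met' \
  'NC_000001.11:g.100007039G>A' 'NM_001271684.2:c.348G>A' 'NP_001258613.1:p.Thr116%3D' \
  'NC_000001.11:g.100007039G>A' 'NM_001271685.2:c.474G>A' 'NP_001258614.1:p.Thr158%3D' \
  > "$workdir/file2"

result=$(awk '
BEGIN { FS = OFS = "\t" }
NR == FNR { want[$1]; next }        # file1: remember the keys only
!($1 in want) { next }              # file2: ignore keys that are not in file1
$1 != key {                         # first row of a new group
    if (key != "") print key, vals  # flush the previous group, if any
    key = $1; vals = $3
    next
}
{ vals = vals ", " $3 }             # same key as the previous row: extend the list
END { if (key != "") print key, vals }   # flush the last group
' "$workdir/file1" "$workdir/file2")

printf '%s\n' "$result"
rm -r "$workdir"
```

If file2 were not grouped by its first column, the same key could be printed more than once, so this variant relies on the sorted input mentioned in the question.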
