My two files look like this:
file1
NC_000001.11:g.100007038C>T
NC_000001.11:g.100007039C>A
file2
NC_000001.11:g.100007038C>T NM_001271684.2:c.347C>T NP_001258613.1:p.Thr116Met
NC_000001.11:g.100007038C>T NM_001271685.2:c.473C>T NP_001258614.1:p.Thr158Met
NC_000001.11:g.100007038C>T NM_012243.3:c.347C>T NP_036375.1:p.Thr116Met
NC_000001.11:g.100007039G>A NM_001271684.2:c.348G>A NP_001258613.1:p.Thr116%3D
NC_000001.11:g.100007039G>A NM_001271685.2:c.474G>A NP_001258614.1:p.Thr158%3D
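Desired output:
I want to match the first column of file2 against the first column of file1. If they match, I want to append the third column of file2 as a new column in file1, to get:
file1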
NC_000001.11:g.100007038C>T NP_001258613.1:p.Thr116Met, NP_001258614.1:p.Thr158Met, NP_036375.1:p.Thr116Met
Here is my attempt:
awk 'BEGIN{ FS=OFS="\t" }
NR==FNR {a[$0]; next;}
{
  for (k in a) {
    if ($1 == k) {
      print $0 "\t" a[$3]
    }
  }
}' file1.txt file2.txt
But it does not produce the output I want; instead I get:
NC_000001.11:g.100007038C>T NM_001271684.2:c.347C>T NP_001258613.1:p.Thr116Met
NC_000001.11:g.100007038C>T NM_001271685.2:c.473C>T NP_001258614.1:p.Thr158Met
NC_000001.11:g.100007038C>T NM_012243.3:c.347C>T NP_036375.1:p.Thr116Met
NC_000001.11:g.100007039G>A NM_001271684.2:c.348G>A NP_001258613.1:p.Thr116%3D
NC_000001.11:g.100007039G>A NM_001271685.2:c.474G>A NP_001258614.1:p.Thr158%3D
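Thanks in advance.
PS: file1 contains unique entries. file2 is sorted, tab-delimited, and contains more than 3 million entries.
Edit:
By tab-delimited I mean that the new column is appended as a tab-separated column, while the values inside that column are comma-separated.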
Could you please try the following, written and tested with the samples shown.
awk '
BEGIN{
OFS=", "
}
FNR==NR{
array[$1]=(array[$1]?array[$1] OFS:"")$NF
next
}
($1 in array){
print $1"\t"array[$1]
}
' Input_file2 Input_file1
If you want to save the output back into Input_file1, try the following:
awk '
BEGIN{
OFS=", "
}
FNR==NR{
array[$1]=(array[$1]?array[$1] OFS:"")$NF
next
}
($1 in array){
print $1"\t"array[$1]
}
' Input_file2 Input_file1 > temp && mv temp Input_file1
Explanation: a detailed, line-by-line explanation of the code above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this awk program from here.
OFS=", " ##Setting OFS as comma space here.
}
FNR==NR{ ##Checking condition FNR==NR which will be true when Input_file2 is being read.
array[$1]=(array[$1]?array[$1] OFS:"")$NF ##Creating array with index $1; the last field of the line is appended to its value, separated by OFS.
next ##next will skip all further statements from here.
}
($1 in array){ ##Checking condition if 1st field of current line is present in array then do following.
print $1"\t"array[$1] ##Printing first column TAB and then value of array with index $1 here.
}
' Input_file2 Input_file1 ##Mentioning Input_file names here.
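To try the program explained above end to end, here is a minimal sketch that rebuilds the question's sample data and runs the same awk code on it. The scratch directory from mktemp and the `result` variable are my own additions for illustration, and I am assuming that file2 really is tab-separated, as the PS in the question says.

```shell
#!/bin/sh
# Scratch directory and variable names are illustrative, not from the question.
dir=$(mktemp -d)

# file1: one key per line (sample data from the question).
printf '%s\n' \
  'NC_000001.11:g.100007038C>T' \
  'NC_000001.11:g.100007039C>A' > "$dir/Input_file1"

# file2: three tab-separated columns (assumed layout, sample data from the question).
printf '%s\t%s\t%s\n' \
  'NC_000001.11:g.100007038C>T' 'NM_001271684.2:c.347C>T' 'NP_001258613.1:p.Thr116Met' \
  'NC_000001.11:g.100007038C>T' 'NM_001271685.2:c.473C>T' 'NP_001258614.1:p.Thr158Met' \
  'NC_000001.11:g.100007038C>T' 'NM_012243.3:c.347C>T' 'NP_036375.1:p.Thr116Met' \
  'NC_000001.11:g.100007039G>A' 'NM_001271684.2:c.348G>A' 'NP_001258613.1:p.Thr116%3D' \
  'NC_000001.11:g.100007039G>A' 'NM_001271685.2:c.474G>A' 'NP_001258614.1:p.Thr158%3D' \
  > "$dir/Input_file2"

# Same program as above: collect the last field per key from file2,
# then print every file1 key that was seen, followed by a TAB and the list.
result=$(awk '
BEGIN{
  OFS=", "
}
FNR==NR{
  array[$1]=(array[$1]?array[$1] OFS:"")$NF
  next
}
($1 in array){
  print $1"\t"array[$1]
}
' "$dir/Input_file2" "$dir/Input_file1")

printf '%s\n' "$result"
rm -rf "$dir"
```

Only the 100007038C>T key is printed, because the second key in file1 (100007039C>A) does not occur in file2 (which has 100007039G>A).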
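Another awk. Your sample code says FS=OFS="\t" but the expected output uses ", ", so I used the former as the field separator and the latter to join the appended values. Also, only one item in file1 has a match in file2, so your expected output cannot be met.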
$ awk '
BEGIN {
FS=OFS="\t" # delims
}
NR==FNR { # process file1
a[$0] # hash key only to see if values in file2
next
}
($1 in a) { # if found in file1
a[$1]=a[$1] (a[$1]==""?"":", ") $3 # append to the corresponding item
}
END { # in the end
for(i in a)
if(a[i]!="") # print all non-empty ones
print i,a[i]
}' file1 file2
Output with your data:
NC_000001.11:g.100007038C>T NP_001258613.1:p.Thr116Met, NP_001258614.1:p.Thr158Met, NP_036375.1:p.Thr116Met
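One thing to keep in mind with the END loop above: the order in which `for (i in a)` visits keys is unspecified in awk, so with many keys the output order may differ from file1. A possible variant, sketched below, remembers the line order of file1 in a second array. The names `order`, `seen`, `n` and `ordered` and the scratch directory are hypothetical, and I am again assuming tab-separated input.

```shell
#!/bin/sh
# Scratch directory and variable names are illustrative, not from the question.
dir=$(mktemp -d)

printf '%s\n' \
  'NC_000001.11:g.100007038C>T' \
  'NC_000001.11:g.100007039C>A' > "$dir/file1"

printf '%s\t%s\t%s\n' \
  'NC_000001.11:g.100007038C>T' 'NM_001271684.2:c.347C>T' 'NP_001258613.1:p.Thr116Met' \
  'NC_000001.11:g.100007038C>T' 'NM_001271685.2:c.473C>T' 'NP_001258614.1:p.Thr158Met' \
  'NC_000001.11:g.100007038C>T' 'NM_012243.3:c.347C>T' 'NP_036375.1:p.Thr116Met' \
  'NC_000001.11:g.100007039G>A' 'NM_001271684.2:c.348G>A' 'NP_001258613.1:p.Thr116%3D' \
  'NC_000001.11:g.100007039G>A' 'NM_001271685.2:c.474G>A' 'NP_001258614.1:p.Thr158%3D' \
  > "$dir/file2"

ordered=$(awk '
BEGIN {
  FS=OFS="\t"                  # tab-separated input and output (assumption)
}
NR==FNR {                      # first file: file1
  order[++n]=$1                # remember the position of each key
  seen[$1]                     # register the key with an empty value
  next
}
($1 in seen) {                 # second file: only keys known from file1
  seen[$1]=seen[$1] (seen[$1]==""?"":", ") $3
}
END {
  for(i=1;i<=n;i++)            # walk the keys in file1 order
    if(seen[order[i]]!="")     # skip keys without any match
      print order[i],seen[order[i]]
}' "$dir/file1" "$dir/file2")

printf '%s\n' "$ordered"
rm -rf "$dir"
```

Like the code above, this keeps only the keys of file1 and their collected values in memory, so the 3 million lines of file2 are streamed rather than stored.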