How to use awk to match multiple keys from one file in another file, and print a value from the second file alongside the first



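I need to match 2 columns in a source file against two columns in a reference file, and print the third column of the reference file together with all columns of the source file. Each pair in the source file (about 150,000 lines) occurs only once in the reference file (about 15,000,000 lines), and the files are large, so I also need to stop searching the second file after the first match (like grep -m 1). I have made several attempts with awk and can get the search to work with a single key, but I need two keys: neither key is unique on its own, but the pair is. The reference file is too large to load into R or Python (25G as a gzipped file).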

file 1 (source, multiple columns, 150K lines):
CHR SNP BP INFO(multiple other columns)
1 ABS141 132156 Random_stuff
2 GSD1151 132143 Random_stuff
3 KJH173 465879 Random_stuff
file 2 (reference, three columns, 25Gb gzipped):
CHR POS ID
1 132156 rid1
1 654987 rid2
2 132143 rid3
2 787987 rid4
3 465879 rid5
desired output file (all columns from file 1 + column 3 from file 2):
CHR SNP BP INFO(columns) ID
1 ABS141 132156 Random_stuff rid1
2 GSD1151 132143 Random_stuff rid3
3 KJH173 465879 Random_stuff rid5
Approaches tried:
awk 'NR==FNR {label[$1,$2]=$3; next} (sst[$1,$3]=label[$1,$2]){print $0, label[$1,$2]}' file2 file1 > out_file
Result = empty file
awk 'NR==FNR {seen[$1,$2]=$3; next} NR{print $0, seen[$1,$3]}' file2 file1 > out_file
Result = empty file
awk 'NR==FNR {label[$1,$2]=$3; next} ($1 SUBSEP $3 in label){print $0, label[$1,$2]}' file2 file1 > out_file
Result: empty file

awk 'NR==FNR {label[$1,$2]=$3; next} out[$1,$3] in label {print $0, label[$1,$2]}' file2 file1 > out_file
Result: empty file
awk 'NR==FNR {seen[$2]=$3; next} NF{print $0, seen[$3]}' file2 file1 > out_file
1 ABS141 132156 Random_stuff rid1
2 GSD1151 132143 Random_stuff rid3
3 KJH173 465879 Random_stuff rid5
Result = file with ID placed correctly into file 1 as new column, but only uses 1 key (POS) instead of 2 keys (CHR + POS).

A few tweaks to OP's first awk attempt:

awk '
NR==FNR          { if (FNR==1) $2="BP"     # ensure we can match on 2nd file header row
                   label[$1,$2]=$3
                   next
                 }
($1,$3) in label { print $0, label[$1,$3] }
' file2 file1

This generates:

CHR SNP BP INFO(multiple other columns) ID
1 ABS141 132156 Random_stuff rid1
2 GSD1151 132143 Random_stuff rid3
3 KJH173 465879 Random_stuff rid5
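
The point of the tweak is that the membership test and the lookup must use the same compound key as the one stored while reading the reference file. Several of OP's attempts looked up label[$1,$2] on file1 lines, where $2 is the SNP name rather than the position. The following is a minimal sketch of the idiom; the sample rows, the temporary directory, and the `out` variable are made up for illustration and are not from the original post.

```shell
# Hypothetical sample data (not from the original post).
tmp=$(mktemp -d)
printf '%s\n' '1 100 a' '2 100 b' '2 200 c' > "$tmp/ref"   # columns: CHR POS ID
printf '%s\n' '1 x 100' '2 y 200' '3 z 100' > "$tmp/src"   # columns: CHR SNP BP

out=$(awk '
NR==FNR          { label[$1,$2] = $3; next }     # stored under $1 SUBSEP $2
($1,$3) in label { print $0, label[$1,$3] }      # test and lookup use the same key
' "$tmp/ref" "$tmp/src")

printf '%s\n' "$out"
rm -rf "$tmp"
```

With this sample the row "3 z 100" is dropped because the pair (3, 100) is not in the reference, even though position 100 exists on other chromosomes; a lookup keyed on position alone would have matched it.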

NOTE: this assumes OP can fit all of file2 in memory; if this is an invalid assumption (e.g., OP's code runs out of memory), then perhaps ...


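Assuming the entire expected result (i.e., file1, plus field #3 from file2, plus a hash of file1 fields #1 and #3) fits in memory, and that we need to keep the rows from file1 in their original order ...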

One awk idea that requires a single pass through each input file:

awk '
FNR==NR { ndx=$1 FS $3
          if (FNR==1) ndx = "CHR" FS "POS"     # override ndx to match header from 2nd file
          lines[ndx]=$0                        # save current line in memory
          order[FNR]=ndx                       # save order of current line
          maxFNR=FNR                           # keep track of total number of lines from 1st file
          next
        }
        { ndx=$1 FS $2
          if (ndx in lines)                    # if there is a match in the lines[] array then ...
             lines[ndx]=lines[ndx] FS $3       # append current field #3 to lines[] entry
        }
END     { for (i=1;i<=maxFNR;i++)              # loop through lines from 1st file and ...
              print lines[order[i]]            # print to stdout
        }
' file1 file2

This generates:

CHR SNP BP INFO(multiple other columns) ID
1 ABS141 132156 Random_stuff rid1
2 GSD1151 132143 Random_stuff rid3
3 KJH173 465879 Random_stuff rid5
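
The question also asks for two things the script above does not cover: reading the reference while it is still gzipped, and stopping once every pair has been found (the grep -m 1 behaviour). A possible variant is sketched below. It is not part of the original answer: the function name `lookup_streaming`, the counters `want` and `found`, the `hit[]` array, and the extra rid6 row in the demo are all hypothetical, and it assumes each (CHR, BP) pair is unique in file1 and that `gzip -dc` is available.

```shell
# Sketch only: $1 = source file (small), $2 = gzipped reference (large).
lookup_streaming() {
    gzip -dc "$2" | awk '
    FNR==NR { key = $1 SUBSEP $3
              if (FNR==1) key = "CHR" SUBSEP "POS"   # align header with the reference header
              if (!(key in lines)) want++            # number of distinct pairs to find
              lines[key] = $0                        # keep the source line in memory
              order[FNR] = key                       # remember the source order
              n = FNR
              next
            }
            { key = $1 SUBSEP $2
              if ((key in lines) && !(key in hit)) {
                  lines[key] = lines[key] OFS $3     # append the ID from the reference
                  hit[key] = 1
                  if (++found == want) exit          # all pairs found: stop reading, run END
              }
            }
    END     { for (i = 1; i <= n; i++) print lines[order[i]] }
    ' "$1" -
}

# Demo on the sample rows from the question, plus one made-up trailing row (rid6).
tmp=$(mktemp -d)
printf '%s\n' 'CHR SNP BP INFO' '1 ABS141 132156 Random_stuff' \
    '2 GSD1151 132143 Random_stuff' '3 KJH173 465879 Random_stuff' > "$tmp/file1"
printf '%s\n' 'CHR POS ID' '1 132156 rid1' '1 654987 rid2' '2 132143 rid3' \
    '2 787987 rid4' '3 465879 rid5' '3 999999 rid6' | gzip -c > "$tmp/file2.gz"
out=$(lookup_streaming "$tmp/file1" "$tmp/file2.gz")
printf '%s\n' "$out"
rm -rf "$tmp"
```

Only file1 is held in memory, so the size of the reference does not matter. The early exit only helps when the last wanted pair appears well before the end of the reference; source rows with no match are printed without an ID.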
