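I need to match 2 columns in a source file against two columns in a reference file, and print the third column from the reference file along with all columns from the source file. Each pair in the source file (about 150,000 lines) occurs only once in the reference file (about 15,000,000 lines), and the files are large, so I also need to stop searching the second file after the first match (like grep -m 1). I have tried awk several times and can get the lookup to work with a single key, but I need two keys, because neither key is unique on its own while the pair is. The reference file is too large to load into R or Python (25G as a gzipped file).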
file 1 (source, multiple columns, 150K lines):
CHR SNP BP INFO(multiple other columns)
1 ABS141 132156 Random_stuff
2 GSD1151 132143 Random_stuff
3 KJH173 465879 Random_stuff
file 2 (reference, three columns, 25Gb gzipped):
CHR POS ID
1 132156 rid1
1 654987 rid2
2 132143 rid3
2 787987 rid4
3 465879 rid5
desired output file (all columns from file 1 + column 3 from file 2):
CHR SNP BP INFO(columns) ID
1 ABS141 132156 Random_stuff rid1
2 GSD1151 132143 Random_stuff rid3
3 KJH173 465879 Random_stuff rid5
Approaches tried:
awk 'NR==FNR {label[$1,$2]=$3; next} (sst[$1,$3]=label[$1,$2]){print $0, label[$1,$2]}' file2 file1 > out_file
Result = empty file
awk 'NR==FNR {seen[$1,$2]=$3; next} NR{print $0, seen[$1,$3]}' file2 file1 > out_file
Result = empty file
awk 'NR==FNR {label[$1,$2]=$3; next} ($1 SUBSEP $3 in label){print $0, label[$1,$2]}' file2 file1 > out_file
Result: empty file
awk 'NR==FNR {label[$1,$2]=$3; next} out[$1,$3] in label {print $0, label[$1,$2]}' file2 file1 > out_file
Result: empty file
awk 'NR==FNR {seen[$2]=$3; next} NF{print $0, seen[$3]}' file2 file1 > out_file
1 ABS141 132156 Random_stuff rid1
2 GSD1151 132143 Random_stuff rid3
3 KJH173 465879 Random_stuff rid5
Result = file with ID placed correctly into file 1 as new column, but only uses 1 key (POS) instead of 2 keys (CHR + POS).
A few tweaks to OP's first awk attempt:
awk '
NR==FNR { if (FNR==1) $2="BP"    # ensure we can match on 2nd file header row
          label[$1,$2]=$3
          next
        }
($1,$3) in label { print $0, label[$1,$3] }
' file2 file1
This generates:
CHR SNP BP INFO(multiple other columns) ID
1 ABS141 132156 Random_stuff rid1
2 GSD1151 132143 Random_stuff rid3
3 KJH173 465879 Random_stuff rid5
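To illustrate how the two-key lookup works: in POSIX awk a subscript list such as [a,b] is stored as the single string a SUBSEP b, and a parenthesized list (a,b) in arr tests for that combined key without creating it. The following is a minimal standalone sketch; the array name ids and the hard-coded values are only for illustration.

```shell
# Demo: a two-part subscript is stored as one string joined by SUBSEP.
result=$(awk 'BEGIN {
    ids["1", "132156"] = "rid1"          # same as ids["1" SUBSEP "132156"]

    if (("1", "132156") in ids)          # parenthesized list tests the pair
        print "pair found:", ids["1", "132156"]

    if (!(("2", "132156") in ids))       # same POS, different CHR: no match
        print "pair not found"

    for (key in ids) {                   # recover both parts of a stored key
        split(key, part, SUBSEP)
        print "CHR=" part[1], "POS=" part[2]
    }
}')
printf '%s\n' "$result"
```

Because the membership test uses both fields, a POS value that occurs on several chromosomes only matches on the right CHR.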
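NOTE: this assumes the label[] array, which holds all of file2, fits in memory; if that is an invalid assumption (e.g., OP's code runs out of memory), then perhaps ...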
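Assuming the entire expected result (i.e., file1, plus field #3 from file2, plus a hash keyed on file1 fields #1 and #3) can fit into memory, and that we need to maintain the ordering of the rows from file1 ...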
One awk idea that requires a single pass through each input file:
awk '
FNR==NR { ndx=$1 FS $3
          if (FNR==1) ndx = "CHR" FS "POS"    # override ndx to match header from 2nd file
          lines[ndx]=$0                       # save current line in memory
          order[FNR]=ndx                      # save order of current line
          maxFNR=FNR                          # keep track of total number of lines from 1st file
          next
        }
        { ndx=$1 FS $2
          if (ndx in lines)                   # if there is a match in the lines[] array then ...
             lines[ndx]=lines[ndx] FS $3      # append current field #3 to lines[] entry
        }
END     { for (i=1;i<=maxFNR;i++)             # loop through lines from 1st file and ...
              print lines[order[i]]           # print to stdout
        }
' file1 file2
This generates:
CHR SNP BP INFO(multiple other columns) ID
1 ABS141 132156 Random_stuff rid1
2 GSD1151 132143 Random_stuff rid3
3 KJH173 465879 Random_stuff rid5
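The question also mentions that the reference is gzipped and that the search should stop once a match is found (like grep -m 1). One possible extension of the script above, offered as a sketch rather than a tested solution for the real data: stream the reference through gzip -dc, count the keys from file1 that are still unmatched, and exit as soon as that count reaches zero. The file name file2.gz, the scratch directory, and the pending and found variables are assumptions made for this example, and it relies on OP's statement that each CHR/POS pair occurs only once in the reference.

```shell
# Sample data from the question, written to a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'CHR SNP BP INFO' '1 ABS141 132156 Random_stuff' \
    '2 GSD1151 132143 Random_stuff' '3 KJH173 465879 Random_stuff' > "$tmp/file1"
printf '%s\n' 'CHR POS ID' '1 132156 rid1' '1 654987 rid2' '2 132143 rid3' \
    '2 787987 rid4' '3 465879 rid5' | gzip -c > "$tmp/file2.gz"

# Decompress the reference as a stream; "-" tells awk to read it from stdin.
out=$(gzip -dc "$tmp/file2.gz" | awk '
FNR==NR { ndx = $1 FS $3
          if (FNR==1) ndx = "CHR" FS "POS"   # match the header of the reference
          if (!(ndx in lines)) pending++     # count distinct keys still unmatched
          lines[ndx] = $0
          order[FNR] = ndx
          maxFNR = FNR
          next
        }
        { ndx = $1 FS $2
          if ((ndx in lines) && !(ndx in found)) {
              lines[ndx] = lines[ndx] FS $3  # append ID to the saved file1 line
              found[ndx] = 1                 # mark this key as matched
              if (--pending == 0) exit       # every key matched: stop reading
          }
        }
END     { for (i = 1; i <= maxFNR; i++) print lines[order[i]] }
' "$tmp/file1" -)
printf '%s\n' "$out"
rm -rf "$tmp"
```

Only file1 is held in memory here, and exit in the main block still runs the END block, so the collected lines are printed even when the reference is not read to the end. If some pair from file1 never appears in the reference, the whole stream is read and that line is printed without an ID.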