Match strings from one file against another and append values



I have two text files with 350M+ lines each. Their contents look like this:

file1:
>15_48499991_ENSG00000074803_C_G_G
CCAATCGCTTTCAAGTTAGTGTG
>15_48499991_ENSG00000074803_C_G_G
CAATCGCTTTCAAGTTAGTGTGA
>15_48499991_ENSG00000074803_C_G_G
AATCGCTTTCAAGTTAGTGTGAT
file2:
CCAATCGCTTTCAAGTTAGTGTG -14.48
CAATCGCTTTCAAGTTAGTGTGA -29.94
AATCGCTTTCAAGTTAGTGTGAT -20.58

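I want to match the sequence in the first column of file2 against the sequence lines in file1 and, on a match, add the value that follows > in file1 to the corresponding file2 line.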

The desired output is:

15_48499991_ENSG00000074803_C_G_G    CCAATCGCTTTCAAGTTAGTGTG    -14.48
15_48499991_ENSG00000074803_C_G_G    CAATCGCTTTCAAGTTAGTGTGA    -29.94
15_48499991_ENSG00000074803_C_G_G    AATCGCTTTCAAGTTAGTGTGAT    -20.58

Any suggestions would be helpful.

Thanks

Try this:

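# Build a lookup table from file2: sequence -> score.
with open('file2') as fin:
    data = dict(line.split() for line in fin if line.strip())

# Walk file1; remember each header and print it with the sequence that follows.
with open('file1') as fin:
    for line in fin:
        if line.startswith('>'):
            header = line[1:].strip()
        else:
            seq = line.strip()
            if seq in data:  # Skip sequences that have no score in file2.
                print(header, seq, data[seq], sep='    ')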

You can build a dictionary from file1 and use it to process file2.

from io import StringIO
file1 = '''
>15_48499991_ENSG00000074803_C_G_G
CCAATCGCTTTCAAGTTAGTGTG
>15_48499991_ENSG00000074803_C_G_G
CAATCGCTTTCAAGTTAGTGTGA
>15_48499991_ENSG00000074803_C_G_G
AATCGCTTTCAAGTTAGTGTGAT
'''
file2 = '''
CCAATCGCTTTCAAGTTAGTGTG -14.48
CAATCGCTTTCAAGTTAGTGTGA -29.94
AATCGCTTTCAAGTTAGTGTGAT -20.58
'''
# Create a look-up table from the first file: sequence -> header.
lookup = {}
with StringIO(file1.strip()) as file:  # Stand-in for open('file1').
    for line in file:
        header = line.rstrip()[1:]  # Remove leading '>'.
        sequence = next(file).rstrip()  # The sequence is on the next line.
        lookup[sequence] = header

# Output matches in the desired format.
with StringIO(file2.strip()) as file:  # Stand-in for open('file2').
    for line in file:
        sequence, score = line.split()
        if sequence in lookup:
            print(f'{lookup[sequence]}    {sequence}    {score}')
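
The snippet above uses in-memory strings as stand-ins for the two files. A minimal sketch of the same dictionary lookup as a reusable function that accepts any iterable of lines (open file handles included) might look like this. The name `annotate` and its parameters are hypothetical, and it assumes file1 strictly alternates header and sequence lines. With 350M+ lines the dictionary may not fit in memory, in which case a sort-based approach such as `join` is probably the safer choice.

```python
def annotate(fasta_lines, score_lines, sep="    "):
    """Yield 'header<sep>sequence<sep>score' for each scored sequence known from file1.

    fasta_lines: iterable of lines alternating '>header' and 'sequence' (assumed layout).
    score_lines: iterable of whitespace-separated 'sequence score' lines.
    """
    # Strip every line and drop blank ones.
    rows = (ln.strip() for ln in fasta_lines)
    rows = (ln for ln in rows if ln)

    # zip(rows, rows) consumes the same generator twice per step,
    # pairing each header line with the sequence line after it.
    lookup = {}
    for header, sequence in zip(rows, rows):
        lookup[sequence] = header.lstrip(">")

    for line in score_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # Skip blank or malformed lines.
        sequence, score = parts
        header = lookup.get(sequence)
        if header is not None:
            yield sep.join((header, sequence, score))
```

With real files this could be driven by `with open('file1') as f1, open('file2') as f2:` followed by `for row in annotate(f1, f2): print(row)`. Because the function is a generator, only the lookup table is held in memory; file2 is streamed line by line.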

Using bash/join, assuming the delimiter in file2 is a tab:

join -t $'\t' -1 2 -2 1 -o 1.1,1.2,2.2 \
    <(paste - - < file1 | sed 's/^>//' | sort -t $'\t' -k2,2) \
    <(sort -t $'\t' -k1,1 file2)
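
For readers who prefer Python, the idea behind joining two sorted inputs can be sketched as a merge join: both inputs are sorted by sequence and then walked in lockstep, so neither needs to be held in memory. The function `merge_join` is a hypothetical illustration, it assumes each sequence occurs at most once per input, and it is not a description of what `join` does internally.

```python
def merge_join(left, right):
    """Join two iterables of (key, value) pairs, each already sorted by key.

    Yields (key, left_value, right_value) for keys present in both inputs.
    Assumes keys are unique within each input.
    """
    left_iter, right_iter = iter(left), iter(right)
    a = next(left_iter, None)
    b = next(right_iter, None)
    while a is not None and b is not None:
        if a[0] == b[0]:
            yield a[0], a[1], b[1]
            a = next(left_iter, None)
            b = next(right_iter, None)
        elif a[0] < b[0]:
            a = next(left_iter, None)  # Left key is behind; advance left.
        else:
            b = next(right_iter, None)  # Right key is behind; advance right.


# Small demonstration using the sample records from the question.
headers = sorted([
    ("CCAATCGCTTTCAAGTTAGTGTG", "15_48499991_ENSG00000074803_C_G_G"),
    ("CAATCGCTTTCAAGTTAGTGTGA", "15_48499991_ENSG00000074803_C_G_G"),
    ("AATCGCTTTCAAGTTAGTGTGAT", "15_48499991_ENSG00000074803_C_G_G"),
])
scores = sorted([
    ("CCAATCGCTTTCAAGTTAGTGTG", "-14.48"),
    ("CAATCGCTTTCAAGTTAGTGTGA", "-29.94"),
    ("AATCGCTTTCAAGTTAGTGTGAT", "-20.58"),
])
for seq, header, score in merge_join(headers, scores):
    print(header, seq, score, sep="    ")
```

Both inputs must be sorted under the same ordering. Python compares strings by code point, so files pre-sorted with `sort` would need `LC_ALL=C` to agree with it.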
