I have two text files with 350M+ lines each. Their contents look like this:
file1:
>15_48499991_ENSG00000074803_C_G_G
CCAATCGCTTTCAAGTTAGTGTG
>15_48499991_ENSG00000074803_C_G_G
CAATCGCTTTCAAGTTAGTGTGA
>15_48499991_ENSG00000074803_C_G_G
AATCGCTTTCAAGTTAGTGTGAT
file2:
CCAATCGCTTTCAAGTTAGTGTG -14.48
CAATCGCTTTCAAGTTAGTGTGA -29.94
AATCGCTTTCAAGTTAGTGTGAT -20.58
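I want to match the values in column 1 of file2 against the strings in file1 and, on a match, add the value after the > in file1 to the corresponding file2 line.
The desired output is: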
15_48499991_ENSG00000074803_C_G_G CCAATCGCTTTCAAGTTAGTGTG -14.48
15_48499991_ENSG00000074803_C_G_G CAATCGCTTTCAAGTTAGTGTGA -29.94
15_48499991_ENSG00000074803_C_G_G AATCGCTTTCAAGTTAGTGTGAT -20.58
Any suggestions would be helpful.
Thanks
Try this:
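with open('file2') as fin:
    # Map each sequence (column 1 of file2) to its score (column 2).
    data = dict(line.split() for line in fin)
with open('file1') as fin:
    for line in fin:
        if line.startswith('>'):
            # Header line: print it without '>' and stay on the same output line.
            print(line[1:].strip(), end=' ')
        else:
            stripped = line.strip()
            print(stripped, data[stripped])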
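The question asks for output only "on a match", so the same lookup can be wrapped so that sequences missing from file2 are skipped instead of raising a KeyError. A minimal sketch, assuming Python 3; the function name `annotate` and the short sample lists are illustrative and not part of the answer above:

```python
def annotate(fasta_lines, score_lines):
    """Yield 'header sequence score' for every file1 sequence found in file2."""
    # Map each sequence (column 1 of file2) to its score (column 2).
    scores = dict(line.split() for line in score_lines if line.strip())
    header = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue  # Ignore blank lines.
        if line.startswith('>'):
            header = line[1:]  # Remember the most recent header.
        elif line in scores:
            yield f'{header} {line} {scores[line]}'


# Made-up sample data: CCCG has no score, so it is skipped.
fasta = ['>id_1\n', 'AAAC\n', '>id_2\n', 'CCCG\n']
scores = ['AAAC -1.5\n', 'TTTT -2.0\n']
for row in annotate(fasta, scores):
    print(row)
```

With real data, file objects from open('file1') and open('file2') can be passed in place of the lists.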
You can build a dictionary from file1 and use it to process file2.
from io import StringIO

# Inline stand-ins for the two files; the backslash avoids a leading blank line.
file1 = '''\
>15_48499991_ENSG00000074803_C_G_G
CCAATCGCTTTCAAGTTAGTGTG
>15_48499991_ENSG00000074803_C_G_G
CAATCGCTTTCAAGTTAGTGTGA
>15_48499991_ENSG00000074803_C_G_G
AATCGCTTTCAAGTTAGTGTGAT
'''
file2 = '''\
CCAATCGCTTTCAAGTTAGTGTG -14.48
CAATCGCTTTCAAGTTAGTGTGA -29.94
AATCGCTTTCAAGTTAGTGTGAT -20.58
'''

# Create a look-up table from first file: sequence -> header.
lookup = {}
with StringIO(file1) as file:  # Open file1.
    for line in file:
        header = line.rstrip()[1:]  # Remove leading '>'.
        sequence = next(file).rstrip()  # The sequence follows its header.
        lookup[sequence] = header

# Output matches in desired format.
with StringIO(file2) as file:  # Open file2.
    for line in file:
        sequence, score = line.split()
        print(f'{lookup[sequence]} {sequence} {score}')
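With 350M+ lines, a dictionary holding every sequence may not fit in memory. If the two files happen to list the sequences in the same order, as they do in the sample (an assumption the question does not confirm), they can be read in lockstep with no look-up table at all. A rough sketch; `merge_in_order` is a hypothetical name and the data is made up:

```python
from io import StringIO


def merge_in_order(fasta, scores):
    """Pair each header/sequence record with the score line at the same position.

    Both arguments must be iterators over lines, such as open file objects.
    """
    for header, score_line in zip(fasta, scores):
        # The sequence line follows its header; '' if the file ends early.
        sequence = next(fasta, '').strip()
        scored_sequence, score = score_line.split()
        if sequence != scored_sequence:
            # The same-order assumption does not hold for this input.
            raise ValueError(f'order mismatch: {sequence!r} vs {scored_sequence!r}')
        yield f'{header.strip()[1:]} {sequence} {score}'


file1 = StringIO('>id_1\nAAAC\n>id_2\nCCCG\n')
file2 = StringIO('AAAC -1.5\nCCCG -2.0\n')
result = list(merge_in_order(file1, file2))
print('\n'.join(result))
```

If the ValueError fires on real data, the files are not aligned and a dictionary or a sort-based approach is needed instead.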
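Using bash/join, assuming the delimiter in file2 is a tab:
join -t $'\t' -1 2 -2 1 \
    <(cat file1 | paste - - | sort -t $'\t' -k2,2) \
    <(sort -t $'\t' -k1,1 file2)
Note that join prints the join field first, so the columns come out as sequence, header (still with its leading >), score.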
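For readers less familiar with `paste - -`: it joins each pair of consecutive input lines with a tab, turning the two-line records of file1 into one line per record so that sort and join can work on columns. The same pairing, sketched in Python for illustration only (the helper name `paste_pairs` is made up, and the sketch assumes an even number of lines):

```python
def paste_pairs(lines):
    """Join consecutive lines in pairs with a tab, similar to `paste - -`."""
    stripped = iter(line.rstrip('\n') for line in lines)
    # zip over the same iterator twice yields (line 1, line 2), (line 3, line 4), ...
    return [f'{first}\t{second}' for first, second in zip(stripped, stripped)]


rows = paste_pairs(['>id_1\n', 'AAAC\n', '>id_2\n', 'CCCG\n'])
for row in rows:
    print(row)
```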