I have the following simplified data structure:
input = [("FileName1", "ID1", "Sequence1", 1000),
("FileName1", "ID1", "Sequence2", 500),
("FileName1", "ID2", "Sequence3", 1500),
("FileName1", "ID2", "Sequence5", 200),
("FileName2", "ID1", "Sequence1", 500),
("FileName2", "ID1", "Sequence2", 1000),
("FileName2", "ID2", "Sequence3", 250),
("FileName2", "ID2", "Sequence5", 2000)]
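Here, a given ID can be linked to several sequences (the number of sequences per ID varies), and several IDs can be linked to a given file name (the number of IDs per FileName varies as well).
What I want is to extract, for each ID in each file, the FileName/ID/Sequence triplet with the highest intensity (the fourth field):
Output: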
output = [("FileName1", "ID1", "Sequence1"),
("FileName1", "ID2", "Sequence3"),
("FileName2", "ID1", "Sequence2"),
("FileName2", "ID2", "Sequence5")]
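In the end I need a single sequence for each ID (the one with the highest value), together with its FileName, because I need all of this information to map the results onto a data frame afterwards.
A file name will then no longer contain any duplicated ID, and exactly one sequence will be linked to each ID.
Thanks for your help.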
Using itertools:
import itertools
input = [("FileName1", "ID1", "Sequence1", 1000),
("FileName1", "ID1", "Sequence2", 500),
("FileName1", "ID2", "Sequence3", 1500),
("FileName1", "ID2", "Sequence5", 200),
("FileName2", "ID1", "Sequence1", 500),
("FileName2", "ID1", "Sequence2", 1000),
("FileName2", "ID2", "Sequence3", 250),
("FileName2", "ID2", "Sequence5", 2000)]
result = []
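# groupby only merges consecutive items, so the list must already be ordered by (FileName, ID)
for k, v in itertools.groupby(input, lambda x: (x[0], x[1])):
    # keep the tuple with the highest intensity (last field) in each group
    result.append(max(v, key=lambda x: x[-1]))
# OR
# result = [max(v, key=lambda x: x[-1]) for k, v in itertools.groupby(input, lambda x: (x[0], x[1]))]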
print(result)
Output:
[('FileName1', 'ID1', 'Sequence1', 1000),
('FileName1', 'ID2', 'Sequence3', 1500),
('FileName2', 'ID1', 'Sequence2', 1000),
('FileName2', 'ID2', 'Sequence5', 2000)]
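The result above still carries the intensity as a fourth field, and itertools.groupby only groups rows that sit next to each other. If the real data is not guaranteed to be ordered by file name and ID, a minimal sketch along the following lines may help: it sorts first, then drops the intensity so that only the FileName/ID/Sequence triplets remain. The names `rows` and `best_triplets` are made up for this illustration, and the sample list is the question's data, deliberately shuffled.

```python
from itertools import groupby
from operator import itemgetter

# Same data as in the question, shuffled on purpose (illustrative only).
rows = [
    ("FileName2", "ID2", "Sequence5", 2000),
    ("FileName1", "ID1", "Sequence1", 1000),
    ("FileName1", "ID2", "Sequence3", 1500),
    ("FileName2", "ID1", "Sequence1", 500),
    ("FileName1", "ID1", "Sequence2", 500),
    ("FileName2", "ID2", "Sequence3", 250),
    ("FileName1", "ID2", "Sequence5", 200),
    ("FileName2", "ID1", "Sequence2", 1000),
]


def best_triplets(rows):
    """Return one (FileName, ID, Sequence) triplet per (FileName, ID) pair,
    picking the row with the highest intensity (hypothetical helper)."""
    key = itemgetter(0, 1)            # group on (FileName, ID)
    ordered = sorted(rows, key=key)   # groupby needs equal keys to be adjacent
    return [
        max(group, key=itemgetter(3))[:3]  # highest intensity, then drop it
        for _, group in groupby(ordered, key=key)
    ]


triplets = best_triplets(rows)
print(triplets)
```

If two sequences share the same maximum intensity, max keeps the first one it meets, which after the stable sort is the one that appeared first in the original list.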