Extracting the string with the maximum value from a list of tuples with repeated elements



I have the following simplified data structure:

input = [("FileName1", "ID1", "Sequence1", 1000),
         ("FileName1", "ID1", "Sequence2", 500),
         ("FileName1", "ID2", "Sequence3", 1500),
         ("FileName1", "ID2", "Sequence5", 200),
         ("FileName2", "ID1", "Sequence1", 500),
         ("FileName2", "ID1", "Sequence2", 1000),
         ("FileName2", "ID2", "Sequence3", 250),
         ("FileName2", "ID2", "Sequence5", 2000)]

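Here, a given ID can be linked to several sequences (the number of sequences per ID is not always the same), and several IDs can be linked to a given file name (the number of IDs per FileName is not always the same).

What I want is to extract, for each ID within each FileName, the FileName/ID/Sequence triple with the maximum intensity (the last element of each tuple):

Output: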

output = [("FileName1", "ID1", "Sequence1"),
          ("FileName1", "ID2", "Sequence3"),
          ("FileName2", "ID1", "Sequence2"),
          ("FileName2", "ID2", "Sequence5")]

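In the end I need a single sequence for each ID (the one with the maximum value), together with the FileName, because I need all of this information to map it onto a DataFrame afterwards.

A file name will then no longer contain any repeated IDs, and exactly one sequence will be linked to each ID.

Thanks for your help.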

Using itertools:

import itertools

# Note: the name `input` is kept from the question; it shadows the built-in input().
input = [("FileName1", "ID1", "Sequence1", 1000),
         ("FileName1", "ID1", "Sequence2", 500),
         ("FileName1", "ID2", "Sequence3", 1500),
         ("FileName1", "ID2", "Sequence5", 200),
         ("FileName2", "ID1", "Sequence1", 500),
         ("FileName2", "ID1", "Sequence2", 1000),
         ("FileName2", "ID2", "Sequence3", 250),
         ("FileName2", "ID2", "Sequence5", 2000)]

# groupby only merges adjacent items, so the list must already be ordered
# by (FileName, ID), as it is here; otherwise sort it by that key first.
result = []
for k, v in itertools.groupby(input, lambda x: (x[0], x[1])):
    # Keep the tuple with the largest last element (the intensity) in each group.
    result.append(max(v, key=lambda x: x[-1]))
# OR
# result = [max(v, key=lambda x: x[-1]) for k, v in itertools.groupby(input, lambda x: (x[0], x[1]))]

print(result)

[('FileName1', 'ID1', 'Sequence1', 1000),
('FileName1', 'ID2', 'Sequence3', 1500),
('FileName2', 'ID1', 'Sequence2', 1000),
('FileName2', 'ID2', 'Sequence5', 2000)]
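
The result above still carries the intensity as a fourth element, and groupby relies on the rows already being ordered by FileName and ID. As a possible alternative, here is a minimal sketch that uses a plain dictionary instead, so the input order does not matter and only the FileName/ID/Sequence triples are returned. The function name best_per_id and the variable names are made up for this illustration; on ties it is assumed that the first row seen should win.

```python
def best_per_id(rows):
    """Return one (file_name, id, sequence) triple per (file_name, id) pair,
    choosing the sequence with the highest intensity."""
    best = {}  # (file_name, id) -> (sequence, intensity)
    for file_name, id_, sequence, intensity in rows:
        key = (file_name, id_)
        # Replace the stored entry only when a strictly larger intensity appears,
        # so the first row wins on ties (an assumption, not stated in the question).
        if key not in best or intensity > best[key][1]:
            best[key] = (sequence, intensity)
    # Dictionaries keep insertion order (Python 3.7+), so groups come out
    # in the order in which they were first seen.
    return [(f, i, seq) for (f, i), (seq, _) in best.items()]


rows = [("FileName1", "ID1", "Sequence1", 1000),
        ("FileName1", "ID1", "Sequence2", 500),
        ("FileName1", "ID2", "Sequence3", 1500),
        ("FileName1", "ID2", "Sequence5", 200),
        ("FileName2", "ID1", "Sequence1", 500),
        ("FileName2", "ID1", "Sequence2", 1000),
        ("FileName2", "ID2", "Sequence3", 250),
        ("FileName2", "ID2", "Sequence5", 2000)]

triples = best_per_id(rows)
print(triples)
```

For the sample data this should give the four triples listed as the desired output in the question. If the groupby version is preferred, sorting the list by (FileName, ID) before grouping should make it safe for unordered input as well.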
