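I'm working on a project and have reached the point where I need to remove duplicates from a main list. I have three lists, and I'm trying to eliminate the duplicates in the flight_ID list. I managed to do that, but I can't work out how to also remove the elements of the other lists that correspond to the entries removed from flight_ID.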
# All lists have a length of 20
flight_ID = ['1064662221', '1064617390', '1064614152', '1064614152',
'1064775880', '1064645826', '1064645826', '1064664535', '1064659772',
'1064659772', '1064614050', '1064614050', '1064614286', '1064614286',
'1064614286', '1064614286', '1064614286', '1064614286', '1064614286', '1064646536']
flight_number = ['1827', '1585', '8409', '1465', '30', '9188', '2232', '3760', '579', '3309', '1259', '2193', '6566', '2231', '5214', '8601', '3169', '1601', '7832', '335']
airline_Code = ['TK', 'AY', 'DL', 'AF', 'FX', 'UA', 'LH', 'U2', 'SK', 'A3', 'AF', 'KL', 'VS', 'UX', 'G3', 'UU', 'KQ', 'AF', 'AR', 'LO']
I used the following function to remove duplicates from the main list:
def remove_dup(a):
    i = 0
    while i < len(a):
        j = i + 1
        while j < len(a):
            if a[i] == a[j]:
                del a[j]
            else:
                j += 1
        i += 1
remove_dup(flight_ID)
# OUTPUT
['1064662221', '1064617390', '1064614152', '1064775880', '1064645826', '1064664535', '1064659772', '1064614050', '1064614286', '1064646536']
# 10 elements have been removed.
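Now, as described above, I need to do the same for the other lists, so that the items at the same positions as the entries removed from the main list (flight_ID) are removed as well.
Note: although the main list contains duplicates, the matching items in the other lists are not duplicated.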
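If you are going to do more processing on data structured the way you describe, I recommend using Pandas, since it makes operations such as dropping duplicates painless: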
import pandas as pd
# Make a DataFrame
flight_ID = ['1064662221', '1064617390', ...]
flight_number = ['1827', '1585', '8409', ...]
airline_Code = ['TK', 'AY', 'DL', ...]
df = pd.DataFrame({'flight_ID': flight_ID,
                   'flight_number': flight_number,
                   'airline_Code': airline_Code})
# Remove duplicates - just one line!
df.drop_duplicates('flight_ID', inplace=True)
You end up with a DataFrame that looks like this:
     flight_ID flight_number airline_Code
0   1064662221          1827           TK
1   1064617390          1585           AY
2   1064614152          8409           DL
4   1064775880            30           FX
5   1064645826          9188           UA
7   1064664535          3760           U2
8   1064659772           579           SK
10  1064614050          1259           AF
12  1064614286          6566           VS
19  1064646536           335           LO
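If the rest of the code still expects three separate lists, the deduplicated columns can be converted back. This is a minimal sketch, assuming a shortened five-row sample of the question's data; the name `deduped` is only illustrative, and `reset_index(drop=True)` is optional (it just renumbers the rows).

```python
import pandas as pd

# Shortened sample of the question's data (first five rows)
flight_ID = ['1064662221', '1064617390', '1064614152', '1064614152', '1064775880']
flight_number = ['1827', '1585', '8409', '1465', '30']
airline_Code = ['TK', 'AY', 'DL', 'AF', 'FX']

df = pd.DataFrame({'flight_ID': flight_ID,
                   'flight_number': flight_number,
                   'airline_Code': airline_Code})

# keep='first' is the default: the first row for each flight_ID survives
deduped = df.drop_duplicates('flight_ID').reset_index(drop=True)

# Convert each column back to a plain Python list
flight_ID = deduped['flight_ID'].tolist()
flight_number = deduped['flight_number'].tolist()
airline_Code = deduped['airline_Code'].tolist()

print(flight_ID)
print(flight_number)
print(airline_Code)
```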
First, change the representation so that related items are linked together, rather than using parallel lists.
flight_list = zip(flight_ID, flight_number, airline_Code)
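This makes it easier to remove the three related items together.
Now use any standard method to remove the duplicates. Build a new list in each case: as many posts on this site document, modifying the object you are iterating over is a bad idea. Keeping to the level of programming you demonstrated: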
unique_flight = []
found_ID = set()
for flight in flight_list:
    if flight[0] not in found_ID:
        found_ID.add(flight[0])
        unique_flight.append(flight)
for flight in unique_flight:
    print(flight)
Output:
('1064662221', '1827', 'TK')
('1064617390', '1585', 'AY')
('1064614152', '8409', 'DL')
('1064775880', '30', 'FX')
('1064645826', '9188', 'UA')
('1064664535', '3760', 'U2')
('1064659772', '579', 'SK')
('1064614050', '1259', 'AF')
('1064614286', '6566', 'VS')
('1064646536', '335', 'LO')
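If three parallel lists are still needed afterwards, the deduplicated tuples can be unzipped again with `zip(*...)`. This is a minimal sketch, assuming a shortened sample of the question's data; it would fail on an empty input, because there would be nothing to unpack.

```python
# Shortened sample of the question's data
flight_ID = ['1064614152', '1064614152', '1064775880', '1064645826', '1064645826']
flight_number = ['8409', '1465', '30', '9188', '2232']
airline_Code = ['DL', 'AF', 'FX', 'UA', 'LH']

unique_flight = []
found_ID = set()
for flight in zip(flight_ID, flight_number, airline_Code):
    if flight[0] not in found_ID:
        found_ID.add(flight[0])
        unique_flight.append(flight)

# zip(*rows) transposes the rows back into columns (as tuples),
# which are then converted to lists
flight_ID, flight_number, airline_Code = (list(col) for col in zip(*unique_flight))

print(flight_ID)
print(flight_number)
print(airline_Code)
```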
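There are several approaches here, but I would consider using a class to represent this kind of data (along the lines of the namedtuple example).
Add each flight_ID as a key in a dictionary, which makes it unique, with its index as the value. Note that a later duplicate overwrites the index stored for an earlier one, so the other lists keep the entry from the last occurrence of each ID: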
flight_ID_inds = {f: i for i, f in enumerate(flight_ID)}
flight_ID = list(flight_ID_inds.keys())
flight_number = [flight_number[i] for i in flight_ID_inds.values()]
airline_Code = [airline_Code[i] for i in flight_ID_inds.values()]
The same idea, but with the value being a tuple of the data from the other lists instead of an index:
dic = {fid: (fn, ac) for fid, fn, ac in zip(flight_ID, flight_number, airline_Code)}
flight_ID = list(dic.keys())
flight_number = [x[0] for x in dic.values()]
airline_Code = [x[1] for x in dic.values()]
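In a dict comprehension, a repeated key has its value overwritten, so the data kept for each ID comes from its last occurrence. If the first occurrence should win instead, `dict.setdefault` is one option. This is a minimal sketch, assuming a shortened sample of the question's data and an illustrative dictionary name `first_seen`.

```python
# Shortened sample of the question's data
flight_ID = ['1064614152', '1064614152', '1064775880', '1064645826', '1064645826']
flight_number = ['8409', '1465', '30', '9188', '2232']
airline_Code = ['DL', 'AF', 'FX', 'UA', 'LH']

first_seen = {}
for fid, fn, ac in zip(flight_ID, flight_number, airline_Code):
    # setdefault stores the value only if the key is not present yet
    first_seen.setdefault(fid, (fn, ac))

# Dicts preserve insertion order (Python 3.7+), so the order of first appearance is kept
flight_ID = list(first_seen)
flight_number = [fn for fn, _ in first_seen.values()]
airline_Code = [ac for _, ac in first_seen.values()]

print(flight_ID)
print(flight_number)
print(airline_Code)
```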
Using a namedtuple (a list of dicts would also work):
from collections import namedtuple
flight_nt = namedtuple("Flight", "flight_ID, flight_number, airline_Code")
flights = [flight_nt(fid, fn, ac) for fid, fn, ac in zip(flight_ID, flight_number, airline_Code)]
uniq_ids = set()
uniq_flights = []
for f in flights:
    if f.flight_ID not in uniq_ids:
        uniq_ids.add(f.flight_ID)
        uniq_flights.append(f)
flight_ID = [x.flight_ID for x in uniq_flights]
flight_number = [x.flight_number for x in uniq_flights]
airline_Code = [x.airline_Code for x in uniq_flights]
For this kind of problem I would recommend an object-oriented approach (a class or a dataclass):
class Flight:
    def __init__(self, flight_id, flight_number, airline_code):
        self.flight_id = flight_id
        self.flight_number = flight_number
        self.airline_code = airline_code

    def __hash__(self):
        # Hash on the ID only, so flights with the same ID collide in a set
        return hash(self.flight_id)

    def __eq__(self, other):
        if not isinstance(other, Flight):
            return NotImplemented
        return other.flight_id == self.flight_id

flights = [Flight(fid, fn, ac) for fid, fn, ac in zip(flight_ID, flight_number, airline_Code)]
# Note: a set does not preserve the original order
uniq_flights = set(flights)
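The dataclass variant mentioned above could look roughly like this. It is a sketch, not the answerer's code: it assumes a shortened sample of the data, marks the non-ID fields with `compare=False` so that equality and hashing depend on the ID alone, and uses `dict.fromkeys` instead of `set` to keep the original order.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Flight:
    flight_id: str
    # compare=False leaves these fields out of the generated __eq__ and __hash__
    flight_number: str = field(compare=False)
    airline_code: str = field(compare=False)

# Shortened sample of the question's data
flight_ID = ['1064614152', '1064614152', '1064775880', '1064645826', '1064645826']
flight_number = ['8409', '1465', '30', '9188', '2232']
airline_Code = ['DL', 'AF', 'FX', 'UA', 'LH']

flights = [Flight(*row) for row in zip(flight_ID, flight_number, airline_Code)]

# dict.fromkeys keeps insertion order; in CPython the first of any
# equal keys is the object that stays in the dict
uniq_flights = list(dict.fromkeys(flights))

for f in uniq_flights:
    print(f)
```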
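@Prune has a better solution, but you can always use enumerate(). Walking the indices in reverse keeps a deletion from shifting positions that have not been visited yet:
for index, fid in reversed(list(enumerate(flight_ID))):
    # Delete this entry if the same ID appears again later in the list
    if fid in flight_ID[index + 1:]:
        del flight_ID[index]
        del flight_number[index]
        del airline_Code[index]
Note that this keeps the last occurrence of each ID rather than the first, so the order of first appearance is not preserved. To keep the first occurrence instead, test against the slice before the current index (flight_ID[:index]).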
You can first determine which elements to keep or drop, and then use itertools.compress to filter the lists:
import itertools as it
keep = []
seen = set()
for x in flight_ID:
    keep.append(x not in seen)
    seen.add(x)
flight_ID = list(it.compress(flight_ID, keep))
flight_number = list(it.compress(flight_number, keep))
airline_Code = list(it.compress(airline_Code, keep))
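The same mask-and-compress idea can be wrapped in a small helper that handles any number of parallel lists. This is a sketch; the function name `dedupe_parallel` is hypothetical and the sample data is a shortened extract of the question's lists.

```python
import itertools as it

def dedupe_parallel(key_list, *other_lists):
    """Return new lists keeping only the positions where the key is first seen."""
    seen = set()
    keep = []
    for key in key_list:
        keep.append(key not in seen)  # True only for the first occurrence
        seen.add(key)
    # Apply the same boolean mask to every list
    return [list(it.compress(lst, keep)) for lst in (key_list, *other_lists)]

ids, numbers, codes = dedupe_parallel(
    ['1064614152', '1064614152', '1064775880'],
    ['8409', '1465', '30'],
    ['DL', 'AF', 'FX'],
)
print(ids, numbers, codes)
```

The input lists are left untouched, which avoids the pitfalls of deleting from a list while looping over it.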
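However, since these data seem to belong together logically, it is probably a good idea to create a dedicated container type for them, for example with namedtuple:
from collections import namedtuple
FlightData = namedtuple('FlightData', 'id number code')
data = [FlightData(*x) for x in zip(flight_ID, flight_number, airline_Code)]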
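Another approach is then to use itertools.groupby. The data has to be sorted by ID first, so the result comes out in ID order rather than in the original order:
import operator as op
unique_data = [next(g) for k, g in it.groupby(sorted(data, key=op.itemgetter(0)), key=op.itemgetter(0))]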
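For reference, here is a self-contained sketch of the groupby approach, assuming a three-row sample taken from the question's data. Sorting on the ID alone is stable, so within each group the row that appeared first in the input is the one that `next(group)` returns.

```python
import itertools as it
import operator as op
from collections import namedtuple

FlightData = namedtuple('FlightData', 'id number code')

# Small sample; the first row is deliberately out of ID order
data = [
    FlightData('1064775880', '30', 'FX'),
    FlightData('1064614152', '8409', 'DL'),
    FlightData('1064614152', '1465', 'AF'),
]

by_id = op.attrgetter('id')
# groupby only merges adjacent equal keys, hence the sort
unique_data = [next(group) for _, group in it.groupby(sorted(data, key=by_id), key=by_id)]

for row in unique_data:
    print(row)
```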