Efficiently finding the most similar sets (Python, data structures)



Suppose I have several thousand Python sets in a list called my_sets. For every set "A" in my_sets, I want to find the five sets (sets "B") in my_sets that contain the highest percentage of set A's members.

I'm currently storing the data as sets and looping over them twice to calculate the overlap...

from random import randint
from heapq import heappush, heappop

my_sets = []
for i in range(20):
    new_set = set()
    for j in range(20):
        new_set.add(randint(0, 50))
    my_sets.append(new_set)

for i in range(len(my_sets)):
    neighbor_heap = []
    for j in range(len(my_sets)):
        if i == j:
            continue
        heappush(neighbor_heap, (1 / len(my_sets[i] & my_sets[j]), j))
    results = []
    while len(results) < 5:
        results.append(heappop(neighbor_heap)[1])
    print('Closest neighbors to set {} are sets {}'.format(i, results))

However, this is obviously an O(N**2) algorithm, so it blows up as my_sets grows. Is there a better data structure or algorithm that can be implemented in basic Python to solve this? There's no reason my_sets has to be a list, or that each individual set actually has to be a Python set. Any way of storing whether or not each set contains members from a finite list of options would be fine (e.g., a bunch of bools or bits in a standardized order). Building a more exotic data structure to save time is also fine.

(As some will no doubt want to point out, I could of course build this as a Numpy array where rows are sets and columns are elements, with each cell being 1/0 depending on whether that element is in that set. Then you'd just do some Numpy operations. That would undoubtedly be faster, but I wouldn't really have improved my algorithm at all, I'd just have shifted the complexity onto somebody else's optimized C/Fortran/whatever.)
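As an aside, the bools-or-bits idea above can be sketched in plain Python by packing each set into an integer bitmask, so an intersection becomes a single `&` plus a popcount. This is only a hedged sketch of the encoding (names are illustrative), not a full solution:

    from random import randint, seed

    seed(1)  # reproducible toy data
    my_sets = [{randint(0, 50) for _ in range(20)} for _ in range(20)]

    # Pack each set into one integer: bit k is set iff element k is in the set.
    masks = [sum(1 << v for v in s) for s in my_sets]

    def overlap(i, j):
        # Intersection size = population count of the AND of the two masks.
        # (On Python 3.10+, int.bit_count() would be faster than bin().count.)
        return bin(masks[i] & masks[j]).count('1')

    def top5(i):
        return sorted((j for j in range(len(masks)) if j != i),
                      key=lambda j: -overlap(i, j))[:5]

Note this keeps the same O(N**2) pairwise loop; it only replaces each set intersection with a few integer operations, so it shifts constants rather than complexity.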

EDIT: After thorough testing, the algorithm I originally posted runs in about 586 seconds under the agreed test conditions.

Could you:

  1. Invert the sets, generating for each set element a list (or set) of the sets which contain it.

    This is O(n*m) -- for n sets with, on average, m elements each.

  2. For each set S, consider its elements, and (using 1) construct a list (or heap) of the other sets, with the count of elements each shares with S -- and pick the "best" 5.

    Which is O(n*m*a), where a is the average number of sets each element belongs to.

How far this is from O(n*n) obviously depends on m and a.
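Step 1 above might look like the following minimal sketch (names are illustrative):

    from collections import defaultdict

    my_sets = [{1, 2}, {2, 3}, {3, 4}]

    # Invert: map each element to the indices of the sets containing it -- O(n*m).
    element_to_sets = defaultdict(set)
    for idx, s in enumerate(my_sets):
        for elem in s:
            element_to_sets[elem].add(idx)

    # For step 2, the overlap counts for set S are then the tally of how often
    # each other index appears across element_to_sets[e] for e in S.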

Edit: a naive implementation in Python runs in 103 seconds on my machine...

from collections import Counter
from random import randint
from time import clock

old_time = clock()

my_sets = []
for i in range(10000):
    new_set = set()
    for j in range(200):
        new_set.add(randint(0, 999))
    my_sets.append(new_set)

my_inv_sets = [[] for i in range(1000)]
for i in range(len(my_sets)):
    for j in range(1000):
        if j in my_sets[i]:
            my_inv_sets[j].append(i)

for i in range(len(my_sets)):
    counter = Counter()
    for j in my_sets[i]:
        counter.update(my_inv_sets[j])
    print(counter.most_common(6)[1:])

print(clock() - old_time)

You can reduce the number of passes through the list of sets by building a list of set indices associated with each value. Then, with one additional pass through the list of sets, you can determine which sets have the most values in common by tallying the indices of each value in a set.

In some cases this will improve performance, but depending on the density of the data it may not make a big difference.

Here is an example using defaultdict and Counter from the collections module.

from collections import defaultdict, Counter

def top5Matches(setList):
    valueSets = defaultdict(list)
    for i, aSet in enumerate(setList):
        for v in aSet: valueSets[v].append(i)
    results = []
    for i, aSet in enumerate(setList):
        counts = Counter()
        for v in aSet: counts.update(valueSets[v])
        counts[i] = 0
        top5 = [setList[j] for j, _ in counts.most_common(5)]
        results.append((aSet, top5))
    return results

In order to compare execution times, I took the liberty of embedding your solution in a function. I also had to fix the case where two sets have no intersection at all:

from heapq import heappush, heappop

def OPSolution(my_sets):
    results = []
    for i in range(len(my_sets)):
        neighbor_heap = []
        for j in range(len(my_sets)):
            if i == j: continue
            heappush(neighbor_heap, (1 / max(1, len(my_sets[i] & my_sets[j])), j))
        top5 = []
        while len(top5) < 5:
            j = heappop(neighbor_heap)[1]
            top5.append(my_sets[j])
        results.append((my_sets[i], top5))
    return results

Both functions return a list of tuples containing the original set and a list of the top 5 sets based on the number of common values.

The two functions produce the same results, although the top 5 sets may differ when a sixth (or further) set has the same intersection count.

from random import randrange

my_sets = [set(randrange(50) for _ in range(20)) for _ in range(20)]

opResults = OPSolution(my_sets)
print("OPSolution: (matching counts)")
for i, (aSet, top5) in enumerate(opResults):
    print(i, "Top 5:", [len(aSet & otherSet) for otherSet in top5])
print("")

print("top5Matches: (matching counts)")
t5mResults = top5Matches(my_sets)
for i, (aSet, top5) in enumerate(t5mResults):
    print(i, "Top 5:", [len(aSet & otherSet) for otherSet in top5])
print("")

Output:

OPSolution: (matching counts)
0 Top 5: [8, 7, 7, 7, 6]
1 Top 5: [7, 6, 6, 6, 6]
2 Top 5: [8, 7, 6, 6, 6]
3 Top 5: [8, 7, 7, 6, 6]
4 Top 5: [9, 8, 8, 8, 8]
5 Top 5: [7, 6, 6, 6, 6]
6 Top 5: [8, 8, 8, 7, 6]
7 Top 5: [8, 8, 7, 7, 7]
8 Top 5: [9, 7, 7, 7, 6]
9 Top 5: [8, 8, 8, 7, 7]
10 Top 5: [8, 8, 7, 7, 7]
11 Top 5: [8, 8, 7, 7, 6]
12 Top 5: [8, 7, 7, 7, 7]
13 Top 5: [8, 8, 8, 6, 6]
14 Top 5: [9, 8, 8, 6, 6]
15 Top 5: [6, 6, 5, 5, 5]
16 Top 5: [9, 7, 7, 6, 6]
17 Top 5: [8, 7, 7, 7, 7]
18 Top 5: [8, 8, 7, 6, 6]
19 Top 5: [7, 6, 6, 6, 6]
top5Matches: (matching counts)
0 Top 5: [8, 7, 7, 7, 6]
1 Top 5: [7, 6, 6, 6, 6]
2 Top 5: [8, 7, 6, 6, 6]
3 Top 5: [8, 7, 7, 6, 6]
4 Top 5: [9, 8, 8, 8, 8]
5 Top 5: [7, 6, 6, 6, 6]
6 Top 5: [8, 8, 8, 7, 6]
7 Top 5: [8, 8, 7, 7, 7]
8 Top 5: [9, 7, 7, 7, 6]
9 Top 5: [8, 8, 8, 7, 7]
10 Top 5: [8, 8, 7, 7, 7]
11 Top 5: [8, 8, 7, 7, 6]
12 Top 5: [8, 7, 7, 7, 7]
13 Top 5: [8, 8, 8, 6, 6]
14 Top 5: [9, 8, 8, 6, 6]
15 Top 5: [6, 6, 5, 5, 5]
16 Top 5: [9, 7, 7, 6, 6]
17 Top 5: [8, 7, 7, 7, 7]
18 Top 5: [8, 8, 7, 6, 6]
19 Top 5: [7, 6, 6, 6, 6]

Comparing execution times for various combinations of settings shows that indexing by value performs better on larger data sets (although in some cases not by much):

[EDIT] Added Chris Hall's solution to measure the speed improvement gained by restricting the functionality to sets of values from a contiguous range. I also had to embed it in a function and test that the results were the same. I realized in doing so that we essentially have the same approach. The main difference is that Chris uses a list instead of a dictionary, which constrains the values to a range() whose size must be supplied.

from collections import Counter

def chrisHall(my_sets, valueRange):
    results = []
    my_inv_sets = [[] for i in range(valueRange)]
    for i in range(len(my_sets)):
        for j in range(valueRange):
            if j in my_sets[i]:
                my_inv_sets[j].append(i)
    for i in range(len(my_sets)):
        counter = Counter()
        for j in my_sets[i]:
            counter.update(my_inv_sets[j])
        top5 = [my_sets[j] for j, _ in counter.most_common(6)[1:]]
        results.append((my_sets[i], top5))
    return results

The performance tests were also embedded in a function to avoid repeating boilerplate code:

from random import randrange
from timeit import timeit

def compareSolutions(title, setCount, setSize, valueRange, count=1):
    print("-------------------")
    print(title, setCount, "sets of", setSize, "elements in range 0 ...", valueRange)
    testSets = [set(randrange(valueRange) for _ in range(setSize)) for _ in range(setCount)]
    t = timeit(lambda: chrisHall(testSets, valueRange), number=count)
    print("chrisHall", t)
    t = timeit(lambda: top5Matches(testSets), number=count)
    print("top5Matches", t)
    t = timeit(lambda: OPSolution(testSets), number=count)
    print("OPSolution", t)

compareSolutions("SIMPLE TEST SET", 20, 20, 50, count=100)
compareSolutions("MORE SETS:", 2000, 20, 50)
compareSolutions("FEWER INTERSECTIONS:", 2000, 20, 500)
compareSolutions("LARGER SETS:", 2000, 200, 500)
compareSolutions("SETTING FROM COMMENTS:", 10000, 200, 1000)

Results:

-------------------
SIMPLE TEST SET 20 sets of 20 elements in range 0 ... 50
chrisHall 0.0766431910000005
top5Matches 0.07549873900000037
OPSolution 0.05089954700000021
-------------------
MORE SETS: 2000 sets of 20 elements in range 0 ... 50
chrisHall 1.274499733999999
top5Matches 1.2646208220000013
OPSolution 3.796912927000001
-------------------
FEWER INTERSECTIONS: 2000 sets of 20 elements in range 0 ... 500
chrisHall 0.4685694170000012
top5Matches 0.42844527900000173
OPSolution 3.5187148479999983
-------------------
LARGER SETS: 2000 sets of 200 elements in range 0 ... 500
chrisHall 8.538208329
top5Matches 8.51855685
OPSolution 23.192823251999997
-------------------
SETTING FROM COMMENTS: 10000 sets of 200 elements in range 0 ... 1000
chrisHall 190.55364428999997
top5Matches 176.066835327
OPSolution 829.934181724

I used set intersection to find the common elements, then sorted the other sets by the number of common elements they contain. Here is some code you might want to try:

from random import randint

my_sets = []
for i in range(20):
    new_set = set()
    for j in range(20):
        new_set.add(randint(0, 50))
    my_sets.append(new_set)

for i in range(len(my_sets)):
    temp = dict()
    for j in range(len(my_sets)):
        if i == j:
            continue
        common = my_sets[i] & my_sets[j]
        temp[j] = common
    five = sorted(temp.items(), reverse=True, key=lambda s: len(s[1]))[:5]
    five_indexes = [t[0] for t in five]
    print('Closest neighbors to set {} are sets {}'.format(i, five_indexes))

Simpler and faster (looks like 5-10% in the big case bounded by 10000, 200, 1000) using heapq.nlargest:

from heapq import nlargest

for i in range(len(my_sets)):
    results = nlargest(5,
                       (j for j in range(len(my_sets)) if j != i),
                       key=lambda j: len(my_sets[i] & my_sets[j]))
    print('Closest neighbors to set {} are sets {}'.format(i, results))

Instead of building a heap with N-1 elements, this builds one with just 5 elements.
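A quick self-contained check (with made-up toy sets) of what `nlargest` with a `key` does:

    from heapq import nlargest

    sets = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {9, 10}]
    target = {2, 3, 4, 5}

    # Scans the indices once, keeping only a 3-element heap internally,
    # rather than sorting all candidates.
    best = nlargest(3, range(len(sets)), key=lambda j: len(target & sets[j]))
    print(best)  # indices of the 3 sets sharing the most elements with target

Per the heapq documentation, nlargest(n, it, key=f) is equivalent to sorted(it, key=f, reverse=True)[:n], which is why it can be dropped in as a replacement for the sort-based versions above.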
