Is there a faster way to find co-occurrences of two elements in a list?



I have a list like this:

a = [
['a1', 'b2', 'c3'],
['c3', 'd4', 'a1'],
['b2', 'a1', 'e5'],
['d4', 'a1', 'b2'],
['c3', 'b2', 'a1']
]

Given x (e.g. "a1"), I have to find the co-occurrence of a1 with the other elements, sort the counts, and retrieve the top n elements (e.g. the top 2). My answer should be:

[
{'product_id': 'b2', 'count': 4}, 
{'product_id': 'c3', 'count': 3}, 
]

My current code is as follows:

import itertools

def compute(x):
    set_a = list(set(itertools.chain(*a)))
    count_dict = []
    for i in range(len(set_a)):
        count = 0
        for j in range(len(a)):
            if x == set_a[i]:
                continue
            # note: `if x and set_a[i] in a[j]` only tested membership of
            # set_a[i]; both items must be checked explicitly
            if x in a[j] and set_a[i] in a[j]:
                count += 1
        if count > 0:
            count_dict.append({'product_id': set_a[i], 'count': count})
    count_dict = sorted(count_dict, key=lambda k: k['count'], reverse=True)[:2]
    return count_dict

It works well for small inputs. However, my actual input has 70,000 unique items instead of 5 (a1 to e5), and 1.3 million rows instead of five, so the m x n scan becomes prohibitively expensive. Is there a faster way?

"更快";是一个非常笼统的术语。您需要更短的总处理时间,还是更短的请求响应时间?这只是针对一个请求,还是您想要一个处理重复输入的系统?

If what you need is the fastest response time for repeated inputs, then convert the whole list into a graph, with each element as a node and each edge weight being the number of co-occurrences of the two elements. You can build the graph in a single pass over the data. For each node, sort the edge list by weight. From there, each request is a simple lookup: return the weights of the node's top edges, which is one hash (a linear function) and two direct-access operations (base address + offset).


Update after OP's response

"最快响应";然后密封算法。您想要的是一个简单的dict,由每个节点键控。每个节点的值是相关元素及其计数的排序列表。

A graph package (such as networkx) would give you a good starting point, but it may not keep a node's edges in a fast-access form, nor sorted by weight. Instead, preprocess your data set. For each row, you have a list of related elements. Let's walk through the processing of one row of the data set; call the elements a5, b2, z1 and the dict d. Assume a5 and b2 are already in your dict.

Using `itertools`, iterate through the six ordered pairs:
(a5, b2):
d[a5][b2] += 1
(a5, z1):
d[a5][z1]  = 1  (creates a new entry under a5)
(b2, a5):
d[b2][a5] += 1
(b2, z1):
d[b2][z1]  = 1  (creates a new entry under b2)
(z1, a5):
d[z1] = {}      (creates a new z1 entry in d)
d[z1][a5]  = 1  (creates a new entry under z1)
(z1, b2):
d[z1][b2]  = 1  (creates a new entry under z1)

You will want to use a defaultdict to save yourself the trouble of detecting and initializing new entries.

Once all of that is processed, sort each of these sub-dicts by count, in descending order. That leaves each element with an ordered sequence. When you need the top n connected elements, you go straight to the dict and pull them off the front:

top = d[elem][:n]

Can you take the coding from there?
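Using the toy list `a` from the question, the walkthrough above can be sketched roughly like this (the names `d` and `ranked` are mine, not from the answer):

```python
import itertools
from collections import defaultdict

a = [
    ['a1', 'b2', 'c3'],
    ['c3', 'd4', 'a1'],
    ['b2', 'a1', 'e5'],
    ['d4', 'a1', 'b2'],
    ['c3', 'b2', 'a1'],
]

# One pass over the rows: d[x][y] = number of rows containing both x and y.
# permutations() yields the six ordered pairs per 3-element row, as above.
d = defaultdict(lambda: defaultdict(int))
for row in a:
    for x, y in itertools.permutations(row, 2):
        d[x][y] += 1

# Sort each node's neighbours by count, descending, once up front.
ranked = {x: sorted(nbrs.items(), key=lambda kv: kv[1], reverse=True)
          for x, nbrs in d.items()}

print(ranked['a1'][:2])  # [('b2', 4), ('c3', 3)]
```

After this preprocessing, every request is just the slice `ranked[elem][:n]`.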

As @Prune mentioned, you haven't said whether you want shorter processing time or shorter response time, so I will explain two ways of approaching this problem.

  1. Optimized code approach (shorter processing time)
from heapq import nlargest
from operator import itemgetter
import itertools

# say we have K THREADS
def compute(x, top_n=2):
    # first find the unique items and save them somewhere easily accessible
    set_a = list(set(itertools.chain(*a)))

    # first find which of your ROWS contain x
    selected_rows = []
    for i, row in enumerate(a):  # this whole loop can be parallelized
        if x in row:
            selected_rows.append(i)  # append the row index to selected_rows

    # time complexity so far is still O(M*N), but this part can be run in
    # parallel as well, since each of the M rows can be evaluated independently;
    # with K threads it takes (M/K)*N time

    count_dict = []

    # now the same thing you did earlier, but the inner loop scans fewer rows
    for val in set_a:
        if val == x:
            continue
        count = 0
        for ri in selected_rows:  # this whole part can be parallelized as well
            if val in a[ri]:
                count += 1
        count_dict.append({'product_id': val, 'count': count})
    # if selected_rows has size M in the worst case, and there are U unique
    # values, the complexity is (U/K)*(M/K)*N

    res = nlargest(top_n, count_dict, key=itemgetter('count'))
    return res

Let's compute the time complexity here. If we have K threads, then:

O((M/K)*N) + O((U/K)*(M/K)*N)

where

M---> Total rows
N---> Total Columns
U---> Unique Values
K---> number of threads
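As a rough sketch of the parallel row scan described in the comments above (the names `rows_containing` and `selected_rows_parallel` are mine; note that in CPython, threads only illustrate the work split here — a pure-Python CPU-bound loop like this needs worker processes to gain real speedup, because of the GIL):

```python
from concurrent.futures import ThreadPoolExecutor

a = [
    ['a1', 'b2', 'c3'],
    ['c3', 'd4', 'a1'],
    ['b2', 'a1', 'e5'],
    ['d4', 'a1', 'b2'],
    ['c3', 'b2', 'a1'],
]

def rows_containing(x, start, chunk):
    # global indices (offset by `start`) of rows in this chunk containing x
    return [start + i for i, row in enumerate(chunk) if x in row]

def selected_rows_parallel(x, rows, workers=2):
    # split the M rows into `workers` chunks and scan each chunk in a worker
    size = (len(rows) + workers - 1) // workers
    chunks = [(s, rows[s:s + size]) for s in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        futures = [ex.submit(rows_containing, x, s, c) for s, c in chunks]
        out = []
        for f in futures:  # futures are collected in chunk order
            out.extend(f.result())
    return out

print(selected_rows_parallel('a1', a))  # [0, 1, 2, 3, 4]
```

Swapping in `ProcessPoolExecutor` (with the usual `if __name__ == "__main__"` guard) gives the actual K-way speedup the complexity estimate assumes.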
  2. Graph approach suggested by Prune
# other approach
# adding to Prune's approach
import itertools

big_dictionary = {}
set_a = list(set(itertools.chain(*a)))
for x in set_a:
    big_dictionary[x] = []
    for y in set_a:
        count = 0
        if x == y:
            continue
        for arr in a:
            if (x in arr) and (y in arr):
                count += 1
        big_dictionary[x].append((y, count))
for x in big_dictionary:
    big_dictionary[x] = sorted(big_dictionary[x], key=lambda v: v[1], reverse=True)

Let's compute the time complexity of this. The one-time cost is:

O(U*U*M*N)

where

M---> Total rows
N---> Total Columns
U---> Unique Values

But once this big_dictionary has been computed, getting the top n values takes just one step. For example, if we want the top 3 values for a1:

result=big_dictionary['a1'][:3]

I followed the defaultdict approach suggested by @Prune above. This is the final code:

from collections import defaultdict
import numpy as np

def recommender(input_item, b_list, n):
    count = []
    top_items = []
    for x in b.keys():
        lst_2 = b[x]
        common_transactions = len(set(b_list) & set(lst_2))
        count.append(common_transactions)

    # top n+1 indices by count, descending; [1:] drops the input item itself
    top_ids = list((np.argsort(count)[:-n-2:-1])[1:])
    top_values_counts = [count[i] for i in top_ids]

    key_list = list(b.keys())
    for i, v in enumerate(top_ids):
        item_id = key_list[v]
        top_items.append({item_id: top_values_counts[i]})
    print(top_items)
    return top_items

a = [
    ['a1', 'b2', 'c3'],
    ['c3', 'd4', 'a1'],
    ['b2', 'a1', 'e5'],
    ['d4', 'a1', 'b2'],
    ['c3', 'b2', 'a1']
]
# b maps each item to the list of row indices it appears in
b = defaultdict(list)
for i, s in enumerate(a):
    for key in s:
        b[key].append(i)

input_item = str(input("Enter the item_id: "))
n = int(input("How many values to be retrieved? (eg: top 5, top 2, etc.): "))
top_items = recommender(input_item, b[input_item], n)

以下是"a1"前3名的输出:

[{'b2': 4}, {'c3': 3}, {'d4': 2}]

Thanks!!!
