Efficiently searching for a string/value in a huge dictionary in Python



Suppose I have a huge dictionary, e.g. huge_dict = {'Key1': 'ABC', 'Key 2': 'DEF', 'KEY 4': 'GHI', 'KEY5': 'IJK', ..., 'KEY N': 'XYZ'}.

Searching for a value in huge_dict takes a lot of time, so I am trying multiprocessing, since it can use the different cores. The steps I am trying are:

Step 1: split huge_dict into m smaller dictionaries
Step 2: create m processes in Python and pass the search value to each of them
Step 3: if any process finds the value, terminate all processes (see the sketch below)
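A minimal sketch of steps 1-3, with a plain substring test standing in for the fuzzy match; split_dict and search_chunk are illustrative names, not part of the original code:

import multiprocessing as mp

def split_dict(d, m):
    """Split d into m roughly equal-sized dictionaries."""
    items = list(d.items())
    size = -(-len(items) // m)  # ceiling division
    return [dict(items[i:i + size]) for i in range(0, len(items), size)]

def search_chunk(args):
    """Return (key, value) for the first value found in the sentence, else None."""
    chunk, sentence = args
    for key, value in chunk.items():
        if value in sentence:  # placeholder for the fuzzy search
            return key, value
    return None

if __name__ == '__main__':
    huge_dict = {'Key1': 'ABC', 'Key 2': 'DEF', 'KEY 4': 'GHI', 'KEY5': 'IJK'}
    sentence = "A REKDEFY, CI"
    chunks = split_dict(huge_dict, mp.cpu_count())
    with mp.Pool() as pool:
        # imap_unordered yields results as workers finish; stop at the first hit
        for hit in pool.imap_unordered(search_chunk, [(c, sentence) for c in chunks]):
            if hit is not None:
                print('found:', hit)
                pool.terminate()  # step 3: stop the remaining workers
                break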

Before this I load a deep learning / machine learning model. When I try to use multiprocessing, the model gets loaded many times as my processes spawn (a way to avoid this is sketched after the code below). Its output is huge_dict:

import multiprocessing as mp

huge_dict = {'Key1': 'ABC', 'Key 2': 'DEF', 'KEY 4': 'GHI', 'KEY5': 'IJK'}
# dict views cannot be sliced, so go through a list and use integer division
items = list(huge_dict.items())
d1 = dict(items[len(items) // 2:])
d2 = dict(items[:len(items) // 2])
# Is this an efficient way to do it? What if I split into n dicts?

def worker(sub_dict, search_value, num):
    """Process worker function."""
    print('Worker:', num)
    print(mp.cpu_count())
    return sub_dict

# Is this the correct way to use multiprocessing?

# Currently used, time-consuming logic:
def search(d, word):
    d = {'key1': "ASD", 'key2': "asd", 'key3': "fds", 'key4': "gfd", 'key5': "hjk"}
    for key in d:
        if d[key] in "search sentence or grp of words":  # doing fuzzy search here
            return d[key]
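One way to avoid loading the model once per child process is to keep the load inside the __main__ guard, so only the parent runs it; load_model and the two-process pool below are hypothetical stand-ins, not the original code:

import multiprocessing as mp

def load_model():
    # hypothetical stand-in for the expensive deep-learning model load
    print('loading the model once, in the parent process')
    return object()

def worker(sub_dict, search_value, num):
    # child processes receive only a dictionary chunk, never the model
    print('Worker:', num)
    return {k: v for k, v in sub_dict.items() if search_value in v}

if __name__ == '__main__':
    # With the "spawn" start method (the default on Windows/macOS) every child
    # re-imports this module, so a top-level load_model() call would run in
    # every worker; inside this guard it runs only once.
    model = load_model()
    huge_dict = {'Key1': 'ABC', 'Key 2': 'DEF', 'KEY 4': 'GHI', 'KEY5': 'IJK'}
    items = list(huge_dict.items())
    d1, d2 = dict(items[:2]), dict(items[2:])
    with mp.Pool(2) as pool:
        print(pool.starmap(worker, [(d1, 'DE', 1), (d2, 'DE', 2)]))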

The data is in the following format:

huge_dict={"10001": ["sentence1", "sentence2","sentence3","sentence4"],
"4001": ["sentence1", "sentence2"], 
"35432": ["sentence1", "sentence2","sentence3","sentence4", ... "sentence N"],  
.....
"N":["N no of sentences"]    }

I assume you want to check whether any of the huge_dict values occurs as a substring (not only as a whole word) in a given string.
Try whether taking the set.intersection of huge_dict.values() with all substrings of the given string is faster:

def sub(s):
    """Return all substrings of a given string."""
    return [s[i:j + 1] for i in range(len(s)) for j in range(i, len(s))]

huge_dict = {'Key1': 'ABC', 'Key 2': 'DEF', 'KEY 4': 'GHI', 'KEY5': 'IJK'}
s = "A REKDEFY, CI"
huge_values = set(huge_dict.values())
print(huge_values.intersection(sub(s)))
# {'DEF'}
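Note that sub(s) materialises all O(len(s)**2) substrings; for long inputs a direct scan over the values may be cheaper (a sketch, not part of the original answer):

huge_dict = {'Key1': 'ABC', 'Key 2': 'DEF', 'KEY 4': 'GHI', 'KEY5': 'IJK'}
s = "A REKDEFY, CI"
# test each value directly against the string instead of building every substring
matches = {v for v in huge_dict.values() if v in s}
print(matches)  # {'DEF'}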
