How to make gensim WMD similarity run faster in Python with multiprocessing



I am trying to make gensim's WMD similarity run faster. Normally, this is what the documentation shows. Example corpus:

my_corpus = ["Human machine interface for lab abc computer applications",
>>>              "A survey of user opinion of computer system response time",
>>>              "The EPS user interface management system",
>>>              "System and human system engineering testing of EPS",
>>>              "Relation of user perceived response time to error measurement",
>>>              "The generation of random binary unordered trees",
>>>              "The intersection graph of paths in trees",
>>>              "Graph minors IV Widths of trees and well quasi ordering",
>>>              "Graph minors A survey"]
my_query = 'Human and artificial intelligence software programs'
my_tokenized_query =['human','artificial','intelligence','software','programs']
model = a trained word2Vec model on about 100,000 documents similar to my_corpus.
model = Word2Vec.load(word2vec_model)

from gensim.models import Word2Vec
from gensim.similarities import WmdSimilarity

def init_instance(my_corpus, model, num_best):
    instance = WmdSimilarity(my_corpus, model, num_best=num_best)
    return instance

instance = init_instance(my_corpus, model, 1)
instance[my_tokenized_query]

The best-matching document is "Human machine interface for lab abc computer applications", which is great.

However, the instance lookup above takes a very long time. So I thought of splitting the corpus into N parts, running WMD with num_best = 1 on each part, and at the end the part with the highest score should be the most similar.

import operator
import gensim
from multiprocessing import Process, Queue, Manager

def main(my_query, global_jobs, process_tmp):
    process_query = gensim.utils.simple_preprocess(my_query)

    def worker(num, process_query, return_dict):
        instance = init_instance(
            my_corpus[num*chunk+1:num*chunk+chunk], model, 1)
        x = instance[process_query][0][0]
        y = instance[process_query][0][1]
        return_dict[x] = y

    manager = Manager()
    return_dict = manager.dict()
    for num in range(num_workers):
        process_tmp = Process(target=worker, args=(num, process_query, return_dict))
        global_jobs.append(process_tmp)
        process_tmp.start()
    for proc in global_jobs:
        proc.join()

    return_dict = dict(return_dict)
    ind = max(return_dict.iteritems(), key=operator.itemgetter(1))[0]
    print my_corpus[ind]

>>> "Graph minors A survey"

The problem I'm having is that, even though it outputs something and takes the maximum similarity over all the parts, it doesn't give me a good match for the query from my corpus.

Am I doing something wrong?

Note: chunk is a static variable: e.g. chunk = 600 ...

If you define chunk statically, then you have to compute num_workers:

10001 / 600 = 16.6683333333 = 17 num_workers

It is common to use no more processes than you have cores.
If you have 17 cores, that's fine.
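
As a minimal sketch of that static-chunk variant (10001 and 600 are just the example numbers from the calculation above):

import math

chunk = 600                                  # example static chunk size
corpus_len = 10001                           # example corpus length
num_workers = math.ceil(corpus_len / chunk)  # 16.67 -> 17 workers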

The number of cores is fixed, so instead you should do:

import os

num_workers = os.cpu_count()
chunk = chunksize(my_corpus, num_workers)

  1. The result isn't the same, so change to:

    #process_query = gensim.utils.simple_preprocess(my_query)
    process_query = my_tokenized_query
    
  2. All worker results are indexed 0..n.
    Therefore, return_dict[x] can be overwritten by a later worker using the same index but a lower value. The index in return_dict is not the same as the index in my_corpus. Change to:

    #return_dict[x] = y
    return_dict[ (num * chunk)+x ] = y
    
    
  3. Using +1 in the chunk-size calculation skips the first document.
    I don't know how you compute chunk; consider this example (a combined sketch of points 1-3 follows this list):

    def chunksize(iterable, num_workers):
        c_size, extra = divmod(len(iterable), num_workers)
        if extra:
            c_size += 1
        if len(iterable) == 0:
            c_size = 0
        return c_size

    # Usage
    chunk = chunksize(my_corpus, num_workers)
    ...
    #my_corpus_chunk = my_corpus[num*chunk+1:num*chunk+chunk]
    my_corpus_chunk = my_corpus[num * chunk:(num+1) * chunk]
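
Putting points 1-3 together, a corrected worker could look roughly like this (a minimal sketch, reusing my_corpus, model, chunk and init_instance from the question):

def worker(num, process_query, return_dict):
    # slice without the +1 off-by-one from point 3
    my_corpus_chunk = my_corpus[num * chunk:(num + 1) * chunk]
    instance = init_instance(my_corpus_chunk, model, 1)
    # x is the index inside this chunk, y its similarity score
    x, y = instance[process_query][0]
    # shift the chunk-local index back to a my_corpus index (point 2)
    return_dict[(num * chunk) + x] = y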
    

Result: 10 cycles, Tuple = (Index worker num=0, Index worker num=1)

With multiprocessing, chunk=5:
02,09:(3,8), 01,03:(3,5):
System and human system engineering testing of EPS
04,06,07:(0,8), 05,08:(0,5), 10:(0,7):
Human machine interface for lab abc computer applications

Without multiprocessing, with chunk=5:
01:(3, 6), 02:(3, 5), 05,08,10:(3, 7), 07,09:(3, 8):
System and human system engineering testing of EPS
03,04,06:(0, 5):
Human machine interface for lab abc computer applications

Without multiprocessing, without chunking:
01,02,03,04,06,07,08:(3,-1):
System and human system engineering testing of EPS
05,09,10:(0,-1):
Human machine interface for lab abc computer applications

Tested with Python: 3.4.2

Using Python 2.7: I use threads instead of multiprocessing. In the WMD-instance creation thread, I do something like this:

wmd_instances = []
if wmd_instance_count > len(wmd_corpus):
    wmd_instance_count = len(wmd_corpus)
chunk_size = int(len(wmd_corpus) / wmd_instance_count)
for i in range(0, wmd_instance_count):
    if i == wmd_instance_count - 1:
        wmd_instance = WmdSimilarity(wmd_corpus[i*chunk_size:], wmd_model, num_results)
    else:
        wmd_instance = WmdSimilarity(wmd_corpus[i*chunk_size:(i+1)*chunk_size], wmd_model, num_results)
    wmd_instances.append(wmd_instance)
wmd_logic.setWMDInstances(wmd_instances, chunk_size)

"wmd_instance_count"是用于搜索的线程数。我还记得块大小。然后,当我想搜索某些东西时,我开始搜索"wmd_instance_count"线程,它们返回找到的模拟人生:

def perform_query_for_job_on_instance(wmd_logic, wmd_instances, query, jobID, instance):
    wmd_instance = wmd_instances[instance]
    sims = wmd_instance[query]
    wmd_logic.set_mt_thread_result(jobID, instance, sims)
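
The thread start/join code itself isn't shown in the answer; a minimal sketch of how it could look with the standard threading module (jobID is whatever job identifier wmd_logic expects):

import threading

threads = []
for i in range(wmd_instance_count):
    t = threading.Thread(target=perform_query_for_job_on_instance,
                         args=(wmd_logic, wmd_instances, query, jobID, i))
    threads.append(t)
    t.start()
for t in threads:
    t.join()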

"wmd_logic"是一个类的实例,然后执行以下操作:

def set_mt_thread_result(self, jobID, instance, sims):
    res = []
    #
    # We need to scale the found ids back to our complete corpus size...
    #
    for sim in sims:
        aSim = (int(sim[0] + (instance * self.chunk_size)), sim[1])
        res.append(aSim)

I know the code isn't pretty, but it works. It uses "wmd_instance_count" threads to find results, I aggregate them, and then pick the top 10 or something like that.
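
The aggregation step is only described, not shown; one possible sketch, assuming a hypothetical all_sims list that collects the rescaled tuples from every thread:

# all_sims: rescaled (corpus_index, similarity) tuples gathered from all threads
top_10 = sorted(all_sims, key=lambda s: s[1], reverse=True)[:10]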

Hope this helps.
