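I have a very large list, jsn, containing many string elements that may have duplicate values. I need to check each element for similarity and add the indices of the duplicate items to a dubs list, so that I can remove those items from jsn.
Because of the size of jsn, I decided to use threads in my code to speed up the loop and cut the waiting time.
But the threads/processes do not work the way I expected.
The code below, with threads, shows no change in performance, and the dubs list is empty after the threads have been joined.
I tried .join() without success: I still got an empty dubs list and no change in performance.
Main problem -> the dubs list is empty before the duplicates start being deleted.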
from threading import Thread
from multiprocessing import Process
from difflib import SequenceMatcher

# Searching for duplicates in array
def finddubs(jsn, dubs, a):
    for b in range(len(jsn)):
        if ((jsn[a] == jsn[b]) or (SequenceMatcher(None, jsn[a], jsn[b]).ratio() > 40)):
            dubs.append(b)  # add duplicate list element keys to duplicates array

# Start threading
threads = []
for a in range(len(jsn)):
    t = Thread(target=finddubs, args=(jsn, dubs, a))
    threads.append(t)
    t.start()
for thr in threads:
    thr.join()

# Delete duplicate list items
for d in dubs:
    k = int(d)
    del jsn[k]
The code without threads works.
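If you want to speed up the computation, you need to use multiprocessing instead of threading. Please read about the GIL (Global Interpreter Lock) for details: in CPython it lets only one thread execute Python bytecode at a time, so threads do not speed up CPU-bound work such as string comparison.
An example of how multiprocessing can be used for this task: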
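import multiprocessing
from difflib import SequenceMatcher
from itertools import combinations
from uuid import uuid4

# ratio() returns a float in [0, 1], so the "> 40" test could never be true.
# 0.9 is an assumed threshold; tune it for your data.
THRESHOLD = 0.9

def compare_strings(a: int, b: int, s1: str, s2: str):
    # The strings are passed in explicitly so that every worker compares the
    # same data, whichever process start method is in use.
    if s1 == s2 or SequenceMatcher(None, s1, s2).ratio() > THRESHOLD:
        return a, b

if __name__ == "__main__":  # needed where worker processes are spawned (e.g. Windows)
    # Let's generate a large list with random data
    # where we have few duplicates: "abc" indices: 0, 1_001 ; "b" - indices 1_002, 1_003
    jsn = ['abc'] + [str(uuid4()) for _ in range(1_000)] + ['abc', 'b', 'b']

    # now we are comparing all possible pairs using multiprocessing
    pairs = [(i, j, jsn[i], jsn[j]) for i, j in combinations(range(len(jsn)), 2)]
    with multiprocessing.Pool(processes=10) as pool:
        results = pool.starmap(compare_strings, pairs)

    for result in results:
        if result is not None:
            a, b = result
            print(f"Duplicated pair: {a} {b} {jsn[b]}")
    # delete duplicates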
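The `# delete duplicates` step is left open above. One possible way to fill it in, as a minimal sketch: collect the second index of every reported pair and build a new list without those positions, which sidesteps the index shifting that `del` causes while a list shrinks. The helper name `drop_duplicates` and the sample data are hypothetical, not part of the answer; `results` is assumed to have the shape returned by `pool.starmap` above (an `(a, b)` tuple or `None` per pair).

```python
def drop_duplicates(items, results):
    """Return a copy of items without the positions reported as duplicates."""
    # Keep the first element of each pair and drop the second one.
    to_drop = {pair[1] for pair in results if pair is not None}
    return [item for index, item in enumerate(items) if index not in to_drop]


# Hypothetical sample data in the same shape as the pool.starmap output.
sample = ["abc", "x", "abc", "b", "b"]
sample_results = [(0, 2), None, (3, 4), None]
deduped = drop_duplicates(sample, sample_results)
print(deduped)
```

Building a new list is usually simpler than deleting in place; if in-place deletion is required, deleting from the highest index downwards avoids the same shifting problem.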
A modified version of your code that should work:
from difflib import SequenceMatcher
from threading import Thread
from uuid import uuid4

# ratio() returns a float in [0, 1], so the "> 40" test could never be true.
# 0.9 is an assumed threshold; tune it for your data.
THRESHOLD = 0.9

# Let's generate a list with random data
# where we have few duplicates: "abc" indices: 0, 101 ; "b" - indices 102, 103
jsn = ["abc"] + [str(uuid4()) for _ in range(100)] + ["abc", "b", "b"]
dubs = []

# Searching for duplicates in array
def finddubs(jsn, dubs, a):
    for b in range(a + 1, len(jsn)):
        if (jsn[a] == jsn[b]) or (SequenceMatcher(None, jsn[a], jsn[b]).ratio() > THRESHOLD):
            print(a, b)
            dubs.append(b)  # add duplicate list element keys to duplicates array

# Start threading
threads = []
for a in range(len(jsn)):
    t = Thread(target=finddubs, args=(jsn, dubs, a))
    threads.append(t)
    t.start()
for thr in threads:
    thr.join()

# Delete duplicate list items
print(dubs)
# Delete from the highest index down so earlier deletions do not shift later ones;
# set() drops indices that were reported more than once.
for k in sorted(set(dubs), reverse=True):
    del jsn[k]
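Separately from threads or processes, the comparison itself can often be made cheaper. The sketch below is one possible approach, not something from the answers above: it reuses a single `SequenceMatcher`, sets the outer string once with `set_seq2` (difflib caches information about the second sequence), and checks the cheap upper bounds `real_quick_ratio()` and `quick_ratio()` before calling `ratio()`. The function name `find_duplicate_indices`, the 0.9 threshold, and the sample strings are assumptions for illustration.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.9  # assumed cut-off; ratio() is always between 0.0 and 1.0


def find_duplicate_indices(items, threshold=THRESHOLD):
    """Return sorted indices of items that equal or resemble an earlier kept item."""
    duplicates = set()
    matcher = SequenceMatcher(None)
    for a in range(len(items)):
        if a in duplicates:
            continue  # already marked for removal, no need to compare it again
        matcher.set_seq2(items[a])  # cached once per outer item
        for b in range(a + 1, len(items)):
            if b in duplicates:
                continue
            if items[a] == items[b]:
                duplicates.add(b)
                continue
            # Note: items[b] is seq1 here, the reverse of the code above;
            # ratio() can depend slightly on argument order.
            matcher.set_seq1(items[b])
            # Both quick ratios are upper bounds on ratio(), so a failure
            # here means ratio() cannot pass either.
            if (matcher.real_quick_ratio() > threshold
                    and matcher.quick_ratio() > threshold
                    and matcher.ratio() > threshold):
                duplicates.add(b)
    return sorted(duplicates)


# Hypothetical sample data.
items = ["abc", "hello world", "abc", "hello worle", "zzz"]
dup_indices = find_duplicate_indices(items)
kept = [s for i, s in enumerate(items) if i not in set(dup_indices)]
print(dup_indices, kept)
```

The work is still quadratic in the number of strings, so for very large lists this would typically be combined with multiprocessing rather than replace it.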