I am trying to use BERT to generate embeddings for a list of words.
import tensorflow_hub as hub

preprocess = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
small_bert = 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/1'
bert_pre = hub.KerasLayer(preprocess)
bert_model = hub.KerasLayer(small_bert)

def word_embedding(sent):
    # Accept either a single string or a list of strings
    if isinstance(sent, str):
        sent = [sent]
    return bert_model(bert_pre(sent))['pooled_output']
The code above defines a function that produces an embedding from a string or a list of strings.
It works fine, and I can obtain embeddings like the one below:
word_embedding("This is a dog")
<tf.Tensor: shape=(1, 128), dtype=float32, numpy=
array([[-0.9999982 , 0.09135217, -0.9993329 , 0.96194535, -0.9991869 ,
0.07579318, -0.9892281 , -0.9668794 , 0.06745824, 0.07896714,
-0.81101316, 0.01557093, -0.11514608, 1. , -0.9598647 ,
-0.83923537, 0.8639652 , 0.04229677, -0.9274309 , 0.87699145,
0.9675325 , 0.02446483, 0.96745443, 0.9137065 , -0.99994516,
-0.00484818, -0.9999251 , 0.97071564, 0.9577635 , 0.12064736,
0.13986832, 0.01977904, -0.99208915, 0.1142559 , 0.98749965,
0.9999112 , -0.93131316, -0.05863096, 0.9166601 , -0.9995932 ,
0.92014617, 0.94889516, -0.9995392 , 0.9891672 , -0.9999985 ,
-0.15958041, -0.99989873, 0.9984788 , 0.9674587 , 0.9849434 ,
0.9884202 , -0.5348996 , 0.07992988, 0.9977897 , 0.99813277,
0.9999679 , -0.99952024, -0.970687 , 0.9040054 , -0.9458808 ,
0.01751512, 0.36849502, 0.3939357 , 0.9101836 , -0.15718344,
-0.99999946, -0.44872832, -0.6077106 , 0.96371555, 0.5564301 ,
0.9982054 , -0.09481359, -0.9996993 , 0.03875534, 0.65399134,
-0.9902064 , 0.66297245, 0.10515413, -0.97484773, 0.18679208,
-0.5837009 , -0.12993163, -0.96478623, -0.99981767, 0.99985546,
-0.98870945, 0.8561884 , -0.58723676, -0.68301636, 0.67417735,
-0.9766185 , 0.9956491 , -0.88204795, 0.99866074, 0.2829505 ,
0.42085564, -0.9546872 , -0.8894943 , -0.9999068 , -0.97645766,
-0.99447215, 0.97132486, -0.9995873 , -0.90443873, -0.9787839 ,
-0.6670069 , -0.9991659 , -0.9913582 , -0.19619215, 0.9979996 ,
0.99873877, 0.94075304, -0.76902175, 0.9997495 , -1. ,
0.06643485, 0.8816498 , 0.83833504, 0.09686996, -0.9954674 ,
0.22044522, -0.99998134, -0.5231443 , 0.902108 , -0.9998227 ,
0.97717 , 0.9373147 , 0.9990008 ]], dtype=float32)>
So now I can get the embeddings for my target word list by iterating over it. But I think that is quite inefficient, so I tried to parallelize it with the following code.
from multiprocessing import Queue, Process
from queue import Empty  # multiprocessing.Queue.get raises queue.Empty on timeout

pending = Queue()
for x in list(model_pdf['JCOM']):
    pending.put(x)

def job():
    print("Started")
    while pending.qsize() > 0:
        try:
            x = pending.get(block=True, timeout=2)
            print("Can I get some element?")
            word_embedding(x)
            print("Can I finish some embedding?")
        except Empty:
            return 1

job_list = []
for x in range(60):
    job_list.append(Process(target=job))
for j in job_list:
    j.start()
for j in job_list:
    j.join()
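To make sure the worker/queue pattern itself is sound, here is a minimal, TensorFlow-free sketch of the same queue-draining scheme, with a trivial `square` function standing in for the expensive `word_embedding` call (names here are illustrative, not from my real code):

```python
from multiprocessing import Process, Queue
from queue import Empty  # raised by Queue.get() when the timeout expires

def square(x):
    return x * x  # stand-in for the expensive word_embedding(x) call

def worker(pending, done):
    # Drain the pending queue until a timed get() raises Empty.
    while True:
        try:
            x = pending.get(block=True, timeout=2)
        except Empty:
            return  # queue drained: exit cleanly
        done.put(square(x))

if __name__ == "__main__":
    pending, done = Queue(), Queue()
    for x in range(8):
        pending.put(x)
    workers = [Process(target=worker, args=(pending, done)) for _ in range(4)]
    for w in workers:
        w.start()
    # Collect all results before joining, so no child blocks on a full queue.
    results = sorted(done.get() for _ in range(8))
    for w in workers:
        w.join()
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

This version completes quickly, which suggests the hang is not in the queue logic itself but in running the model inside the forked workers.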
However, after switching to this parallelized version, I can see from the print statements that the workers do fetch elements from the pending queue (the line "Can I get some element?" is printed), but they then get stuck indefinitely at the word-embedding step, so I think something is wrong there. A single embedding call takes only 1-3 seconds, and the processes are not using any cores (checked via htop).
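One thing I suspect (an assumption I have not yet verified) is the process start method: on Linux, multiprocessing defaults to fork, and forking a parent that has already initialized a threaded library such as TensorFlow can leave the children deadlocked. A quick sketch for inspecting the start method and building a spawn-based context instead:

```python
import multiprocessing as mp

# On Linux the default start method is usually "fork"; forked children inherit
# the parent's memory, including any locks TensorFlow holds at fork time.
print(mp.get_start_method())  # typically "fork" on Linux

# A "spawn" context starts workers as fresh interpreters instead; queues and
# targets must then be passed explicitly rather than inherited as globals.
ctx = mp.get_context("spawn")
q = ctx.Queue()
q.put(1)
print(q.get())
```

With spawn, each worker would re-import TensorFlow and reload the model itself, so the globals in my current code would need to become arguments.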
How can I fix this?
Here is my Linux OS setup:
PRETTY_NAME="Ubuntu 22.04 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
with 64 cores and 64 GB of RAM.
A possibly related discussion: https://github.com/huggingface/transformers/issues/15038