I am trying to use BERT to generate embeddings for a list of words.
import tensorflow_hub as hub

preprocess = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
small_bert = 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/1'
bert_pre = hub.KerasLayer(preprocess)
bert_model = hub.KerasLayer(small_bert)

def word_embedding(sent):
    # Accept either a single string or a list of strings
    if isinstance(sent, str):
        sent = [sent]
    return bert_model(bert_pre(sent))['pooled_output']
The code above defines a function that produces an embedding from a string or a list of strings.
It works fine, and I can obtain embeddings like the one below:
word_embedding("This is a dog")
<tf.Tensor: shape=(1, 128), dtype=float32, numpy=
array([[-0.9999982 , 0.09135217, -0.9993329 , 0.96194535, -0.9991869 ,
0.07579318, -0.9892281 , -0.9668794 , 0.06745824, 0.07896714,
-0.81101316, 0.01557093, -0.11514608, 1. , -0.9598647 ,
-0.83923537, 0.8639652 , 0.04229677, -0.9274309 , 0.87699145,
0.9675325 , 0.02446483, 0.96745443, 0.9137065 , -0.99994516,
-0.00484818, -0.9999251 , 0.97071564, 0.9577635 , 0.12064736,
0.13986832, 0.01977904, -0.99208915, 0.1142559 , 0.98749965,
0.9999112 , -0.93131316, -0.05863096, 0.9166601 , -0.9995932 ,
0.92014617, 0.94889516, -0.9995392 , 0.9891672 , -0.9999985 ,
-0.15958041, -0.99989873, 0.9984788 , 0.9674587 , 0.9849434 ,
0.9884202 , -0.5348996 , 0.07992988, 0.9977897 , 0.99813277,
0.9999679 , -0.99952024, -0.970687 , 0.9040054 , -0.9458808 ,
0.01751512, 0.36849502, 0.3939357 , 0.9101836 , -0.15718344,
-0.99999946, -0.44872832, -0.6077106 , 0.96371555, 0.5564301 ,
0.9982054 , -0.09481359, -0.9996993 , 0.03875534, 0.65399134,
-0.9902064 , 0.66297245, 0.10515413, -0.97484773, 0.18679208,
-0.5837009 , -0.12993163, -0.96478623, -0.99981767, 0.99985546,
-0.98870945, 0.8561884 , -0.58723676, -0.68301636, 0.67417735,
-0.9766185 , 0.9956491 , -0.88204795, 0.99866074, 0.2829505 ,
0.42085564, -0.9546872 , -0.8894943 , -0.9999068 , -0.97645766,
-0.99447215, 0.97132486, -0.9995873 , -0.90443873, -0.9787839 ,
-0.6670069 , -0.9991659 , -0.9913582 , -0.19619215, 0.9979996 ,
0.99873877, 0.94075304, -0.76902175, 0.9997495 , -1. ,
0.06643485, 0.8816498 , 0.83833504, 0.09686996, -0.9954674 ,
0.22044522, -0.99998134, -0.5231443 , 0.902108 , -0.9998227 ,
0.97717 , 0.9373147 , 0.9990008 ]], dtype=float32)>
So now I can get the embeddings for my target word list by iterating over it. But I think that is quite inefficient, so I tried to parallelize it with the following code.
from multiprocessing import Queue, Process
from queue import Empty  # multiprocessing.Queue.get raises queue.Empty on timeout

pending = Queue()
for x in list(model_pdf['JCOM']):
    pending.put(x)

def job():
    print("Started")
    while pending.qsize() > 0:
        try:
            x = pending.get(block=True, timeout=2)
            print("Can I get some element?")
            word_embedding(x)
            print("Can I finish some embedding?")
        except Empty:
            return 1

job_list = []
for x in range(60):
    job_list.append(Process(target=job))
for j in job_list:
    j.start()
for j in job_list:
    j.join()
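To make sure the worker/queue pattern itself is sound, here is a minimal, TensorFlow-free sketch of the same queue-draining scheme, with a trivial `square` function standing in for the expensive `word_embedding` call (names here are illustrative, not from my real code):

```python
from multiprocessing import Process, Queue
from queue import Empty  # raised by Queue.get() when the timeout expires

def square(x):
    return x * x  # stand-in for the expensive word_embedding(x) call

def worker(pending, done):
    # Drain the pending queue until a timed get() raises Empty.
    while True:
        try:
            x = pending.get(block=True, timeout=2)
        except Empty:
            return  # queue drained: exit cleanly
        done.put(square(x))

if __name__ == "__main__":
    pending, done = Queue(), Queue()
    for x in range(8):
        pending.put(x)
    workers = [Process(target=worker, args=(pending, done)) for _ in range(4)]
    for w in workers:
        w.start()
    # Collect all results before joining, so no child blocks on a full queue.
    results = sorted(done.get() for _ in range(8))
    for w in workers:
        w.join()
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

This version completes quickly, which suggests the hang is not in the queue logic itself but in running the model inside the forked workers.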
However, after switching to this parallelized version, I can see from the print statements that the workers do fetch elements from the pending queue (the line "Can I get some element?" is printed), but they then get stuck indefinitely at the word-embedding step, so I think something is wrong there. A single embedding call takes only 1-3 seconds, and the processes are not using any cores (checked via htop).
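One thing I suspect (an assumption I have not yet verified) is the process start method: on Linux, multiprocessing defaults to fork, and forking a parent that has already initialized a threaded library such as TensorFlow can leave the children deadlocked. A quick sketch for inspecting the start method and building a spawn-based context instead:

```python
import multiprocessing as mp

# On Linux the default start method is usually "fork"; forked children inherit
# the parent's memory, including any locks TensorFlow holds at fork time.
print(mp.get_start_method())  # typically "fork" on Linux

# A "spawn" context starts workers as fresh interpreters instead; queues and
# targets must then be passed explicitly rather than inherited as globals.
ctx = mp.get_context("spawn")
q = ctx.Queue()
q.put(1)
print(q.get())
```

With spawn, each worker would re-import TensorFlow and reload the model itself, so the globals in my current code would need to become arguments.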
How can I fix this?
Here is my Linux OS setup:
PRETTY_NAME="Ubuntu 22.04 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
with 64 cores and 64 GB of RAM.
A possibly related discussion: https://github.com/huggingface/transformers/issues/15038