Suppose I have a vocabulary list and a DataFrame. The DataFrame contains tokenized sentences.
vocab_list = ['aaa',....,'zzz']
DataFrame:
tokenized_sentenced
========
[lorem , ipsum]
[it , is, a, long, established, fact ]
[various, versions, have, evolved]
[the, generated, lorem, ipsum]
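How can I store, in a new column of the DataFrame, the list of tokens from each row that are not found in the vocabulary list? The result should look like this: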
tokenized_sentenced token_not_found_in_vocab
=========================================|===========================
[lorem , ipsum] |[lorem, ipsum]
[it , is, a, long, established, fact ] |[]
[various, versions, have, evolved, toq] |[toq]
[the, generated, lorem, ipsum] |[lorem, ipsum]
I tried this:
for i in range(0,1005):
    for j in range(0, len(df['tokenized_sentenced'][i])-1):
        if (df['tokenized_sentenced'][i][j] not in vocab_list):
            df['token_not_found_in_vocab'][i].append(df['tokenized_sentenced'][i][j])
But I got this error:
AttributeError: 'str' object has no attribute 'append'
The following should solve your problem in one line:
df['token_not_found_in_vocab'] = df['tokenized_sentenced'].apply(lambda x: list(set(x).difference(vocab_list)))
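Note that a set difference does not keep the original token order and collapses repeated tokens, so the output may not match the expected result exactly. If order matters, a list comprehension with a set lookup is one option. The sketch below assumes each cell of tokenized_sentenced holds a real Python list rather than a string. The sample vocab_list and DataFrame contents are made up for illustration, and tokens_not_in_vocab is a hypothetical helper name. The AttributeError in the question suggests the cells of token_not_found_in_vocab were strings; assigning the whole column at once avoids appending to existing cells.

```python
import pandas as pd

# Hypothetical sample data mirroring the question (not the real vocabulary).
vocab_list = ["it", "is", "a", "long", "established", "fact",
              "various", "versions", "have", "evolved", "the", "generated"]

df = pd.DataFrame({
    "tokenized_sentenced": [
        ["lorem", "ipsum"],
        ["it", "is", "a", "long", "established", "fact"],
        ["various", "versions", "have", "evolved", "toq"],
        ["the", "generated", "lorem", "ipsum"],
    ]
})

# Convert the vocabulary to a set once so each membership test is O(1).
vocab_set = set(vocab_list)

def tokens_not_in_vocab(tokens):
    """Return the tokens missing from the vocabulary, in their original order."""
    return [t for t in tokens if t not in vocab_set]

# Build the new column in one assignment; each cell becomes a fresh list.
df["token_not_found_in_vocab"] = df["tokenized_sentenced"].apply(tokens_not_in_vocab)
```

Unlike the loop in the question, this does not depend on a hard-coded row count and does not skip the last token of each sentence.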