Compare lists in a DataFrame column against another list and store the items that are not found in a new column



Suppose I have a vocabulary list and a DataFrame. The DataFrame contains tokenized sentences.

vocab_list = ['aaa',....,'zzz']

DataFrame:

tokenized_sentenced
========
[lorem , ipsum]
[it , is, a, long, established, fact ]
[various, versions, have, evolved, toq]
[the, generated, lorem, ipsum]

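How can I store, for each row, the list of tokens that are not found in the vocabulary list in a new column of the DataFrame? The result should look like this: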

tokenized_sentenced                        token_not_found_in_vocab
=========================================|===========================
[lorem , ipsum]                          |[lorem, ipsum]
[it , is, a, long, established, fact ]   |[]
[various, versions, have, evolved, toq]  |[toq]
[the, generated, lorem, ipsum]           |[lorem, ipsum]

I tried this:

for i in range(0,1005):
    for j in range(0, len(df['tokenized_sentenced'][i])-1):
        if (df['tokenized_sentenced'][i][j] not in vocab_list):
            df['token_not_found_in_vocab'][i].append(df['tokenized_sentenced'][i][j])

But I got this error:

AttributeError: 'str' object has no attribute 'append'

The following one-liner should solve your problem:

df['token_not_found_in_vocab'] = df['tokenized_sentenced'].apply(lambda x: list(set(x).difference(vocab_list)))
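Note that the set-based one-liner does not preserve token order and drops repeated tokens. If either matters, a list comprehension is one possible alternative. The sketch below is only an illustration: the sample vocab_list and DataFrame contents are made-up stand-ins for the real data, and the helper name tokens_not_in_vocab is hypothetical.

```python
import pandas as pd

# Hypothetical sample data standing in for the real vocabulary and DataFrame.
vocab_list = ["it", "is", "a", "long", "established", "fact",
              "various", "versions", "have", "evolved", "the", "generated"]
df = pd.DataFrame({
    "tokenized_sentenced": [
        ["lorem", "ipsum"],
        ["it", "is", "a", "long", "established", "fact"],
        ["various", "versions", "have", "evolved", "toq"],
        ["the", "generated", "lorem", "ipsum"],
    ]
})

# Convert the vocabulary to a set once so each membership test is cheap.
vocab_set = set(vocab_list)

def tokens_not_in_vocab(tokens):
    """Return the tokens missing from vocab_set, in original order, keeping duplicates."""
    return [tok for tok in tokens if tok not in vocab_set]

# Build the whole column in one assignment instead of appending cell by cell.
df["token_not_found_in_vocab"] = df["tokenized_sentenced"].apply(tokens_not_in_vocab)
print(df)
```

The AttributeError in the question suggests that the existing token_not_found_in_vocab cells held strings rather than lists. Assigning the full column as above sidesteps that, since nothing is appended to the old values. This assumes tokenized_sentenced itself holds real Python lists and not their string representations.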
