How to check whether the words in a pandas DataFrame are in a dictionary



Suppose I have a DataFrame with a column of sentences:

data['sentence']
0    i like to move it move it
1    i like to move ir move it
2    you like to move it
3    i liketo move it move it
4    i like to moveit move it
5    ye like to move it

I want to check which sentences contain words that are outside the dictionary (out of vocabulary), like this:

data['sentence']                OOV
0    i like to move it move it      False
1    i like to move ir move it      False
2    you like to move it            False
3    i liketo move it move it       True
4    i like to moveit move it       True
5    ye like to move it             True

Right now I have to iterate over every row:


data['OOV'] = False  # out of vocabulary
for i, row in data.iterrows():
    # split the current row's sentence, not the whole column
    words = set(row['sentence'].split())
    for word in words:
        if word not in dictionary:
            data.at[i, 'OOV'] = True
            break

Is there a way to vectorize (or otherwise speed up) this task?

Without knowing the contents of the dictionary (which I assume is more like a list in the Python sense), your requirement is unclear.

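However, assuming the reference words are those in "i like to move it", here is how to flag the rows whose sentences contain words outside the dictionary: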

dictionary = set(['i', 'like', 'to', 'move', 'it'])
df['OOV'] = df['data'].str.split(' ').apply(lambda x: not set(x).issubset(dictionary))
# only for illustration:
df['words'] = df['data'].str.split(' ').apply(set)
df['words_outside'] = df['data'].str.split(' ').apply(lambda x: set(x).difference(dictionary))

Output:

                        data    OOV                            words words_outside
0  i like to move it move it  False          {like, to, it, i, move}            {}
1  i like to move ir move it   True      {like, to, it, i, move, ir}          {ir}
2        you like to move it   True        {move, like, to, it, you}         {you}
3   i liketo move it move it   True            {liketo, it, move, i}      {liketo}
4   i like to moveit move it   True  {like, to, it, i, move, moveit}      {moveit}
5         ye like to move it   True         {like, to, it, move, ye}          {ye}
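As a possible alternative that avoids a Python-level lambda, the same flag can be sketched with `Series.explode` and `Series.isin`. This assumes the same illustrative vocabulary as above and a pandas version that provides `explode` (0.25 or later); whether it is actually faster depends on the data and should be measured.

```python
import pandas as pd

# Assumed vocabulary, taken from the example sentence "i like to move it".
dictionary = {"i", "like", "to", "move", "it"}

df = pd.DataFrame({"data": [
    "i like to move it move it",
    "i like to move ir move it",
    "you like to move it",
    "i liketo move it move it",
    "i like to moveit move it",
    "ye like to move it",
]})

# One row per word; each word keeps the index label of its source row.
words = df["data"].str.split().explode()

# Mark every word missing from the vocabulary, then collapse back to
# one boolean per original row: True if any of its words is unknown.
df["OOV"] = (~words.isin(dictionary)).groupby(level=0).any()

print(df)
```

Grouping on `level=0` relies on the index being unique, which holds for the default `RangeIndex` used here.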

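Since I don't have the full dictionary or other details, I suggest using df.apply(operation), which is usually faster than iterating with iterrows, although it still calls a Python function for each row.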

pandas.DataFrame.apply
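A minimal sketch of that suggestion, applied to the single `sentence` column with `Series.apply`. The helper name `has_oov` and the vocabulary are assumptions for illustration, not part of the original question.

```python
import pandas as pd

# Assumed vocabulary; replace with the real dictionary.
dictionary = {"i", "like", "to", "move", "it"}

def has_oov(sentence: str) -> bool:
    """Return True if any whitespace-separated word is not in `dictionary`."""
    # issuperset accepts any iterable, so the word list can be passed directly.
    return not dictionary.issuperset(sentence.split())

data = pd.DataFrame({"sentence": [
    "i like to move it move it",
    "i liketo move it move it",
]})

# Call the helper once per sentence instead of looping with iterrows.
data["OOV"] = data["sentence"].apply(has_oov)
```

Keeping `dictionary` as a `set` rather than a list matters here, since membership tests on a set take constant time on average.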
