Suppose I have a DataFrame with a column of sentences:
data['sentence']
0 i like to move it move it
1 i like to move ir move it
2 you like to move it
3 i liketo move it move it
4 i like to moveit move it
5 ye like to move it
I want to check which sentences contain a word that is outside the dictionary (out of vocabulary), like this:
data['sentence'] OOV
0 i like to move it move it False
1 i like to move ir move it False
2 you like to move it False
3 i liketo move it move it True
4 i like to moveit move it True
5 ye like to move it True
Right now I have to iterate over every row:
data['OOV'] = False  # out of vocabulary
for i, row in data.iterrows():
    # split the current row's sentence, not the whole column
    words = set(row['sentence'].split())
    for word in words:
        if word not in dictionary:
            data.at[i, 'OOV'] = True
            break
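Is there a way to vectorize (or otherwise speed up) this task?
Without knowing what the dictionary contains (I assume it is more like a list in the Python sense), your requirement is unclear.
However, assuming the reference words are "i like to move it", here is how to flag the rows whose sentence contains a word outside the dictionary: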
dictionary = set(['i', 'like', 'to', 'move', 'it'])
df['OOV'] = df['data'].str.split(' ').apply(lambda x: not set(x).issubset(dictionary))
# only for illustration:
df['words'] = df['data'].str.split(' ').apply(set)
df['words_outside'] = df['data'].str.split(' ').apply(lambda x: set(x).difference(dictionary))
Output:
data OOV words words_outside
0 i like to move it move it False {like, to, it, i, move} {}
1 i like to move ir move it True {like, to, it, i, move, ir} {ir}
2 you like to move it True {move, like, to, it, you} {you}
3 i liketo move it move it True {liketo, it, move, i} {liketo}
4 i like to moveit move it True {like, to, it, i, move, moveit} {moveit}
5 ye like to move it True {like, to, it, move, ye} {ye}
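The same check can also be written without a per-row lambda. The following is a minimal sketch, assuming the column is named `data` as in the snippet above and that the dictionary is the five-word set used there; it splits each sentence, explodes the words into one row each, and marks a sentence as OOV when any of its words is missing from the dictionary.

```python
import pandas as pd

# Sample data copied from the question; the column name 'data' follows the answer above.
df = pd.DataFrame({"data": [
    "i like to move it move it",
    "i like to move ir move it",
    "you like to move it",
    "i liketo move it move it",
    "i like to moveit move it",
    "ye like to move it",
]})

# Assumed dictionary: the words of "i like to move it".
dictionary = {"i", "like", "to", "move", "it"}

# One row per word; explode() repeats the original row index for each word.
words = df["data"].str.split().explode()

# A word is OOV if it is not in the dictionary;
# a sentence is OOV if any of its words is.
df["OOV"] = (~words.isin(dictionary)).groupby(level=0).any()

print(df)
```

Whether this is faster than the `apply` version depends on the data, so it is worth timing both on a realistic sample.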
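Since I don't have the full context of your dictionary or other details, I suggest using df.apply(operation), which is usually faster than iterating row by row.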
pandas.DataFrame.apply
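Keep in mind that `apply` still calls a Python function once per row, so the gain over `iterrows` comes mostly from lower per-row overhead rather than true vectorization. A plain list comprehension over the column is a common alternative. The sketch below is only an illustration: `flag_oov` is a hypothetical helper name, and the vocabulary is assumed to be the five words from the example above.

```python
import pandas as pd


def flag_oov(sentences, vocabulary):
    """Return one boolean per sentence: True if it contains an out-of-vocabulary word."""
    vocabulary = set(vocabulary)  # set membership tests are O(1) on average
    # issuperset() accepts any iterable, so the split word list needs no set() conversion
    return [not vocabulary.issuperset(sentence.split()) for sentence in sentences]


df = pd.DataFrame({"data": [
    "i like to move it move it",
    "i like to move ir move it",
    "ye like to move it",
]})

# Assumed vocabulary, given here as a list to show that it is converted to a set internally.
df["OOV"] = flag_oov(df["data"], ["i", "like", "to", "move", "it"])
print(df)
```

If the dictionary really is a Python list, converting it to a set once up front, as done inside the helper, is likely to matter more than the choice between `apply` and a comprehension.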