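I have a pandas DataFrame in which one column holds a long string in every row (see the variable 'dframe'). In a separate list I store all the keywords, and I have to compare each of them with every word of every string in the DataFrame. If a keyword is found, I have to record it as a hit, flag it, and note which sentence it was found in. I'm using a complex for loop with a few "if" statements, which gives me the correct output but is not very efficient. It takes nearly 4 hours to run on my whole data set, where I have 130 keywords and thousands of rows to iterate over.
I thought of applying some lambda function to optimize this, and that is what I am struggling with. Below I show the idea of my data set and my current code.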
import pandas as pd
from fuzzywuzzy import fuzz

dframe = pd.DataFrame({ 'Email' : ['this is a first very long e-mail about fraud and money',
                                   'this is a second e-mail about money',
                                   'this would be a next message where people talk about secret information',
                                   'this is a sentence where someone misspelled word frad',
                                   'this sentence has no keyword']})
keywords = ['fraud','money','secret']
keyword_set = set(keywords)

dframe['Flag'] = False
dframe['part_word'] = 0
output = []
for k in range(0, len(keywords)):
    count_ = 0
    dframe['Flag'] = False
    for j in range(0, len(dframe['Email'])):
        row_list = []
        print(str(k) + ' / ' + str(len(keywords)) + ' || ' + str(j) + ' / ' + str(len(dframe['Email'])))
        for i in dframe['Email'][j].split():
            if dframe['part_word'][j] != 0 :
                row_list = dframe['part_word'][j]
            fuz_part = fuzz.partial_ratio(keywords[k].lower(),i.lower())
            fuz_set = fuzz.token_set_ratio(keywords[k],i)
            if ((fuz_part > 90) | (fuz_set > 85)) & (len(i) > 3):
                if keywords[k] not in row_list:
                    row_list.append(keywords[k])
                    print(keywords[k] + ' found as : ' + i)
                dframe['Flag'][j] = True
                dframe['part_word'][j] = row_list
    count_ = dframe['Flag'].values.sum()
    if count_ > 0:
        y = keywords[k] + ' ' + str(count_)
        output.append(y)
    else:
        y = keywords[k] + ' ' + '0'
        output.append(y)
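Maybe someone with experience in lambda functions could give me a hint on how to apply one to my DataFrame to perform a similar operation? It would somehow have to apply the fuzzy matching inside the lambda after splitting the whole sentence of each row into separate words, and then pick the word with the highest match value that satisfies the condition. Thanks in advance for any help.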
I don't have a lambda function for you, but here is a function that you can apply to dframe.Email:
import pandas as pd
from fuzzywuzzy import fuzz
First create the same example DataFrame as yours:
dframe = pd.DataFrame({ 'Email' : ['this is a first very long e-mail about fraud and money',
                                   'this is a second e-mail about money',
                                   'this would be a next message where people talk about secret information',
                                   'this is a sentence where someone misspelled word frad',
                                   'this sentence has no keyword']})
keywords = ['fraud','money','secret']
This is the function to apply:
def fct(sntnc, kwds):
    # Returns one boolean per keyword: True if any word of the sentence matches it.
    words = sntnc.split()
    bL = [len(w) > 3 for w in words]  # only words longer than 3 characters count
    mtch = []
    for kwd in kwds:
        fuz_part = [fuzz.partial_ratio(kwd.lower(), w.lower()) > 90 for w in words]
        fuz_set = [fuzz.token_set_ratio(kwd, w) > 85 for w in words]
        mtch.append(any((p | s) & l for p, s, l in zip(fuz_part, fuz_set, bL)))
    return mtch
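For each keyword, it computes fuz_part > 90 for every word in the sentence, and does the same for fuz_set > 85 and for wordlength > 3. Finally, for each keyword, it stores in the list whether any word of the sentence satisfies ((fuz_part > 90) | (fuz_set > 85)) & (wordlength > 3).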
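As a rough illustration of that condition, here is a variant that checks the word length first and stops at the first matching word instead of building three full lists per keyword. The names `simple_ratio`, `word_matches` and `fct_lazy` are made up for this sketch, and `simple_ratio` is only a `difflib`-based stand-in so that the snippet runs without fuzzywuzzy; it is not how fuzzywuzzy computes `partial_ratio` or `token_set_ratio`, so scores can differ.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    # Stand-in scorer on a 0-100 scale (assumption: NOT fuzzywuzzy's algorithm).
    return round(100 * SequenceMatcher(None, a, b).ratio())


def word_matches(kwd, word, partial=simple_ratio, token_set=simple_ratio):
    # Same condition as in fct: (fuz_part > 90 or fuz_set > 85) and len(word) > 3.
    # The length test comes first, so short words are never scored.
    return len(word) > 3 and (
        partial(kwd.lower(), word.lower()) > 90 or token_set(kwd, word) > 85
    )


def fct_lazy(sntnc, kwds, **scorers):
    # Split once; any() on a generator stops at the first matching word.
    words = sntnc.split()
    return [any(word_matches(kwd, w, **scorers) for w in words) for kwd in kwds]


flags = fct_lazy('this is a sentence where someone misspelled word frad',
                 ['fraud', 'money', 'secret'])
print(flags)
```

To use the real scorers, it should be enough to call `fct_lazy(sntnc, kwds, partial=fuzz.partial_ratio, token_set=fuzz.token_set_ratio)`.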
This is how it is applied and how the result is created:
s = dframe.Email.apply(fct, kwds=keywords)
# Expand each row's list of booleans into one column per keyword.
s = s.apply(pd.Series).set_axis(keywords, axis=1)
dframe = pd.concat([dframe, s], axis=1)
Result:
result = dframe.drop(columns='Email')
#    fraud  money  secret
# 0   True   True   False
# 1  False   True   False
# 2  False  False    True
# 3   True  False   False
# 4  False  False   False
result.sum()
# fraud     2
# money     2
# secret    1
# dtype: int64
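Since the question mentions 130 keywords and thousands of rows, one more idea that might help: e-mails tend to reuse the same words, so each keyword can be scored against every distinct word once, and the rows are then flagged with plain set lookups. Below is a minimal sketch under that assumption. `is_match` and `flag_emails` are hypothetical names, and `is_match` again uses `difflib` as a stand-in rather than fuzzywuzzy, so it only approximates the thresholds above.

```python
from difflib import SequenceMatcher


def is_match(kwd, word):
    # Stand-in test (assumption: NOT fuzzywuzzy): similarity above 0.85
    # for words longer than 3 characters.
    return len(word) > 3 and SequenceMatcher(None, kwd, word).ratio() > 0.85


def flag_emails(emails, keywords, match=is_match):
    # Tokenize every e-mail once and collect the distinct words.
    tokenized = [set(text.lower().split()) for text in emails]
    vocab = set().union(*tokenized)
    # Score each keyword against each distinct word exactly once.
    hits = {kwd: {w for w in vocab if match(kwd.lower(), w)} for kwd in keywords}
    # A row is flagged for a keyword if it contains any of that keyword's hit words.
    return [[not words.isdisjoint(hits[kwd]) for kwd in keywords]
            for words in tokenized]


emails = ['this is a first very long e-mail about fraud and money',
          'this is a second e-mail about money',
          'this would be a next message where people talk about secret information',
          'this is a sentence where someone misspelled word frad',
          'this sentence has no keyword']
keywords = ['fraud', 'money', 'secret']
rows = flag_emails(emails, keywords)
```

The number of scorer calls then grows with the number of distinct words rather than with the total number of words. The rows can be attached to the frame with `pd.DataFrame(rows, columns=keywords, index=dframe.index)`, and the fuzzywuzzy condition can be plugged in with something like `match=lambda k, w: len(w) > 3 and (fuzz.partial_ratio(k, w) > 90 or fuzz.token_set_ratio(k, w) > 85)`.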