Background
I have a df
import pandas as pd
import nltk
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
df = pd.DataFrame({'ID': [1, 2, 3],
                   'Text': ['This num dogs and cats is (111)888-8780 and other',
                            'dont block cow 23 here',
                            'cat two num: dog and cows here']
                   })
I also have a list
word_list = ['dog', 'cat', 'cow']
and a function that fuzzy-matches the Text column of df against word_list
def fuzzy(row, word_list):
    tweet = row[0]
    fuzzy_match = []
    for word in word_list:
        token_words = nltk.word_tokenize(tweet)
        for token in range(0, len(token_words) - 1):
            fuzzy_fx = process.extract(word_list[word], token_words[token], limit=100, scorer = fuzz.ratio)
            fuzzy_match.append(fuzzy_fx[0])
    return pd.Series([fuzzy_match], index = ['Fuzzy_Match'])
I then join the result onto df
df_fuzz = df.join(df.apply(lambda x: fuzzy(x, word_list), axis = 1))
But I get an error
TypeError: expected string or bytes-like object
Desired output
My desired output is a new column, Fuzzy_Match, holding the output of the fuzzy function:
   ID  Text                                               Fuzzy_Match
0  1   This num dogs and cats is (111)888-8780 and other  output of fuzzy 1
1  2   dont block cow 23 here                             output of fuzzy 2
2  3   cat two num: dog and cows here                     output of fuzzy 3
Question
What do I need to do to get the desired output?
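This should work. In your version, row[0] is the integer ID, and passing it to nltk.word_tokenize is what raises the TypeError; the text lives in the Text column: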
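def fuzzy(row, word_list):
    # Read the text by column label; row[0] would be the integer ID.
    tweet = row['Text']
    fuzzy_match = []
    # Tokenize once per row rather than once per word.
    token_words = nltk.word_tokenize(tweet)
    for word in word_list:
        # Score the word against every token; results come back best-first.
        fuzzy_fx = process.extract(word, token_words, limit=100, scorer=fuzz.ratio)
        # Keep only the best (token, score) pair for this word.
        fuzzy_match.append(fuzzy_fx[0])
    return pd.Series([fuzzy_match], index=['Fuzzy_Match'])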
df_fuzz = df.join(df.apply(lambda x: fuzzy(x, word_list), axis = 1))
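process.extract() expects the query string as its first argument and a list of choices as its second, so pass each word and the full token list rather than a single token. You can read more about it here: "Python fuzzywuzzy's process.extract(): how does it work?"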
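If it helps to see the matching logic on its own, here is a minimal sketch that runs without fuzzywuzzy or nltk. It is not the code from the answer above: it assumes difflib.SequenceMatcher from the standard library as a stand-in for fuzz.ratio, and a plain whitespace split as a stand-in for nltk.word_tokenize, so scores and tokens may differ from what the real libraries give. The names ratio and best_matches are made up for this example.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity score from 0 to 100 (assumed stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_matches(text, word_list):
    """For each word in word_list, return its best (token, score) pair."""
    # Whitespace split only; nltk.word_tokenize would also split punctuation.
    tokens = text.split()
    if not tokens:
        return []
    matches = []
    for word in word_list:
        # Score the word against every token, mirroring the idea of passing
        # the whole token list as the choices, then keep the top scorer.
        scored = [(token, ratio(word, token)) for token in tokens]
        matches.append(max(scored, key=lambda pair: pair[1]))
    return matches


word_list = ['dog', 'cat', 'cow']
for text in ['This num dogs and cats is (111)888-8780 and other',
             'dont block cow 23 here',
             'cat two num: dog and cows here']:
    print(best_matches(text, word_list))
```

With pandas, the same function could be applied to the column directly, for example df['Fuzzy_Match'] = df['Text'].apply(lambda t: best_matches(t, word_list)), which avoids indexing into the row at all.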