I have a DataFrame that looks like this:
import pandas as pd
from pandas import DataFrame
df = pd.DataFrame({'CEOThisYr': ['Douglas Davis', 'Doug Davis', 'James Taylor', 'Jane Smith'], 'CEOLastYr': ['Doug Davis', 'James Taylor', 'Jane Smith', 'Sarah Jones']})
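My goal is to use the hmni package to create a new column containing the match probability score for the pair of names in each row. The code I tried is: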
import hmni
matcher = hmni.Matcher(model='latin')
df['MatchPercent'] = matcher.similarity(df['CEOThisYr'], df['CEOLastYr'])
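This returns the error: TypeError: Only string comparison is supported in similarity method
I have tried converting both columns to strings, but I still get the same error. Any idea where I am going wrong?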
Update:
Thanks to the helpful comments below, I was able to come up with:
import hmni
matcher = hmni.Matcher(model='latin')
df['MatchPercent'] = df.apply(lambda x: matcher.similarity(x['CEOThisYr'], x['CEOLastYr']), axis=1)
I went from taking 9 seconds to process 1 row to processing roughly 5,000 rows per second.
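According to the hmni documentation, similarity takes two str values as its first and second arguments. You are passing two pandas.Series instead, namely df['CEOThisYr'] and df['CEOLastYr']. Casting the columns to strings does not change this, because each column is still a Series rather than a single str. You can try using pandas.DataFrame.apply with axis=1 to apply similarity to each row: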
>>> import hmni
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({'CEOThisYr': ['Douglas Davis', 'Doug Davis', 'James Taylor', 'Jane Smith'], 'CEOLastYr': ['Doug Davis', 'James Taylor', 'Jane Smith', 'Sarah Jones']})
>>> df
       CEOThisYr     CEOLastYr
0  Douglas Davis    Doug Davis
1     Doug Davis  James Taylor
2   James Taylor    Jane Smith
3     Jane Smith   Sarah Jones
>>>
>>> matcher = hmni.Matcher(model='latin')
>>> df['MatchPercent'] = df.apply(lambda x: matcher.similarity(x['CEOThisYr'], x['CEOLastYr']), axis=1)
>>> df
       CEOThisYr     CEOLastYr  MatchPercent
0  Douglas Davis    Doug Davis      0.922682
1     Doug Davis  James Taylor      0.000000
2   James Taylor    Jane Smith      0.000000
3     Jane Smith   Sarah Jones      0.000000
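
If apply turns out to be awkward, the same row-by-row idea can be written as a plain loop over paired values. The sketch below is only an illustration: stand_in_similarity is a hypothetical placeholder built on difflib from the standard library, not hmni's model, and pairwise_scores is a helper name made up for this example. It is there so the snippet runs without hmni installed; its scores will not match the ones shown above.

```python
from difflib import SequenceMatcher


def stand_in_similarity(name_a, name_b):
    """Hypothetical stand-in for matcher.similarity: two str in, one float in [0, 1] out."""
    # Reject anything that is not a single string, such as a whole column.
    if not isinstance(name_a, str) or not isinstance(name_b, str):
        raise TypeError("only str arguments are supported")
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


def pairwise_scores(scorer, left, right):
    """Call scorer once per (left, right) pair and return the scores in order."""
    return [scorer(a, b) for a, b in zip(left, right)]


this_year = ['Douglas Davis', 'Doug Davis', 'James Taylor', 'Jane Smith']
last_year = ['Doug Davis', 'James Taylor', 'Jane Smith', 'Sarah Jones']

# One call per row, instead of one call with both whole columns.
scores = pairwise_scores(stand_in_similarity, this_year, last_year)
for a, b, s in zip(this_year, last_year, scores):
    print(f"{a!r} vs {b!r}: {s:.3f}")
```

Since zip iterates over the values of a pandas.Series, something like df['MatchPercent'] = pairwise_scores(matcher.similarity, df['CEOThisYr'], df['CEOLastYr']) should behave the same way as the apply version, assuming matcher is the hmni.Matcher created earlier.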