Creating a column of match probabilities with the hmni Python package



I have a DataFrame that looks like this:

import pandas as pd

# Two columns of CEO names to compare row by row
df = pd.DataFrame({'CEOThisYr': ['Douglas Davis', 'Doug Davis', 'James Taylor', 'Jane Smith'], 'CEOLastYr': ['Doug Davis', 'James Taylor', 'Jane Smith', 'Sarah Jones']})

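My goal is to use the hmni package to create a new column holding the match probability score for the pair of names in each row. The code I tried is: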

import hmni
matcher = hmni.Matcher(model='latin')
df['MatchPercent'] = matcher.similarity(df['CEOThisYr'], df['CEOLastYr'])

This returns the error: TypeError: Only string comparison is supported in similarity method

I have already tried converting both columns to strings, but I still get the same error. Any idea what I am doing wrong?

Update

Thanks to the helpful comments below, I was able to come up with:

import hmni
matcher = hmni.Matcher(model='latin')
df['MatchPercent'] = df.apply(lambda x: matcher.similarity(x['CEOThisYr'], x['CEOLastYr']), axis=1)

That took me from 9 seconds to process 1 row to roughly 5,000 rows per second.

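According to the hmni documentation, similarity takes two strs as its first and second arguments. You are passing two pandas.Series instead, namely df['CEOThisYr'] and df['CEOLastYr']. Converting the columns to a string dtype does not help, because each argument is still a Series rather than a single str. You can try using pandas.DataFrame.apply with axis=1 to apply similarity to each row.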

>>> import hmni
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({'CEOThisYr': ['Douglas Davis', 'Doug Davis', 'James Taylor', 'Jane Smith'], 'CEOLastYr': ['Doug Davis', 'James Taylor', 'Jane Smith', 'Sarah Jones']})
>>> df
       CEOThisYr     CEOLastYr
0  Douglas Davis    Doug Davis
1     Doug Davis  James Taylor
2   James Taylor    Jane Smith
3     Jane Smith   Sarah Jones
>>>
>>> matcher = hmni.Matcher(model='latin')
>>> df['MatchPercent'] = df.apply(lambda x: matcher.similarity(x['CEOThisYr'], x['CEOLastYr']), axis=1)
>>> df
       CEOThisYr     CEOLastYr  MatchPercent
0  Douglas Davis    Doug Davis      0.922682
1     Doug Davis  James Taylor      0.000000
2   James Taylor    Jane Smith      0.000000
3     Jane Smith   Sarah Jones      0.000000
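
The same row-by-row idea can also be written with zip, optionally caching repeated name pairs. The following is only a minimal sketch: it does not call hmni at all. The similarity function here is a stand-in built on the standard library's difflib, written to reject non-string input the way the error above describes, and the names cached_similarity and score_pairs are hypothetical helpers introduced for illustration.

```python
from difflib import SequenceMatcher
from functools import lru_cache


def similarity(name_a, name_b):
    """Stand-in scorer (NOT hmni): accepts two str values, returns a float in [0, 1]."""
    if not (isinstance(name_a, str) and isinstance(name_b, str)):
        # Passing a whole column (list, Series, ...) instead of one name lands here.
        raise TypeError("Only string comparison is supported")
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


@lru_cache(maxsize=None)
def cached_similarity(name_a, name_b):
    # A repeated (name_a, name_b) pair is scored once, then served from the cache.
    return similarity(name_a, name_b)


def score_pairs(left, right):
    """Score two equal-length iterables of names element by element."""
    return [cached_similarity(a, b) for a, b in zip(left, right)]


this_year = ['Douglas Davis', 'Doug Davis', 'James Taylor', 'Jane Smith']
last_year = ['Doug Davis', 'James Taylor', 'Jane Smith', 'Sarah Jones']

scores = score_pairs(this_year, last_year)
for a, b, s in zip(this_year, last_year, scores):
    print(f"{a!r} vs {b!r}: {s:.3f}")
```

With a DataFrame, the two columns can be passed directly, for example df['MatchPercent'] = score_pairs(df['CEOThisYr'], df['CEOLastYr']), since zip iterates over the values of each Series. To use hmni instead of the stand-in, the body of cached_similarity would presumably call matcher.similarity. Whether caching gives a noticeable speedup depends on how often the same pair of names recurs in the data.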
