将pandas数据框中的值替换为具有特定条件的其他列中的值



你好,我有以下表格

tbody> <<tr>224
value description
2Aron
亚伦
Aronn
3约翰
3乔恩
Liam

使用duplicated()ffill():

>>> df.where(~df["value"].duplicated()).ffill()
value description
0    2.0        Aron
1    2.0        Aron
2    2.0        Aron
3    3.0        John
4    3.0        John
5    4.0        Liam

将数据帧按值分组,然后调用transform,将first作为函数传递,取组内Description列的第一个值

>>> df['description']=df.groupby('value')['description'].transform('first')
value description
0      2        Aron
1      2        Aron
2      2        Aron
3      3        John
4      3        John
5      4        Liam

我可能离得太远了,但看起来你正试图"修复"名字。如果是这样,也许下面的代码是值得思考的。

它接受一个像…

value description
0      2        Aron
1      2       Aaron
2      2       Aronn
3      3        John
4      3        Jhon
5      4        Liam

…和输出…

value description
0      2       Aaron
1      2       Aaron
2      2       Aaron
3      3        John
4      3        John
5      4        Liam

这是基于获得已知名字的良好列表(好吧,我从快速Google中找到的第一个列表:0)并使用fuzzywuzzy尝试匹配列description中已经存在的内容。也许我们可以改正一些错别字!

代码:

from fuzzywuzzy import fuzz, process
import pandas as pd
import requests #not always required but needed for some sites so doing it this way
# create a dataframe with a column of names in 'description'
df = pd.DataFrame({'value': {0: 2, 1: 2, 2: 2, 3: 3, 4: 3, 5: 4},
'description': {0: 'Aron',
1: 'Aaron',
2: 'Aronn',
3: 'John',
4: 'Jhon',
5: 'Liam'}})
# find a list somewhere on the t'internet of first names (Liam, John, etc. looks like English)
url = r'https://www.britishbabynames.com/blog/data-analysis-and-statistics/'
# this takes about 20 seconds!
page = requests.get(url)
table = pd.read_html(page.text)
# tidy up the table we downloaded
df_bnames = table[0][2:]
df_bnames = df_bnames[1]
df_bnames = df_bnames.dropna()
df_bnames = df_bnames.reset_index(drop=True)
# put the names from the web into a list
source_names = list(df_bnames)
# fuzzywuzzy function - magic kung fu
def match_names(name, list_names, min_score=0):
max_score = -1
max_name = ''
for x in list_names:
score = fuzz.ratio(name, x)
if (score > min_score) & (score > max_score):
max_name = x
max_score = score
return (max_name, max_score)
# from the original frame get the unique names as a list
uniquenames = list(df.description.unique())
# score limit - may need to adjust up or down.  Depends what's in the dict below
score = 50
#using fuzzywuzzy compare one name list to another
names = []
for x in uniquenames:
match = match_names(x, source_names, score)
if match[1] >= score:
name = (str(x), str(match[0]))
names.append(name)
# we have a dictionary of {...'Jhon': 'John'...} now.  
name_dict = dict(names)
# output the original frame
print(df,'n')
# swap the values in 'description' with the 'better' names in the dictionary
df = df.replace({'description': name_dict})
# see the changes
print(df)

所以结果是…

value description
0      2       Aaron
1      2       Aaron
2      2       Aaron
3      3        John
4      3        John
5      4        Liam

固定Jhon,替换AronAronn

相关内容

最新更新