你好,我有以下表格
value | description | 2 | Aron | 2
---|---|
亚伦 | |
Aronn | |
3 | 约翰 |
3 | 乔恩 |
Liam |
使用duplicated()
和ffill()
:
>>> df.where(~df["value"].duplicated()).ffill()
value description
0 2.0 Aron
1 2.0 Aron
2 2.0 Aron
3 3.0 John
4 3.0 John
5 4.0 Liam
将数据帧按值分组,然后调用transform,将first
作为函数传递,取组内Description
列的第一个值
>>> df['description']=df.groupby('value')['description'].transform('first')
value description
0 2 Aron
1 2 Aron
2 2 Aron
3 3 John
4 3 John
5 4 Liam
我可能离得太远了,但看起来你正试图"修复"名字。如果是这样,也许下面的代码是值得思考的。
它接受一个像…
value description
0 2 Aron
1 2 Aaron
2 2 Aronn
3 3 John
4 3 Jhon
5 4 Liam
…和输出…
value description
0 2 Aaron
1 2 Aaron
2 2 Aaron
3 3 John
4 3 John
5 4 Liam
这是基于获得已知名字的良好列表(好吧,我从快速Google中找到的第一个列表:0)并使用fuzzywuzzy尝试匹配列description
中已经存在的内容。也许我们可以改正一些错别字!
代码:
from fuzzywuzzy import fuzz, process
import pandas as pd
import requests #not always required but needed for some sites so doing it this way
# create a dataframe with a column of names in 'description'
df = pd.DataFrame({'value': {0: 2, 1: 2, 2: 2, 3: 3, 4: 3, 5: 4},
'description': {0: 'Aron',
1: 'Aaron',
2: 'Aronn',
3: 'John',
4: 'Jhon',
5: 'Liam'}})
# find a list somewhere on the t'internet of first names (Liam, John, etc. looks like English)
url = r'https://www.britishbabynames.com/blog/data-analysis-and-statistics/'
# this takes about 20 seconds!
page = requests.get(url)
table = pd.read_html(page.text)
# tidy up the table we downloaded
df_bnames = table[0][2:]
df_bnames = df_bnames[1]
df_bnames = df_bnames.dropna()
df_bnames = df_bnames.reset_index(drop=True)
# put the names from the web into a list
source_names = list(df_bnames)
# fuzzywuzzy function - magic kung fu
def match_names(name, list_names, min_score=0):
max_score = -1
max_name = ''
for x in list_names:
score = fuzz.ratio(name, x)
if (score > min_score) & (score > max_score):
max_name = x
max_score = score
return (max_name, max_score)
# from the original frame get the unique names as a list
uniquenames = list(df.description.unique())
# score limit - may need to adjust up or down. Depends what's in the dict below
score = 50
#using fuzzywuzzy compare one name list to another
names = []
for x in uniquenames:
match = match_names(x, source_names, score)
if match[1] >= score:
name = (str(x), str(match[0]))
names.append(name)
# we have a dictionary of {...'Jhon': 'John'...} now.
name_dict = dict(names)
# output the original frame
print(df,'n')
# swap the values in 'description' with the 'better' names in the dictionary
df = df.replace({'description': name_dict})
# see the changes
print(df)
所以结果是…
value description
0 2 Aaron
1 2 Aaron
2 2 Aaron
3 3 John
4 3 John
5 4 Liam
固定Jhon
,替换Aron
和Aronn
。