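I have this DataFrame and I am trying to clean up typos that people made when entering first and last names. How would I clean the dataset? Could conditional statements help?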
Date        Last   First  City        Type
2016-01-01  smith  john   Riley Park  Staff
2016-01-02  smit   john   Riley Park  Staff
2016-01-03  smith  john   Riley Park  Staff
2016-01-04  smith  joh    Riley Park  Staff
2016-01-08  smith  john   Riley Park  Contractor
2016-01-04  smith  john   Fairview    Staff
2016-01-02  baker  bob    Strathcona  Staff
2016-01-03  bake   bob    Strathcona  Staff
2016-01-04  baker  bob    Strathcona  Staff
Desired cleaned dataset:
Date        Last   First  City        Type
2016-01-01  smith  john   Riley Park  Staff
2016-01-02  smith  john   Riley Park  Staff
2016-01-03  smith  john   Riley Park  Staff
2016-01-04  smith  john   Riley Park  Staff
2016-01-08  smith  john   Riley Park  Contractor
2016-01-04  smith  john   Fairview    Staff
2016-01-02  baker  bob    Strathcona  Staff
2016-01-03  baker  bob    Strathcona  Staff
2016-01-04  baker  bob    Strathcona  Staff
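I'm really stuck on how to clean this. I thought about creating separate DataFrames and then merging them, but I'm hoping an expert can help.
Edit: I only want to replace a name when the City and Type are the same.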
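from thefuzz import fuzz

def correct_typo(typo, ref_names, ratio=80):
    # Return the first reference name that is similar enough to the typo.
    for name in ref_names:
        if fuzz.ratio(typo, name) > ratio:
            return name
    # No close match: keep the value unchanged.
    return typo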
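The function above needs a list of reference names, and the answer does not say where that list comes from. Here is a minimal sketch of one way to get it while respecting the "same City and Type" rule from the question. It assumes the most frequent spelling inside each (City, Type) group is the correct one. It also swaps in the standard library's difflib.SequenceMatcher for thefuzz so it runs without extra packages. The names similarity and correct_column are hypothetical helpers, not part of any library.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score (stdlib stand-in for fuzz.ratio)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def correct_column(rows, column, keys=("City", "Type"), threshold=80):
    """Replace near-misses in `column` with a more common spelling from the same group."""
    # Count how often each spelling occurs within every (City, Type) group.
    counts = defaultdict(Counter)
    for row in rows:
        counts[tuple(row[k] for k in keys)][row[column]] += 1

    cleaned = []
    for row in rows:
        group = counts[tuple(row[k] for k in keys)]
        value = row[column]
        # Assumption: only a strictly more frequent spelling may act as a reference.
        for ref, n in group.most_common():
            if n > group[value] and similarity(value, ref) > threshold:
                value = ref
                break
        cleaned.append({**row, column: value})
    return cleaned


rows = [
    {"Last": "smith", "First": "john", "City": "Riley Park", "Type": "Staff"},
    {"Last": "smit", "First": "john", "City": "Riley Park", "Type": "Staff"},
    {"Last": "smith", "First": "joh", "City": "Riley Park", "Type": "Staff"},
    {"Last": "baker", "First": "bob", "City": "Strathcona", "Type": "Staff"},
    {"Last": "bake", "First": "bob", "City": "Strathcona", "Type": "Staff"},
    {"Last": "baker", "First": "bob", "City": "Strathcona", "Type": "Staff"},
]
for col in ("Last", "First"):
    rows = correct_column(rows, col)
```

With a DataFrame, the same idea could be applied per group via groupby. A group with a single row (such as Fairview / Staff in the question) has nothing to compare against and is left unchanged.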
You can use where with a selection condition: values for which the condition is not met are replaced with the value you supply.
import pandas as pd

df = pd.DataFrame({"Date": ["2016-01-01", "2016-01-02", "2016-01-03"], "Name": ["smith", "smi", "Fathallah"], "LastName": ["john", "jon", "Mohamed"]})
Date        Name       LastName
2016-01-01  smith      john
2016-01-02  smi        jon
2016-01-03  Fathallah  Mohamed
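# where keeps values where the condition is True and replaces the rest,
# so the condition must be "does NOT start with the prefix".
df["LastName"] = df["LastName"].where(~df["LastName"].str.startswith("jo"), "john")
df["Name"] = df["Name"].where(~df["Name"].str.startswith("sm"), "smith")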
Date        Name       LastName
2016-01-01  smith      john
2016-01-02  smith      john
2016-01-03  Fathallah  Mohamed
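The direction of the condition in where is easy to get backwards, so here is a small sketch contrasting Series.where with its counterpart Series.mask. The variable names (names, starts_with_jo, via_where, via_mask) are only illustrative.

```python
import pandas as pd

names = pd.Series(["john", "jon", "Mohamed"])
starts_with_jo = names.str.startswith("jo")

# where: keep values where the condition is True, replace the rest.
via_where = names.where(~starts_with_jo, "john")

# mask: replace values where the condition is True, keep the rest.
via_mask = names.mask(starts_with_jo, "john")
```

Both calls give the same result here. mask reads more naturally when the condition describes the rows to fix. Note that a prefix test is a blunt tool: it would also rewrite a legitimate name such as "joan".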
If you have a list of all the typos, simply use replace:
df.replace(['smit', 'joh', 'bake'], ['smith', 'john', 'baker'])
If you are sure that the row above a typo always holds a correct value, use replace with the "ffill" method:
df.replace(['joh', 'bake', 'smit'], method='ffill')
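As far as I know, the method argument of replace is deprecated in recent pandas releases, so the call above may emit a warning or fail depending on your version. A rough equivalent, sketched on a made-up subset of the question's data, is to blank out the known typos and then forward-fill. The names typos and name_cols are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Last": ["smith", "smit", "smith", "baker", "bake"],
    "First": ["john", "john", "joh", "bob", "bob"],
})
typos = ["joh", "bake", "smit"]
name_cols = ["Last", "First"]

# Turn every known typo into a missing value, then fill each gap
# with the value from the row above it.
df[name_cols] = df[name_cols].mask(df[name_cols].isin(typos)).ffill()
```

This carries the same caveat as the original: it only gives the right answer if the row directly above a typo belongs to the same person.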
To replace only when City and Type are the same:
df_gby = df.groupby(['City', 'Type'])
pd.concat(
    [
        df_gby.get_group(group).replace(['joh', 'bake', 'smit'], method='ffill')
        for group in df_gby.groups
    ]
)
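Above, we grouped df by City and Type, iterated over each group, and performed the replacement within it.
This way, a typo is only ever filled from rows that share the same City and Type.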
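The per-group loop with concat returns the rows ordered by group rather than in their original order. Here is a sketch of a variant that keeps the original row order by forward-filling inside each group directly. It uses a made-up subset of the question's data, and the names typos, name_cols, blanked and filled are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Last": ["smith", "smit", "smith", "baker", "bake"],
    "First": ["john", "john", "joh", "bob", "bob"],
    "City": ["Riley Park", "Riley Park", "Fairview", "Strathcona", "Strathcona"],
    "Type": ["Staff"] * 5,
})
typos = ["joh", "bake", "smit"]
name_cols = ["Last", "First"]

# Turn known typos into missing values.
blanked = df[name_cols].mask(df[name_cols].isin(typos))

# Forward-fill within each (City, Type) group only, so a gap is never
# filled from a row belonging to a different group.
filled = blanked.groupby([df["City"], df["Type"]]).ffill()

# If a group has no earlier correct row, fall back to the original value.
df[name_cols] = filled.fillna(df[name_cols])
```

In this sample the "joh" in the Fairview row stays as it is, because no earlier row in the Fairview / Staff group can supply a correction.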