How to rename and replace values in a column using a boolean mask



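I am working with a DataFrame in which some columns have a missing category, denoted by '?'. I am trying to use a boolean mask to replace the '?' category in the column labeled workclass with 'Private'. The data is read in as: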

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *
url2="https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" #Reading in Data from a freely and easily available source on the internet
Adult = pd.read_csv(url2, header=None)
##Assigning column names to the dataframe
Adult.columns = ["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",  
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]

I have tried running this code:

MissingValue = Adult.loc[:, "workclass"] == "?"
Adult.loc[MissingValue, "workclass"] = "Private"

Adult.loc[ Adult.loc[:, "workclass"] == "?", "workclass"] = "Private"

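The code runs without any errors, but when I check the column values with (Adult.loc[:,'workclass'].value_counts()), the '?' is still there. The code Adult['workclass'] = Adult['workclass'].str.replace('?', 'Private') does what I want, but I would like to be able to do it with a boolean mask. Any suggestions as to why this is happening?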

The problem is that your values do not match '?' exactly; they are probably something like ' ?', with a leading space.

You can see this because:

Adult.loc[Adult['workclass']=='?',:]

returns an empty DataFrame, whereas

Adult.loc[Adult['workclass'].str.strip()=='?',:]

returns 1836 rows.

strip removes leading and trailing whitespace, so you do not have to test for ' ?', '? ', and ' ? ' separately.

So when you change your code slightly, like this:

MissingValue = Adult.loc[:, "workclass"].str.strip() == "?"
Adult.loc[MissingValue, "workclass"] = "Private"

you will see that the "?" has disappeared from value_counts().
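The comparison above can be tried on a few toy rows without downloading the dataset. This is only a minimal sketch: the DataFrame `df` and its four values are made up for illustration, and they assume the same leading-space padding seen in the real column.

```python
import pandas as pd

# Toy stand-in for the "workclass" column (made-up values). Each value has a
# leading space, as assumed for data read without skipinitialspace.
df = pd.DataFrame({"workclass": [" Private", " ?", " State-gov", " ?"]})

# An exact comparison matches nothing, because " ?" != "?".
exact_mask = df["workclass"] == "?"
print(exact_mask.sum())

# Stripping whitespace before comparing finds the placeholder rows.
stripped_mask = df["workclass"].str.strip() == "?"
print(stripped_mask.sum())

# Boolean-mask assignment, the same pattern as in the question.
df.loc[stripped_mask, "workclass"] = "Private"
print(df["workclass"].tolist())
```

One thing to watch for: only the comparison is stripped here, so the untouched values keep their leading space. The replaced rows hold 'Private' while the original rows still hold ' Private', and value_counts() will report them as two separate categories unless the whole column is stripped as well.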

There is a space after each separator, so add the skipinitialspace parameter:

Adult = pd.read_csv(url2, header=None, skipinitialspace=True)

Then your code works correctly:

MissingValue = Adult["workclass"] == "?"
Adult.loc[MissingValue, "workclass"] = "Private"
print ((Adult['workclass'].value_counts().index.tolist()))
['Private', 'Self-emp-not-inc', 'Local-gov', 'State-gov', 
'Self-emp-inc', 'Federal-gov', 'Without-pay', 'Never-worked']
print ((Adult['workclass'].value_counts()))
Private             24532
Self-emp-not-inc     2541
Local-gov            2093
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64
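The effect of skipinitialspace can also be checked offline on a small in-memory sample. The sketch below assumes two made-up rows in the same ", "-separated layout; the variable names `sample`, `raw`, and `clean` are hypothetical and not part of the original code.

```python
import io

import pandas as pd

# Two made-up rows using ", " (comma plus space) as the separator.
sample = "30, State-gov, 1000\n40, ?, 2000\n"

# Default parsing keeps the space that follows each comma in string fields.
raw = pd.read_csv(io.StringIO(sample), header=None)

# skipinitialspace=True tells the parser to drop spaces after the delimiter.
clean = pd.read_csv(io.StringIO(sample), header=None, skipinitialspace=True)

print(raw[1].tolist())    # string values still carry the leading space
print(clean[1].tolist())  # leading space removed at parse time
```

Handling the padding at read time means every string column is cleaned in one step, so later comparisons such as == "?" need no extra stripping.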

You can verify the spaces using your own code:

Adult = pd.read_csv(url2, header=None)
#Assigning column names to the dataframe
Adult.columns = ["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",  
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]
print ((Adult['workclass'].value_counts().index.tolist()))
[' Private', ' Self-emp-not-inc', ' Local-gov', ' ?', ' State-gov',
' Self-emp-inc', ' Federal-gov', ' Without-pay', ' Never-worked']
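Hidden padding like this can also be detected programmatically rather than by eye. The following is a rough sketch on a made-up Series `s`; comparing each value with its stripped version is one possible check, not something taken from the original answers.

```python
import pandas as pd

# Made-up values: two padded with a leading space, one clean.
s = pd.Series([" Private", " ?", "Local-gov"], name="workclass")

# True wherever stripping changes the value, i.e. wherever padding exists.
padded = s != s.str.strip()
print(padded.sum())

# repr() puts quotes around each value, which makes the spaces visible.
print([repr(v) for v in s.unique()])
```

A non-zero count from `padded.sum()` is a hint that exact string comparisons on that column may silently match nothing.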
