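I am working with a dataframe in which some columns have a missing category denoted by '?'. I am trying to use a boolean mask to select the missing '?' category in the column labeled workclass and replace it with 'Private'. The data is read in as: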
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *
url2="https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" #Reading in Data from a freely and easily available source on the internet
Adult = pd.read_csv(url2, header=None)
##Assigning column names to the dataframe
Adult.columns = ["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]
I have tried running the code:
MissingValue = Adult.loc[:, "workclass"] == "?"
Adult.loc[MissingValue, "workclass"] = "Private"
and
Adult.loc[ Adult.loc[:, "workclass"] == "?", "workclass"] = "Private"
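When I run the code I get no errors, but when I check the column values with (Adult.loc[:,'workclass'].value_counts()), the '?' is still there. The code: Adult['workclass'] = Adult['workclass'].str.replace('?', 'Private')
does what I am trying to accomplish, but I would like to be able to do it with a boolean mask. Any suggestions as to why this happens?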
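The problem is that your values do not match '?' exactly, but are probably something like ' ?' (with a leading space).
You can see this because: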
Adult.loc[Adult['workclass']=='?',:]
returns an empty dataframe, while
Adult.loc[Adult['workclass'].str.strip()=='?',:]
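returns 1836 rows.
strip removes leading and trailing whitespace, so you do not have to test for ' ?', '? ', ' ? ', etc.
So when you change your code slightly, like this: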
MissingValue = Adult.loc[:, "workclass"].str.strip() == "?"
Adult.loc[MissingValue, "workclass"] = "Private"
you will see that "?" has disappeared from value_counts().
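The difference between the exact comparison and the stripped comparison can be sketched on a small toy Series. The values below are made up for illustration and only stand in for the workclass column; this is a minimal sketch, not the actual dataset.

```python
import pandas as pd

# Toy stand-in for the workclass column (illustrative values, not real data).
workclass = pd.Series([" Private", " ?", "? ", " ? ", " State-gov"])

# Exact comparison: no value is exactly "?", so nothing is selected.
exact_mask = workclass == "?"

# Comparison after strip(): every padded variant of "?" is selected.
stripped_mask = workclass.str.strip() == "?"

print(exact_mask.sum())     # number of rows matched by the exact comparison
print(stripped_mask.sum())  # number of rows matched after stripping

# Assign through the stripped mask; the other values keep their padding
# because strip() was only used to build the mask, not stored back.
cleaned = workclass.copy()
cleaned.loc[stripped_mask] = "Private"
print(cleaned.tolist())
```

Note that this only fixes the selected rows: the remaining categories still carry their leading space, since the stripped values were never written back to the column.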
There are spaces after the separator, so add the skipinitialspace parameter:
Adult = pd.read_csv(url2, header=None, skipinitialspace=True)
Then your code works correctly:
MissingValue = Adult["workclass"] == "?"
Adult.loc[MissingValue, "workclass"] = "Private"
print ((Adult['workclass'].value_counts().index.tolist()))
['Private', 'Self-emp-not-inc', 'Local-gov', 'State-gov',
'Self-emp-inc', 'Federal-gov', 'Without-pay', 'Never-worked']
print ((Adult['workclass'].value_counts()))
Private 24532
Self-emp-not-inc 2541
Local-gov 2093
State-gov 1298
Self-emp-inc 1116
Federal-gov 960
Without-pay 14
Never-worked 7
Name: workclass, dtype: int64
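The effect of skipinitialspace can also be checked without downloading the file. The sketch below assumes a tiny in-memory CSV with made-up rows in the same "comma followed by a space" layout; the row values and the variable names raw, default_df and trimmed_df are illustrative only.

```python
import io
import pandas as pd

# Made-up rows using ", " as the separator, as in the file discussed above.
raw = "30, State-gov, 100\n40, ?, 200\n50, Private, 300\n"

# Default parsing keeps the space that follows each comma.
default_df = pd.read_csv(io.StringIO(raw), header=None)

# skipinitialspace=True tells the parser to drop spaces after the delimiter.
trimmed_df = pd.read_csv(io.StringIO(raw), header=None, skipinitialspace=True)

print(default_df[1].tolist())  # values still carry the leading space
print(trimmed_df[1].tolist())  # values are clean

# With clean values, the plain boolean comparison selects the "?" row.
missing = trimmed_df[1] == "?"
trimmed_df.loc[missing, 1] = "Private"
print(trimmed_df[1].tolist())
```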
To verify the leading spaces, using your original code:
Adult = pd.read_csv(url2, header=None)
#Assigning column names to the dataframe
Adult.columns = ["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]
print ((Adult['workclass'].value_counts().index.tolist()))
[' Private', ' Self-emp-not-inc', ' Local-gov', ' ?', ' State-gov',
' Self-emp-inc', ' Federal-gov', ' Without-pay', ' Never-worked']
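If re-reading the file is not convenient, one possible alternative is to strip the whitespace from every text column of the frame that is already loaded, and then apply the plain boolean comparison. This is a sketch under assumptions: the small frame below is made up and merely stands in for Adult after a default read_csv, and the column detection relies on pd.api.types.is_string_dtype.

```python
import pandas as pd

# Made-up frame standing in for Adult after a default read_csv.
df = pd.DataFrame({
    "age": [30, 40, 50],
    "workclass": [" State-gov", " ?", " Private"],
    "sex": [" Male", " Female", " Male"],
})

# Collect the columns that hold strings, leaving numeric columns alone.
text_cols = [c for c in df.columns if pd.api.types.is_string_dtype(df[c])]

# Write the stripped values back so every category loses its padding.
for col in text_cols:
    df[col] = df[col].str.strip()

# The exact comparison now matches, so the boolean mask works as intended.
df.loc[df["workclass"] == "?", "workclass"] = "Private"
print(df["workclass"].tolist())
```

Unlike stripping only inside the mask, this also normalizes the other categories, so later comparisons such as df["workclass"] == "Private" behave as expected.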