I'm trying to recode some categorical labels. Here is my minimal example.
import pandas as pd

testDict = {'Col1': pd.Categorical(["a", "b", "c", "d", "e"]),
            'Col2': pd.Categorical(["1", "2", "3", "4", "5"])}
testDF = pd.DataFrame.from_dict(testDict)
testDF
testDF['Col1'].value_counts()

def letter_recode(Col1):
    # Merge "a"/"b" and "c"/"d"; leave every other label unchanged
    if (Col1 == "a") | (Col1 == "b"):
        return "ab"
    elif (Col1 == "c") | (Col1 == "d"):
        return "cd"
    else:
        return Col1

testDF['Col3'] = testDF['Col1'].apply(letter_recode)
testDF['Col3'].value_counts()
testDF
I want to change this df:
  Col1 Col2
0    a    1
1    b    2
2    c    3
3    d    4
4    e    5
to this:
  Col1 Col2 Col3
0    a    1   ab
1    b    2   ab
2    c    3   cd
3    d    4   cd
4    e    5    e
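The approach above works, but when I try this code on my actual dataframe, nothing changes. Also, when I take a small slice of my dataframe and run the code on it, I get the warning below, and I don't understand the documentation it points to.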
df5 = df.loc[0:4,:]
df5
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary workclassR
0 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K Self-emp-not-inc
1 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K Private
2 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K Private
3 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K Private
4 37 Private 284582 Masters 14 Married-civ-spouse Exec-managerial Wife White Female 0 0 40 United-States <=50K Private
def rename_workclass(wc):
    # Collapse related workclass labels; leave every other label unchanged
    if (wc == "Never-worked") | (wc == "Without-pay"):
        return "Unemployed"
    elif (wc == "State-gov") | (wc == "Local-gov"):
        return "Gov"
    elif (wc == "Self-emp-inc") | (wc == "Self-emp-not-inc"):
        return "Self-emp"
    else:
        return wc

df5['workclassR'] = df5['workclass'].apply(rename_workclass)
C:\Users\karol\Anaconda3\lib\site-packages\ipykernel_launcher.py:12: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if sys.path[0] == '':
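Thank you very much for the help. My problem was whitespace in front of the values: I was comparing them against strings that had no leading space, so nothing ever matched. In addition, the warning above can be silenced by declaring that the sliced dataset is not a copy: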
df5 = df.iloc[0:4, :]  # first four rows, selected by position
df5.is_copy = False
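Here is a minimal sketch of both fixes together, in case it helps someone with the same problem. It assumes the values carry a leading space (the sample frame is made up for illustration), and it swaps in an explicit `.copy()` instead of setting `is_copy`, since that attribute is, as far as I know, deprecated in later pandas releases. The recode function is a condensed version of `rename_workclass` from the question.

```python
import pandas as pd

# Hypothetical stand-in for the real data: every value has a leading space.
df = pd.DataFrame({"workclass": [" Self-emp-not-inc", " Private", " State-gov",
                                 " Never-worked", " Local-gov"]})

def rename_workclass(wc):
    # Condensed version of the function from the question
    if wc in ("Never-worked", "Without-pay"):
        return "Unemployed"
    if wc in ("State-gov", "Local-gov"):
        return "Gov"
    if wc in ("Self-emp-inc", "Self-emp-not-inc"):
        return "Self-emp"
    return wc

# An explicit copy makes df5 an independent frame, so adding a column to it
# does not raise SettingWithCopyWarning and does not touch df.
df5 = df.iloc[0:4, :].copy()

# Strip surrounding whitespace before comparing against the clean labels.
df5["workclassR"] = df5["workclass"].str.strip().apply(rename_workclass)
print(df5)
```

Without the `.str.strip()` call every comparison fails and the function falls through to `return wc`, which is why nothing appeared to change on the real dataframe.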
You can use pd.Series.map with a dictionary, then use fillna with the original series to restore the values that the dictionary does not cover:
import pandas as pd
df = pd.DataFrame({'Col1': pd.Categorical(["a", "b", "c", "d", "e"]),
                   'Col2': pd.Categorical(["1", "2", "3", "4", "5"])})
mapper = {'a': 'ab', 'b': 'ab', 'c': 'cd', 'd': 'cd'}
df['Col3'] = df['Col1'].map(mapper).fillna(df['Col1'])
print(df['Col3'].value_counts())
cd 2
ab 2
e 1
Name: Col3, dtype: int64
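Applied to the workclass column from the question, the same pattern might look like the sketch below. The sample series is made up for illustration and assumes the labels have already been stripped of whitespace; the groupings are the ones from the question's `rename_workclass` function.

```python
import pandas as pd

# Hypothetical sample of workclass values, already stripped of whitespace.
wc = pd.Series(["Self-emp-not-inc", "Private", "State-gov",
                "Without-pay", "Local-gov", "Self-emp-inc"], name="workclass")

# Only the labels that should be merged are listed; any label missing from
# the dictionary comes back from map() as NaN.
mapper = {"Never-worked": "Unemployed", "Without-pay": "Unemployed",
          "State-gov": "Gov", "Local-gov": "Gov",
          "Self-emp-inc": "Self-emp", "Self-emp-not-inc": "Self-emp"}

# fillna(wc) puts the original label back wherever map() produced NaN.
recoded = wc.map(mapper).fillna(wc)
print(recoded.tolist())
```

Keeping the groupings in a dictionary means adding a new category is a one-line change, with no extra branch in a function.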
Try using pd.Series.map(). Here is a toy example:
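import pandas as pd

# Assumed input series, inferred from the output shown below
s = pd.Series(["Private", "Public", "?"])
s = s.map({"Private": "Private-changed",
           "Public": "Public_changed",
           "?": "What is this"})
s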
This gives you:
0 Private-changed
1 Public_changed
2 What is this
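One thing to watch with this approach: a value that is not a key in the dictionary becomes NaN after map(). The sketch below contrasts that with pd.Series.replace, which leaves unlisted values alone. The extra "Other" entry is made up purely to show the difference.

```python
import pandas as pd

# "Other" is a hypothetical extra value that the dictionary does not cover.
s = pd.Series(["Private", "Public", "?", "Other"])
changes = {"Private": "Private-changed",
           "Public": "Public_changed",
           "?": "What is this"}

mapped = s.map(changes)        # "Other" is not a key, so it becomes NaN
replaced = s.replace(changes)  # "Other" is not a key, so it stays as it is

print(mapped)
print(replaced)
```

Which one fits depends on whether unmapped values should be flagged as missing or carried through unchanged.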