Recoding categorical labels in Python pandas



I am struggling to recode some categorical labels. Below is my minimal example.

import pandas as pd

testDict = {'Col1': pd.Categorical(["a", "b", "c", "d", "e"]),
            'Col2': pd.Categorical(["1", "2", "3", "4", "5"])}
testDF = pd.DataFrame.from_dict(testDict)
testDF
testDF['Col1'].value_counts()

def letter_recode(Col1):
    if (Col1 == "a") | (Col1 == "b"):
        return "ab"
    elif (Col1 == "c") | (Col1 == "d"):
        return "cd"
    else:
        return Col1

testDF['Col3'] = testDF['Col1'].apply(letter_recode)
testDF['Col3'].value_counts()
testDF

I want to change this DataFrame:

Col1 Col2
0   a   1
1   b   2
2   c   3
3   d   4
4   e   5

to this:

Col1 Col2 Col3
0   a   1   ab
1   b   2   ab
2   c   3   cd
3   d   4   cd
4   e   5   e

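The approach above works, but when I try this code on my actual DataFrame, nothing changes. In addition, when I create a small subset of my DataFrame and run the code, I get the warning below, and I do not understand the documentation it points to.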

df5 = df.loc[0:4,:]
df5
age workclass   fnlwgt  education   education-num   marital-status  occupation  relationship    race    sex capital-gain    capital-loss    hours-per-week  native-country  salary  workclassR
0   50  Self-emp-not-inc    83311   Bachelors   13  Married-civ-spouse  Exec-managerial Husband White   Male    0   0   13  United-States   <=50K   Self-emp-not-inc
1   38  Private 215646  HS-grad 9   Divorced    Handlers-cleaners   Not-in-family   White   Male    0   0   40  United-States   <=50K   Private
2   53  Private 234721  11th    7   Married-civ-spouse  Handlers-cleaners   Husband Black   Male    0   0   40  United-States   <=50K   Private
3   28  Private 338409  Bachelors   13  Married-civ-spouse  Prof-specialty  Wife    Black   Female  0   0   40  Cuba    <=50K   Private
4   37  Private 284582  Masters 14  Married-civ-spouse  Exec-managerial Wife    White   Female  0   0   40  United-States   <=50K   Private
def rename_workclass(wc):
    if (wc == "Never-worked") | (wc == "Without-pay"):
        return "Unemployed"
    elif (wc == "State-gov") | (wc == "Local-gov"):
        return "Gov"
    elif (wc == "Self-emp-inc") | (wc == "Self-emp-not-inc"):
        return "Self-emp"
    else:
        return wc

df5['workclassR'] = df5['workclass'].apply(rename_workclass)

C:\Users\karol\Anaconda3\lib\site-packages\ipykernel_launcher.py:12: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if sys.path[0] == '':

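Thank you very much for your help. My problem was whitespace in front of the values: I was comparing them to strings that had no leading space. In addition, the warning above can be silenced by declaring that the sliced dataset is not a copy: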

df5 = df.iloc[0:4, :]  # select the first four rows by position
df5.is_copy = False    # note: the is_copy attribute was deprecated in later pandas releases
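
The two fixes described above (stripping the leading whitespace and avoiding the copy warning) can be sketched together. This is a minimal illustration, not the asker's actual code: the small `df` below is a hypothetical stand-in for the real data, and `.copy()` is swapped in for `is_copy = False` as the commonly used way to get an independent frame.

```python
import pandas as pd

# Hypothetical stand-in for the real data: note the leading space in each label.
df = pd.DataFrame({"workclass": [" Self-emp-not-inc", " Private", " State-gov",
                                 " Never-worked", " Local-gov"]})

# .copy() returns an independent DataFrame, so adding a column to it
# does not trigger SettingWithCopyWarning.
df5 = df.iloc[0:5, :].copy()

# Strip surrounding whitespace before comparing against the clean labels.
clean = df5["workclass"].str.strip()

mapper = {
    "Never-worked": "Unemployed", "Without-pay": "Unemployed",
    "State-gov": "Gov", "Local-gov": "Gov",
    "Self-emp-inc": "Self-emp", "Self-emp-not-inc": "Self-emp",
}

# Labels missing from the mapper become NaN and are filled with the stripped original.
df5["workclassR"] = clean.map(mapper).fillna(clean)
print(df5["workclassR"].tolist())
```

Without the `.str.strip()` step, none of the labels in this stand-in data would match the dictionary keys, which is consistent with the "nothing changes" symptom described in the question.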

You can use pd.Series.map with a dictionary, then fillna with the original series:

import pandas as pd
df = pd.DataFrame({'Col1': pd.Categorical(["a", "b", "c", "d", "e"]),
                   'Col2': pd.Categorical(["1", "2", "3", "4", "5"])})
mapper = {'a': 'ab', 'b': 'ab', 'c': 'cd', 'd': 'cd'}
df['Col3'] = df['Col1'].map(mapper).fillna(df['Col1'])
print(df['Col3'].value_counts())
cd    2
ab    2
e     1
Name: Col3, dtype: int64
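
If the recoded column should itself be categorical, one possible follow-up is to convert it explicitly, since the result of map plus fillna may come back as a plain object column. The sketch below assumes it is acceptable to work on string copies of the labels; the intermediate name `labels` is introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({"Col1": pd.Categorical(["a", "b", "c", "d", "e"])})
mapper = {"a": "ab", "b": "ab", "c": "cd", "d": "cd"}

# Work on plain strings, fall back to the original label for unmapped values,
# then convert the recoded column back to a categorical dtype.
labels = df["Col1"].astype(str)
df["Col3"] = labels.map(mapper).fillna(labels).astype("category")

print(df["Col3"].cat.categories.tolist())
```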

Try using pd.Series.map(). A toy example:

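import pandas as pd

# Assumed input, chosen to match the output shown below
s = pd.Series(["Private", "Public", "?"])
s = s.map({"Private": "Private-changed",
           "Public": "Public_changed",
           "?": "What is this"})
s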

This gives you:

0    Private-changed
1     Public_changed
2       What is this
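
One caveat with mapping through a dictionary is that values without a matching key come back as NaN. A small sketch of that behaviour, and of one way to keep unmapped values unchanged, follows; the extra label "Other" and the names `mapping`, `as_nan`, and `kept` are made up for illustration.

```python
import pandas as pd

# "Other" is a made-up extra label with no entry in the mapping.
s = pd.Series(["Private", "Public", "?", "Other"])
mapping = {"Private": "Private-changed",
           "Public": "Public_changed",
           "?": "What is this"}

# With a plain dict, labels that are not keys become NaN.
as_nan = s.map(mapping)

# Passing a function that falls back to the original value keeps them instead.
kept = s.map(lambda v: mapping.get(v, v))

print(as_nan.isna().tolist())
print(kept.tolist())
```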
