Pandas: label-encode a column with a default label for invalid row values



For a DataFrame, I replaced a set of items in one column with a list of values, like this:

df['borough_num'] = df['Borough'].replace(regex=['MANHATTAN', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND','BRONX'], value=[1, 2, 3, 4,5])

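The problem is that I want to replace every other element in "Borough" that is not listed above with the value 0. I also need to use regex, because the data contains values such as 07 BRONX, which should be replaced with 5 rather than 0.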

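To replace all other values with 0, you can do the following (note that this looks up exact strings only, so a value such as 07 BRONX would also fall back to 0):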

# create the mapping: borough name -> code 1..5
new_values = ['MANHATTAN', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND','BRONX']
maps = dict(zip(new_values, range(1, len(new_values) + 1)))
# map the values; names missing from the dict fall back to 0
df['borough_num'] = df['Borough'].apply(lambda x: maps.get(x, 0))

Use map followed by fillna: every value that is not in the mapping dict comes back as NaN, and fillna then replaces those with 0.

df.Borough.map(dict(zip(['QUEENS', 'BRONX'],[1,2]))).fillna(0).astype(int)
0    1
1    2
2    2
3    0
Name: Borough, dtype: int32
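The map and fillna idea above can be extended to prefixed values such as 07 BRONX by first extracting the borough name with a regex. This is only a sketch under assumptions: the sample rows and the names `codes`, `pattern` and `matched` are made up for illustration, and the keys are assumed to contain no regex metacharacters.

```python
import pandas as pd

# Hypothetical sample data; "07 BRONX" mimics the prefixed values in the question.
df = pd.DataFrame({"Borough": ["QUEENS", "07 BRONX", "BRONX", "Unspecified", None]})

keys = ["MANHATTAN", "BROOKLYN", "QUEENS", "STATEN ISLAND", "BRONX"]
codes = {name: i for i, name in enumerate(keys, start=1)}

# One alternation pattern; the single capture group returns the matched name.
pattern = "(" + "|".join(keys) + ")"
matched = df["Borough"].str.extract(pattern, expand=False)

# Rows without a match are NaN after extract and map, so fillna(0) supplies the default.
df["borough_num"] = matched.map(codes).fillna(0).astype(int)
print(df)
```

Extracting first keeps the lookup itself a plain dict mapping, so the regex is only responsible for finding the name, not for assigning the code.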

I see that you want a categorical encoding with an enforced order. I suggest using pd.Categorical with ordered=True:

df = pd.DataFrame({
'Borough': ['QUEENS', 'BRONX', 'MANHATTAN', 'BROOKLYN', 'INVALID']})
df
Borough
0     QUEENS
1      BRONX
2  MANHATTAN
3   BROOKLYN
4    INVALID
keys = ['MANHATTAN', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND','BRONX']
df['borough_num'] = pd.Categorical(
df['Borough'], categories=keys, ordered=True).codes+1
df
Borough  borough_num
0     QUEENS            3
1      BRONX            5
2  MANHATTAN            1
3   BROOKLYN            2
4    INVALID            0

pd.Categorical encodes invalid strings (values not in categories) as -1, which is why adding 1 turns them into 0:

pd.Categorical(
df['Borough'], categories=keys, ordered=True).codes      
array([ 2,  4,  0,  1, -1], dtype=int8)
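As a small aside, the codes appear to depend only on the order of the categories list, not on the ordered flag; ordered=True additionally enables comparisons and min/max on the categorical. A minimal sketch, with a made-up `values` list:

```python
import pandas as pd

keys = ["MANHATTAN", "BROOKLYN", "QUEENS", "STATEN ISLAND", "BRONX"]
values = ["QUEENS", "BRONX", "INVALID"]  # hypothetical sample values

ordered = pd.Categorical(values, categories=keys, ordered=True)
unordered = pd.Categorical(values, categories=keys)

# Both give the position of each value in keys, with -1 for values not listed.
print(ordered.codes)
print(unordered.codes)
```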

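In any case, this should be considerably faster than using replace. For reference, you can also use map with a dictionary that supplies a default value: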

from collections import defaultdict
d = defaultdict(int)
# start the codes at 1 so that 0 stays reserved for the default
d.update(dict(zip(keys, range(1, len(keys) + 1))))
df['borough_num'] = df['Borough'].map(d)
df
Borough  borough_num
0     QUEENS            3
1      BRONX            5
2  MANHATTAN            1
3   BROOKLYN            2
4    INVALID            0
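To see why a defaultdict is used with map rather than a plain dict, here is a minimal comparison. The two-row Series and the names `plain` and `with_default` are made up for illustration.

```python
from collections import defaultdict

import pandas as pd

s = pd.Series(["QUEENS", "INVALID"])  # hypothetical sample values

plain = {"QUEENS": 3}
with_default = defaultdict(int, plain)  # missing keys yield int() == 0

# With a plain dict, a missing key becomes NaN.
print(s.map(plain))

# With a dict subclass that defines __missing__, map uses the default instead.
print(s.map(with_default))
```

With the defaultdict, no separate fillna step is needed and the result stays an integer column.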

You can also use np.where.

Create a dummy DataFrame:

df = pd.DataFrame({'Borough': ['MANHATTAN', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND','BRONX', 'TEST']})
df
Borough
0   MANHATTAN
1   BROOKLYN
2   QUEENS
3   STATEN ISLAND
4   BRONX
5   TEST

Your operation (unmatched values such as TEST are left unchanged):

df['borough_num'] = df['Borough'].replace(regex=['MANHATTAN', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND','BRONX'], value=[1, 2, 3, 4,5])
df
Borough   borough_num
0   MANHATTAN       1
1   BROOKLYN        2 
2   QUEENS          3
3   STATEN ISLAND   4
4   BRONX           5
5   TEST           TEST

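Using np.where, replace the values of the Borough column that are not in keys with 0. Note that this overwrites Borough itself, so borough_num still holds TEST in the last row: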

keys = ['MANHATTAN', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND','BRONX']
df['Borough'] = np.where(~df['Borough'].isin(keys), 0 ,df['Borough'])
df
Borough    borough_num
0   MANHATTAN       1
1   BROOKLYN        2
2   QUEENS          3
3   STATEN ISLAND   4
4   BRONX           5
5   0             TEST
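If the goal is to put the default 0 into borough_num and to match prefixed values such as 07 BRONX as well, one possible variation is np.select with one str.contains mask per key. This is a sketch, not the approach of the answer above; the sample rows and the names `conditions` and `choices` are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, including a prefixed value and an invalid one.
df = pd.DataFrame({"Borough": ["MANHATTAN", "07 BRONX", "STATEN ISLAND", "TEST"]})
keys = ["MANHATTAN", "BROOKLYN", "QUEENS", "STATEN ISLAND", "BRONX"]

# One boolean mask per key; na=False treats missing values as non-matches.
conditions = [
    df["Borough"].str.contains(k, regex=True, na=False).to_numpy(dtype=bool)
    for k in keys
]
choices = list(range(1, len(keys) + 1))

# np.select takes the code of the first matching condition, otherwise the default 0.
df["borough_num"] = np.select(conditions, choices, default=0)
print(df)
```

This leaves the original Borough column untouched and writes only integer codes to borough_num.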
