Pandas: category dtype and filter

Using pandas 0.18.1, I noticed different behavior when filtering on a column with dtype category. Here is a minimal example.

import pandas as pd
import numpy as np
l = np.random.randint(1, 4, 50)
df = pd.DataFrame(dict(c_type=l, i_type=l))
df['c_type'] = df.c_type.astype('category')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 2 columns):
c_type    50 non-null category
i_type    50 non-null int64
dtypes: category(1), int64(1)
memory usage: 554.0 bytes

Filtering the integer column down to some of its values gives

df[df.i_type.isin([1, 2])].i_type.value_counts()
2    20
1    17
Name: i_type, dtype: int64

But the same filter on the category column keeps the filtered-out value as an entry

df[df.c_type.isin([1, 2])].c_type.value_counts()
2    20
1    17
3     0
Name: c_type, dtype: int64

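Although the filter works, the behavior seems unusual to me. For example, one could use such a filter to keep certain values from becoming columns in a later pivot_table call, but with a category column that would need additional filtering.

Is this the expected behavior?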

If you check the docs on categoricals:

Series methods like Series.value_counts() will use all categories, even if some categories are not present in the data:

In [100]: s = pd.Series(pd.Categorical(["a","b","c","c"], categories=["c","a","b","d"]))
In [101]: s.value_counts()
Out[101]: 
c    2
b    1
a    1
d    0
dtype: int64

So if you filter by 5 (a value that does not exist), you get 0 for every category:

print (df[df.c_type.isin([5])].c_type.value_counts())
3    0
2    0
1    0
Name: c_type, dtype: int64
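
If the zero-count entries are unwanted, one possible approach is to drop the unused categories after filtering with Series.cat.remove_unused_categories(). The following is only a sketch: the fixed toy values and the names filtered, trimmed and counts are assumptions for illustration (the question uses random integers), and it has not been checked against pandas 0.18.1 specifically.

```python
import numpy as np
import pandas as pd

# Deterministic toy data (an assumption; the question uses np.random.randint).
values = np.array([1, 2, 3, 1, 2, 2, 3, 1])
df = pd.DataFrame({"c_type": values})
df["c_type"] = df["c_type"].astype("category")

# Same kind of filter as in the question.
filtered = df[df["c_type"].isin([1, 2])]

# The filtered column still carries all three categories, which is why
# value_counts() reports a zero for the value 3.
print(filtered["c_type"].cat.categories.tolist())

# Drop the categories that no longer occur in the filtered data.
trimmed = filtered["c_type"].cat.remove_unused_categories()
counts = trimmed.value_counts()
print(counts)
```

After this step, value_counts() reports only the categories that are still present, so the filtered category column behaves like the integer column in the question.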
