I have a numpy recarray whose records have fields of different data types (dtypes).
import numpy as np
a = np.array([1,2,3,4], dtype=int)
b = np.array([6,6,6,6], dtype=int)
c = np.array(['p', 'q', 'r', 's'], dtype=object)
d = np.array(['a', 'b', 'c', 'd'], dtype=object)
X = np.rec.fromarrays([a, b, c, d], names=['a', 'b', 'c', 'd'])
X
>>> rec.array([(1, 6, 'p', 'a'), (2, 6, 'q', 'b'), (3, 6, 'r', 'c'),
(4, 6, 's', 'd')],
dtype=[('a', '<i8'), ('b', '<i8'), ('c', 'O'), ('d', 'O')])
I tried to use select_dtypes to pick out the fields with object dtype, but I get an AttributeError:
X.select_dtypes(include='object')
>>>AttributeError: recarray has no attribute select_dtypes
Is there an equivalent of select_dtypes for numpy recarrays, so that I can select the columns of a specific dtype?
In [74]: X
Out[74]:
rec.array([(1, 6, 'p', 'a'), (2, 6, 'q', 'b'), (3, 6, 'r', 'c'),
(4, 6, 's', 'd')],
dtype=[('a', '<i4'), ('b', '<i4'), ('c', 'O'), ('d', 'O')])
A recarray lets you access its fields either as attributes or by indexing with the field name:
In [75]: X.a
Out[75]: array([1, 2, 3, 4])
In [76]: X['a']
Out[76]: array([1, 2, 3, 4])
In [77]: X.dtype.fields
Out[77]:
mappingproxy({'a': (dtype('int32'), 0),
'b': (dtype('int32'), 4),
'c': (dtype('O'), 8),
'd': (dtype('O'), 16)})
Testing the pandas approach:
In [78]: import pandas as pd
In [79]: df=pd.DataFrame(X)
In [80]: df
Out[80]:
a b c d
0 1 6 p a
1 2 6 q b
2 3 6 r c
3 4 6 s d
In [83]: df.select_dtypes(include=object)
Out[83]:
c d
0 p a
1 q b
2 r c
3 s d
Exploring the dtype:
In [84]: X.dtype
Out[84]: dtype((numpy.record, [('a', '<i4'), ('b', '<i4'), ('c', 'O'), ('d', 'O')]))
In [85]: X.dtype.fields
Out[85]:
mappingproxy({'a': (dtype('int32'), 0),
'b': (dtype('int32'), 4),
'c': (dtype('O'), 8),
'd': (dtype('O'), 16)})
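As a minimal sketch of how that mapping can be used, the dtype of every field can be read straight from X.dtype.fields without extracting the field data. The small array below is a stand-in for the one in the question, and the names field_dtypes and is_object are only illustrative.

```python
import numpy as np

# Stand-in recarray with one integer field and one object field
X = np.rec.fromarrays(
    [np.array([1, 2, 3, 4]), np.array(['p', 'q', 'r', 's'], dtype=object)],
    names=['a', 'c'],
)

# Each value in dtype.fields is a tuple: the first item is the field's dtype,
# the second is its byte offset within a record.
field_dtypes = {name: info[0] for name, info in X.dtype.fields.items()}

# Compare each field dtype against object, as in X['c'].dtype == object
is_object = {name: dt == object for name, dt in field_dtypes.items()}
print(is_object)
```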
Checking the dtype field by field:
In [89]: X['a'].dtype
Out[89]: dtype('int32')
In [90]: X['c'].dtype
Out[90]: dtype('O')
In [91]: X['c'].dtype == object
Out[91]: True
So a list comprehension works:
In [93]: [name for name in X.dtype.names if X[name].dtype==object]
Out[93]: ['c', 'd']
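The list of names can then be used to index the recarray, since indexing with a list of field names keeps only those fields. The sketch below wraps the comprehension in a helper; select_fields_by_dtype is a hypothetical name, not a numpy function, and the array is a reduced version of the one in the question.

```python
import numpy as np

def select_fields_by_dtype(rec, dtype):
    """Hypothetical helper: names of the fields of `rec` whose dtype equals `dtype`."""
    target = np.dtype(dtype)
    # rec.dtype.fields[name][0] is the dtype of that field
    return [name for name in rec.dtype.names
            if rec.dtype.fields[name][0] == target]

a = np.array([1, 2, 3, 4], dtype=int)
c = np.array(['p', 'q', 'r', 's'], dtype=object)
d = np.array(['a', 'b', 'c', 'd'], dtype=object)
X = np.rec.fromarrays([a, c, d], names=['a', 'c', 'd'])

object_names = select_fields_by_dtype(X, object)

# Multi-field indexing: the result has only the selected fields
X_obj = X[object_names]
print(X_obj.dtype.names)
```

This assumes at least one field matches; with an empty list of names, X[[]] is treated as an empty row index rather than a field selection.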
df.select_dtypes is written in Python, but it is fairly complex because it handles both include and exclude lists. Comparing timings:
In [95]: timeit [name for name in X.dtype.names if X[name].dtype==object]
16.5 µs ± 269 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [96]: timeit df.select_dtypes(include=object)
110 µs ± 2.24 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
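If include and exclude lists are wanted for a recarray as well, a rough analogue can be built on np.issubdtype, which also matches abstract categories such as np.number. This is only a simplified sketch, not how pandas implements select_dtypes; select_field_names is a hypothetical name, and the float field 'b' is added here purely for illustration.

```python
import numpy as np

def select_field_names(rec, include=None, exclude=None):
    """Hypothetical, simplified include/exclude filter over field dtypes."""
    def matches(field_dtype, type_list):
        # True if the field dtype falls under any of the given types
        return any(np.issubdtype(field_dtype, t) for t in type_list)

    names = []
    for name in rec.dtype.names:
        field_dtype = rec.dtype.fields[name][0]
        if include is not None and not matches(field_dtype, include):
            continue
        if exclude is not None and matches(field_dtype, exclude):
            continue
        names.append(name)
    return names

X = np.rec.fromarrays(
    [np.array([1, 2, 3, 4]),
     np.array([0.5, 1.5, 2.5, 3.5]),
     np.array(['p', 'q', 'r', 's'], dtype=object)],
    names=['a', 'b', 'c'],
)

numeric_names = select_field_names(X, include=[np.number])
integer_names = select_field_names(X, include=[np.integer])
non_object_names = select_field_names(X, exclude=[object])
print(numeric_names, integer_names, non_object_names)
```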