Selecting fields of a specific data type from a numpy recarray



I have a numpy recarray whose fields have different data types (dtypes).

import numpy as np
a = np.array([1,2,3,4], dtype=int)
b = np.array([6,6,6,6], dtype=int)
c = np.array(['p', 'q', 'r', 's'], dtype=object)
d = np.array(['a', 'b', 'c', 'd'], dtype=object)
X = np.rec.fromarrays([a, b, c, d], names=['a', 'b', 'c', 'd'])
X
>>> rec.array([(1, 6, 'p', 'a'), (2, 6, 'q', 'b'), (3, 6, 'r', 'c'),
(4, 6, 's', 'd')],
dtype=[('a', '<i8'), ('b', '<i8'), ('c', 'O'), ('d', 'O')])

I tried using select_dtypes to select the fields with object dtype, but I get an AttributeError:

X.select_dtypes(include='object')
>>>AttributeError: recarray has no attribute select_dtypes

Is there an equivalent of select_dtypes for numpy recarrays that lets me select the columns of a specific dtype?

In [74]: X
Out[74]: 
rec.array([(1, 6, 'p', 'a'), (2, 6, 'q', 'b'), (3, 6, 'r', 'c'),
(4, 6, 's', 'd')],
dtype=[('a', '<i4'), ('b', '<i4'), ('c', 'O'), ('d', 'O')])

A recarray lets you access fields either as attributes or by indexing:

In [75]: X.a
Out[75]: array([1, 2, 3, 4])    
In [76]: X['a']
Out[76]: array([1, 2, 3, 4])
In [77]: X.dtype.fields
Out[77]: 
mappingproxy({'a': (dtype('int32'), 0),
'b': (dtype('int32'), 4),
'c': (dtype('O'), 8),
'd': (dtype('O'), 16)})

Testing the pandas approach:

In [78]: import pandas as pd
In [79]: df=pd.DataFrame(X)
In [80]: df
Out[80]: 
a  b  c  d
0  1  6  p  a
1  2  6  q  b
2  3  6  r  c
3  4  6  s  d
In [83]: df.select_dtypes(include=object)
Out[83]: 
c  d
0  p  a
1  q  b
2  r  c
3  s  d

Exploring the dtype:

In [84]: X.dtype
Out[84]: dtype((numpy.record, [('a', '<i4'), ('b', '<i4'), ('c', 'O'), ('d', 'O')]))
In [85]: X.dtype.fields
Out[85]: 
mappingproxy({'a': (dtype('int32'), 0),
'b': (dtype('int32'), 4),
'c': (dtype('O'), 8),
'd': (dtype('O'), 16)})
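
Since each entry in X.dtype.fields is a tuple whose first item is the field's dtype, the same information can be read from the dtype alone, without pulling out any column data. A minimal sketch, rebuilding X as in the question (the variable name object_fields is only for illustration):

```python
import numpy as np

a = np.array([1, 2, 3, 4], dtype=int)
b = np.array([6, 6, 6, 6], dtype=int)
c = np.array(['p', 'q', 'r', 's'], dtype=object)
d = np.array(['a', 'b', 'c', 'd'], dtype=object)
X = np.rec.fromarrays([a, b, c, d], names=['a', 'b', 'c', 'd'])

# X.dtype.fields[name] is (field_dtype, byte_offset); compare the dtype part only
object_fields = [name for name in X.dtype.names
                 if X.dtype.fields[name][0] == object]
print(object_fields)
```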

Checking the dtype field by field:

In [89]: X['a'].dtype
Out[89]: dtype('int32')    
In [90]: X['c'].dtype
Out[90]: dtype('O')    
In [91]: X['c'].dtype == object
Out[91]: True

So a list comprehension works:

In [93]: [name for name in X.dtype.names if X[name].dtype==object]
Out[93]: ['c', 'd']
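
The list of names can then be used to index the recarray, which should give a recarray restricted to those fields. A sketch of that step, assuming a reasonably recent numpy in which indexing with a list of field names is supported (names and sub are illustrative variable names):

```python
import numpy as np

a = np.array([1, 2, 3, 4], dtype=int)
b = np.array([6, 6, 6, 6], dtype=int)
c = np.array(['p', 'q', 'r', 's'], dtype=object)
d = np.array(['a', 'b', 'c', 'd'], dtype=object)
X = np.rec.fromarrays([a, b, c, d], names=['a', 'b', 'c', 'd'])

# Collect the names of the object-dtype fields, as in the comprehension above
names = [name for name in X.dtype.names if X[name].dtype == object]

# Indexing with a list of field names keeps only those fields
sub = X[names]
print(sub.dtype.names)
print(sub['c'])
```

In recent numpy versions this kind of multi-field indexing returns a view on the original data rather than a copy, so modifying sub may also modify X.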

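df.select_dtypes is implemented in Python, but it is considerably more complex because it handles both include and exclude lists, which shows in the timings: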

In [95]: timeit [name for name in X.dtype.names if X[name].dtype==object]
16.5 µs ± 269 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [96]: timeit df.select_dtypes(include=object)
110 µs ± 2.24 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
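
If include and exclude lists are wanted for a recarray as well, something along these lines might do. This is only a rough sketch and not part of numpy: select_fields is a hypothetical helper, and it relies on np.issubdtype so that abstract types such as np.number or np.integer can be passed.

```python
import numpy as np

def select_fields(rec, include=None, exclude=None):
    """Hypothetical helper: return the field names of `rec` whose dtype
    matches any type in `include` and no type in `exclude`."""
    def matches(dt, kinds):
        return any(np.issubdtype(dt, kind) for kind in kinds)

    names = []
    for name in rec.dtype.names:
        dt = rec.dtype[name]  # dtype of this field
        if include is not None and not matches(dt, include):
            continue
        if exclude is not None and matches(dt, exclude):
            continue
        names.append(name)
    return names

a = np.array([1, 2, 3, 4], dtype=int)
b = np.array([6, 6, 6, 6], dtype=int)
c = np.array(['p', 'q', 'r', 's'], dtype=object)
d = np.array(['a', 'b', 'c', 'd'], dtype=object)
X = np.rec.fromarrays([a, b, c, d], names=['a', 'b', 'c', 'd'])

print(select_fields(X, include=[object]))
print(select_fields(X, include=[np.number]))
print(select_fields(X, exclude=[np.integer]))
```

The helper returns names rather than data, so the result can be used for indexing X or for looping over individual fields.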
