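I have an array in numpy, and while inspecting it I realized that some of the values in one particular column contain a string of gibberish.

For example, the suspect column is the second one, and it looks like this: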

['Joe', '200.00']
['Fred', 'adfdfddfds']
['Zhu', '5000.00']
['text_ok_here', '10.10']

(Note that the dtype is string.)

I would like to end up with:

['Joe', '200.00']
['Zhu', '5000.00']
['text_ok_here', '10.10']

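I need to delete every entire row in which my particular column holds a string that I cannot convert to a float.

Initially, I thought I could iterate over the column, collect the indices of the offending entries, and use them to subset my original array.

Something along these lines: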

for entry in my particular column:
    if <entry is a string, not a float>
        <delete that whole row of the matrix>

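But this won't work, because everything is a string regardless.

I keep getting stuck on the type conversion: I have no simple way to test for the gibberish. Also, even if I did find the right indices, I'm not sure how to do the subsetting.

I feel like this is a very common thing to do, cleaning an array, yet I'm having a surprising amount of difficulty getting it done.

Any suggestions, philosophies, etc. would be greatly appreciated.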

It matters a lot which dtype your data is in, but if it is float, int, or any other numeric dtype, boolean indexing will suffice.

Data file:

<temp.txt>
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 bad
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 bad
1 2 3 4 5 6
1 2 3 4 5 6

Solution:

In [9]:
A=np.genfromtxt('temp.txt')
A
Out[9]:
array([[  1.,   2.,   3.,   4.,   5.,   6.],
       [  1.,   2.,   3.,   4.,   5.,   6.],
       [  1.,   2.,   3.,   4.,   5.,  nan],
       [  1.,   2.,   3.,   4.,   5.,   6.],
       [  1.,   2.,   3.,   4.,   5.,   6.],
       [  1.,   2.,   3.,   4.,   5.,  nan],
       [  1.,   2.,   3.,   4.,   5.,   6.],
       [  1.,   2.,   3.,   4.,   5.,   6.]])
In [10]:
np.isfinite(A).all(1)  # True only when every cell in the row is a valid number
Out[10]:
array([ True,  True, False,  True,  True, False,  True,  True], dtype=bool)
In [11]:
A[np.isfinite(A).all(1)]
Out[11]:
array([[ 1.,  2.,  3.,  4.,  5.,  6.],
       [ 1.,  2.,  3.,  4.,  5.,  6.],
       [ 1.,  2.,  3.,  4.,  5.,  6.],
       [ 1.,  2.,  3.,  4.,  5.,  6.],
       [ 1.,  2.,  3.,  4.,  5.,  6.],
       [ 1.,  2.,  3.,  4.,  5.,  6.]])
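If only one column is suspect, the same mask can be restricted to that column instead of calling `.all(1)` on the whole row. Here is a minimal sketch of that variation; the inline data, `io.StringIO` (used in place of a file on disk), and the names `text`, `col`, and `cleaned` are my own assumptions for illustration:

```python
import io
import numpy as np

# Inline stand-in for a data file; "bad" marks the unparseable cells.
text = "1 2 3\n4 bad 6\n7 8 9\nbad 11 12\n"

# With the default float dtype, genfromtxt turns unparseable cells into nan.
A = np.genfromtxt(io.StringIO(text))

col = 1  # the only column we want to validate
mask = np.isfinite(A[:, col])  # True where that column holds a real number
cleaned = A[mask]

print(mask)
print(cleaned)
```

Note that the last row survives even though its first cell is nan, because only column `col` is tested.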

Edit

If the array is already made of strings, this is probably the easiest way:

In [40]:
%%file temp.txt
1000.00 200.00
4000.00 adfdfddfds
20.00 5000
text_ok_here 5000
Overwriting temp.txt
In [53]:
A=np.genfromtxt('temp.txt', dtype=str)
B=np.genfromtxt('temp.txt')
In [55]:
A[np.isfinite(B[:,1])]
Out[55]:
array([['1000.00', '200.00'],
       ['20.00', '5000'],
       ['text_ok_here', '5000']], 
      dtype='|S12')

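Basically, read the data into A as an array of strings, and read it again into B as floats, where anything that cannot be converted to a valid number becomes NaN. Then slice A based on B. Only column 1 of B is tested here, which is why the 'text_ok_here' row is kept.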
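If the string array is already in memory and you would rather not go back through a file, a similar mask can be built by simply trying the conversion on each entry. This is a rough sketch rather than the approach above; `is_float` is a hypothetical helper I made up, and the sample data is copied from the question:

```python
import numpy as np

def is_float(text):
    """Return True if float() accepts the string (hypothetical helper)."""
    try:
        float(text)
        return True
    except ValueError:
        return False

data = np.array([['Joe', '200.00'],
                 ['Fred', 'adfdfddfds'],
                 ['Zhu', '5000.00'],
                 ['text_ok_here', '10.10']])

# One boolean per row, based only on the second column.
mask = np.array([is_float(value) for value in data[:, 1]])
cleaned = data[mask]

print(cleaned)
```

One caveat: `float()` also accepts strings such as 'nan' and 'inf', so if those can appear in your data you would need an extra check.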

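Here is one way. Loop over the row indices and check the elimination condition for each row. If the condition is not met, append the index to a list keep, which holds the indices of the rows to retain. You can then index the array with keep to get an array containing only the rows that do not meet the elimination condition. To index an array a with keep, do a[keep]. To overwrite the original array, do a = a[keep]. Below is an example that prints the array before slicing, the list of indices to keep, and the array after slicing.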

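#!/usr/bin/env python
import numpy

# dtype=object keeps the mixed str/int/None entries as Python objects
a = numpy.array([['foo', 2, 3], [4, 5, None], [7, 8, 'bar'], [10, None, 12]],
                dtype=object)
print(a)
keep = []  # indices of the rows to retain
j = 2      # column to check
for i in range(a.shape[0]):
    # keep the row unless the entry in column j is None or a string
    if not (a[i, j] is None or isinstance(a[i, j], str)):
        keep.append(i)
print(keep)
a2 = a[keep]
print(a2)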
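The same idea can also be turned around: collect the indices of the rows to drop and hand them to `numpy.delete`. A small sketch under the same assumptions as the example above (object array, column `j` checked for `None` or a string); the name `drop` is just illustrative:

```python
import numpy as np

a = np.array([['foo', 2, 3], [4, 5, None], [7, 8, 'bar'], [10, None, 12]],
             dtype=object)
j = 2  # column to check

# Indices of rows whose entry in column j is None or a string.
drop = [i for i, value in enumerate(a[:, j])
        if value is None or isinstance(value, str)]

# np.delete returns a new array; the original a is left untouched.
a2 = np.delete(a, drop, axis=0)

print(drop)
print(a2)
```

Whether you track rows to keep or rows to drop is mostly a matter of taste; both give the same array here.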
