Finding the index where a type conversion fails in a numpy array

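I have a 1D numpy array of strings that I need to convert to a new dtype. The new type may be an int, float, or datetime type. Some of the strings may be invalid for that type and cannot be converted, which raises an error, for example: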

>>> np.array(['10', '20', 'a'], dtype=int)
...
ValueError: invalid literal for int() with base 10: 'a'

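I want to find the index of that invalid value, which is 2 in this case. So far I can only think of two solutions, and neither is good:

1. Parse the exception message with a regular expression to find the invalid value, then look up the index of that value in the original array. This seems messy and error-prone.
2. Parse the values in a Python loop. This is likely to be much slower than the numpy version. For example, here is an experiment I ran: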

from timeit import timeit
import numpy as np
strings = np.array(list(map(str, range(10000000))))

def python_parse(arr):
    result = []
    for i, x in enumerate(arr):
        try:
            result.append(int(x))
        except ValueError:
            raise Exception(f'Failed at: {i}')
    return result

print(timeit(lambda: np.array(strings, dtype=int), number=10))  # 35 seconds
print(timeit(lambda: python_parse(strings), number=10))         # 52 seconds

This seems like a simple and common operation, so I expected the numpy library to have a built-in solution, but I can't find one.

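You can use np.core.defchararray.isdigit() to find the items that consist only of digits, then apply the logical NOT operator (~) to get a mask of the non-digit items. After that, np.where() gives you the corresponding indices: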

In [20]: arr = np.array(['10', '20', 'a', '4', '%'])
In [24]: np.where(~np.core.defchararray.isdigit(arr))
Out[24]: (array([2, 4]),)
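
One caveat with this check: isdigit() only returns True for strings made up entirely of digit characters, so signed or decimal strings are flagged even though int() or float() would accept them. A minimal sketch of that behaviour, using np.char.isdigit (the public alias of the same function) and a made-up sample array:

```python
import numpy as np

# Hypothetical sample: '-20' is a valid int and '3.5' a valid float,
# but neither consists purely of digit characters.
arr = np.array(['10', '-20', '3.5', 'a'])

# Mask of items that are NOT digit-only, then their indices.
flagged = np.where(~np.char.isdigit(arr))[0]
print(flagged)
```

So the isdigit() approach fits best when the input is known to contain only unsigned integers.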

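If you need to check for several types (for example float), you can write a custom function and apply it to the array with np.vectorize. Dates are a little trickier, but if you want a general approach you may want to use dateutil.parser.parse().

You can use a function like the following: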

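from dateutil import parser  # provided by the python-dateutil package
In [33]: def check_type(item):
    ...:     # Return True when item is neither a float nor a parseable date.
    ...:     try:
    ...:         float(item)
    ...:     except ValueError:
    ...:         try:
    ...:             parser.parse(item)
    ...:         except (ValueError, OverflowError):
    ...:             return True
    ...:         else:
    ...:             return False
    ...:     else:
    ...:         return False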

Then:

vector_func = np.vectorize(check_type)
np.where(vector_func(arr))

Demo:

In [45]: arr = np.array(['10.34', '-20', 'a', '4', '%', '2018-5-01'])
In [46]: vector_func = np.vectorize(check_type)
    ...: np.where(vector_func(arr))
    ...: 
Out[46]: (array([2, 4]),)
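
If the goal is to validate against one specific target dtype rather than "float or date", a similar per-element check can let numpy itself decide what is valid. The following is only a sketch: invalid_indices is a hypothetical helper name, and it assumes that casting a one-element slice with astype() raises ValueError (or OverflowError) for the same strings that fail in the full conversion.

```python
import numpy as np

def invalid_indices(arr, dtype):
    """Return the indices of items that cannot be cast to dtype."""
    arr = np.asarray(arr)
    bad = []
    for i in range(arr.size):
        try:
            # Cast a one-element slice so numpy's own parsing rules apply.
            arr[i:i + 1].astype(dtype)
        except (ValueError, OverflowError):
            bad.append(i)
    return np.array(bad, dtype=int)

print(invalid_indices(['10', '20', 'a'], int))
print(invalid_indices(['10.34', '-20', 'a', '4', '%'], float))
print(invalid_indices(['2018-05-01', 'a', '2019-01-31'], 'datetime64[D]'))
```

Note that numpy's datetime64 parsing expects ISO 8601 style strings, so it is stricter than dateutil and the two approaches may disagree on loosely formatted dates.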

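It turns out I overestimated the difference between Python and numpy. The Python code I posted in the question is very slow, but it can be made much faster by using a preallocated array: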

def python_parse(arr):
    result = np.empty(shape=(len(arr),), dtype=int)
    for i, x in enumerate(arr):
        try:
            result[i] = x
        except ValueError:
            raise Exception(f'Failed at: {i}')
    return result

This raises the error correctly and is almost as fast as np.array(strings, dtype=int) (which surprised me).

I would do something like this:
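
A possible variation on the same idea, offered as a sketch rather than a measured result: attempt the vectorized conversion first and only fall back to an element-by-element scan when it fails, so the slower loop runs only for arrays that actually contain a bad value. The name convert_or_locate is hypothetical, and no timings are claimed for it.

```python
import numpy as np

def convert_or_locate(arr, dtype):
    """Convert arr to dtype, or raise ValueError naming the first bad index."""
    arr = np.asarray(arr)
    try:
        return arr.astype(dtype)  # fast path: whole-array conversion
    except ValueError as exc:
        whole_array_error = exc
    # Slow path: scan one element at a time to locate the failure.
    for i in range(arr.size):
        try:
            arr[i:i + 1].astype(dtype)
        except ValueError:
            raise ValueError(
                f"Failed at index {i}: {str(arr[i])!r}"
            ) from None
    # No single element failed; surface the original error unchanged.
    raise whole_array_error
```

For example, convert_or_locate(np.array(['10', '20', 'a']), int) raises a ValueError whose message mentions index 2.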

custom_type = int
l = ['10', '20', 'a']
acc = np.array([], dtype=custom_type)
# enumerate keeps the position correct even after an earlier failure
for i, elem in enumerate(l):
    try:
        acc = np.concatenate((acc, np.array([elem], dtype=custom_type)))
    except ValueError:
        print("Failed to convert the type of the element in position {}".format(i))
