A fast way to fill a dataset with the same compound value in h5py

I have a large compound dataset in an HDF file. The compound type looks like this:

    numpy.dtype([('Image', h5py.special_dtype(ref=h5py.Reference)), 
                 ('NextLevel', h5py.special_dtype(ref=h5py.Reference))])

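Using it, I create a dataset that holds, at each position, a reference to an image and a reference to another dataset. These datasets have dimensions n x n, where n is typically at least 256 and more likely >2000. I have to initially fill every position of these datasets with the same value: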

    [[(image.ref, dataset.ref)...(image.ref, dataset.ref)],
      .
      .
      .
     [(image.ref, dataset.ref)...(image.ref, dataset.ref)]]

I would like to avoid filling it with two for loops like:

    for i in range(n):
        for j in range(n):
            daset[i, j] = (image.ref, dataset.ref)

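because the performance is very bad. So I searched for something like numpy.fill, numpy.shape, numpy.reshape, numpy.array, numpy.arange, [:] and so on. I tried these in various ways, but they all seem to work only with number and string data types. Is there any way to fill these datasets faster than with the for loops?

Thanks in advance.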

You can use either numpy broadcasting or a combination of numpy.repeat and numpy.reshape:

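    import h5py
    import numpy

    my_dtype = numpy.dtype([('Image', h5py.special_dtype(ref=h5py.Reference)),
                            ('NextLevel', h5py.special_dtype(ref=h5py.Reference))])
    # 0-d array holding the single record to replicate
    ref_array = numpy.array((image.ref, dataset.ref), dtype=my_dtype)
    # repeat() returns a flat array of n*n records; reshape it to n x n
    filled = numpy.repeat(ref_array, n*n)
    filled = filled.reshape((n, n))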

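Note that numpy.repeat returns a flattened array, hence the use of numpy.reshape. In this small 2 x 2 test, repeat seems to be faster than plain broadcasting: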

    %timeit empty_dataset=np.empty(2*2,dtype=my_dtype); empty_dataset[:]=ref_array
    100000 loops, best of 3: 9.09 us per loop
    %timeit repeat_dataset=np.repeat(ref_array, 2*2).reshape((2,2))
    100000 loops, best of 3: 5.92 us per loop
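
The timings above are for a 2 x 2 array, so it may be worth repeating the comparison at a size closer to the question's n. Here is a minimal sketch of both approaches that runs without an HDF5 file. As an assumption for illustration, the two h5py reference fields are replaced by plain integers, and the names `pair_dtype`, `fill_by_repeat` and `fill_by_broadcast` are made up here.

```python
import timeit

import numpy as np

# Stand-in compound dtype: plain integers replace the two h5py reference
# fields (an assumption, so the sketch runs without an HDF5 file).
pair_dtype = np.dtype([('Image', np.int64), ('NextLevel', np.int64)])
value = np.array((1, 2), dtype=pair_dtype)  # 0-d array holding one record


def fill_by_repeat(value, n):
    """Repeat the record n*n times (flat result), then reshape to n x n."""
    return np.repeat(value, n * n).reshape((n, n))


def fill_by_broadcast(value, n):
    """Allocate an n x n array and broadcast the record into every cell."""
    out = np.empty((n, n), dtype=value.dtype)
    out[...] = value
    return out


n = 256
a = fill_by_repeat(value, n)
b = fill_by_broadcast(value, n)
same = bool(np.array_equal(a, b))  # both approaches should give equal arrays

# Timings depend on the machine and NumPy version, so none are quoted here.
t_repeat = timeit.timeit(lambda: fill_by_repeat(value, n), number=20)
t_broadcast = timeit.timeit(lambda: fill_by_broadcast(value, n), number=20)
print(same, t_repeat, t_broadcast)
```

With a real h5py dataset, an array built this way could presumably be written with one slice assignment (for example `daset[...] = a`) instead of n*n element writes. That step is not part of the timings above and is left out of the sketch.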
