Multithreaded array processing, then writing into a result array, in a C Python extension

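The following code is a C Python extension. It takes an input buffer of contiguous raw bytes (for my application, "blocks" of raw bytes, where 1 block = 128 bytes), unpacks those bytes into 2-byte "samples", and places the results into the items array. The returned structure is simply a list of Python integers built from the processed buffer.

These are the two main operations:

    unpack_block(items, items_offset, buffer, buffer_offset, samples_per_block, sample_bits);

Then a loop goes over every sample in items and converts each one to a Python int.

    PyList_SET_ITEM(result, index, PyInt_FromLong(items[index]));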

    unsigned int num_blocks_per_thread, num_samples_per_thread, num_bytes_per_thread;
    unsigned int thread_id, p;
    unsigned int n_threads, start_index_bytes, start_index_blocks, start_index_samples;
    items = malloc(num_samples*sizeof(unsigned long));
    assert(items);
    #pragma omp parallel \
    default(none) \
    private(num_blocks_per_thread, num_samples_per_thread, num_bytes_per_thread, d, j, thread_id, n_threads, start_index_bytes, start_index_blocks, start_index_samples) \
    shared(samples_per_block, num_blocks, buffer, bytes_per_block, sample_bits, result, num_samples, items)
      {
        n_threads = omp_get_num_threads();
        num_blocks_per_thread = num_blocks/n_threads;
        num_samples_per_thread = num_samples/n_threads; 
        num_bytes_per_thread = num_blocks_per_thread*samples_per_block*2/n_threads;
        thread_id = omp_get_thread_num();
        start_index_bytes = num_bytes_per_thread*thread_id;
        start_index_blocks = num_blocks_per_thread*thread_id;  
        start_index_samples = num_samples_per_thread*thread_id;
        for (d=0; d<num_blocks_per_thread; d++) {
          unpack_block(items, start_index_samples+d*samples_per_block, buffer, start_index_blocks + d*bytes_per_block, samples_per_block, sample_bits);
        }
      }
     result = PyList_New(num_samples);
     assert(result);
     //*THIS WOULD ALSO SEEM RIPE FOR MULTITHREADING*
     for (p=0; p<num_samples; p++) {
        PyList_SET_ITEM(result, p, PyInt_FromLong( items[p] ));
      }
    free(items);
    free(buffer);
  return result;
}

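The speed is just brutal, far below what I expected from multithreading. I may have a false-sharing problem from writing to different chunks of the items array, even though each thread only processes mutually exclusive chunks of that array.

The basic question for me is: how do I correctly multithread per-element processing of a single array and then write the per-element results into a second "result" array? I do this twice, once with each of my two functions.

Any ideas for optimization, solutions, or approaches would be great. Thanks!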

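You already mentioned false sharing. To avoid it, you have to allocate the memory accordingly (using posix_memalign or another aligned allocation function) and also choose the chunk size so that the data size of one chunk is an exact multiple of the cache line size.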
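As a rough illustration of the alignment advice, here is a minimal sketch. The 64-byte cache line and the helper names (`round_up_to_cache_line`, `alloc_items_aligned`, `block_output_is_line_multiple`) are assumptions made for this example and do not come from the original code; check the actual line size of your CPU.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumption: 64-byte cache lines (common on x86-64; verify for your CPU). */
#define CACHE_LINE_BYTES 64

/* Round a byte count up to the next multiple of the cache line size. */
static size_t round_up_to_cache_line(size_t nbytes)
{
    return (nbytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES * CACHE_LINE_BYTES;
}

/* Hypothetical helper: allocate num_samples unsigned longs starting on a
 * cache-line boundary. Returns NULL on failure; release with free(). */
static unsigned long *alloc_items_aligned(size_t num_samples)
{
    void *p = NULL;
    size_t nbytes = round_up_to_cache_line(num_samples * sizeof(unsigned long));

    /* posix_memalign needs a power-of-two alignment that is a multiple
     * of sizeof(void *). */
    assert(CACHE_LINE_BYTES % sizeof(void *) == 0);
    if (posix_memalign(&p, CACHE_LINE_BYTES, nbytes) != 0)
        return NULL;
    return p;
}

/* Nonzero if the output of one block fills whole cache lines, so that
 * block-granular chunks of an aligned array never share a line. */
static int block_output_is_line_multiple(size_t samples_per_block)
{
    return (samples_per_block * sizeof(unsigned long)) % CACHE_LINE_BYTES == 0;
}
```

With 128-byte blocks of 2-byte samples there are 64 samples per block. If each one is stored as an 8-byte unsigned long, one block of output is 512 bytes, or eight whole lines under the 64-byte assumption. So if the items array starts on a line boundary and every thread works on whole blocks, no two threads should write to the same cache line.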

In general, measure the execution time with $ n $ threads and compute the speedup. Can you share the speedup curve with us?
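For reference, these are the usual definitions, writing $ T(n) $ for the wall-clock time measured with $ n $ threads (this notation is assumed here and does not appear in the question):

$$S(n) = \frac{T(1)}{T(n)}, \qquad E(n) = \frac{S(n)}{n}$$

Ideal scaling would give $ S(n) = n $ and $ E(n) = 1 $. A curve that flattens early, or drops below 1, usually points to overhead or a memory bottleneck rather than to a lack of threads.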

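Regarding the comment "this would also seem ripe for multithreading": expectations are often too high (just a warning to avoid disappointment). Consider the number of threads you use, the number of elements per thread, and the workload per thread (i.e., how much computation each item needs). Maybe the workload is so small that the OpenMP overhead dominates. Also, how many instructions are executed per memory load? In general, code with many instructions per memory load is a reasonable candidate for parallelization. A low ratio indicates that the program is memory bound.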
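One way to keep OpenMP overhead from dominating on small inputs is to let the runtime split the block loop and to enable threading only above some size. The sketch below is an illustration under stated assumptions, not the poster's code: `unpack_block_demo` is a hypothetical stand-in for the real `unpack_block` (it assumes little-endian 2-byte samples and ignores `sample_bits`), and `MIN_BLOCKS_FOR_THREADS` is an arbitrary threshold that would have to be tuned by measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for unpack_block(): reads samples_per_block
 * little-endian 2-byte samples from buffer at buffer_offset (in bytes)
 * and stores them in items starting at items_offset (in samples). */
static void unpack_block_demo(unsigned long *items, size_t items_offset,
                              const unsigned char *buffer, size_t buffer_offset,
                              size_t samples_per_block)
{
    for (size_t s = 0; s < samples_per_block; s++) {
        const unsigned char *src = buffer + buffer_offset + 2 * s;
        items[items_offset + s] =
            (unsigned long)src[0] | ((unsigned long)src[1] << 8);
    }
}

/* Assumed threshold below which the loop runs on a single thread. */
#define MIN_BLOCKS_FOR_THREADS 1024

/* Unpack every block. Each iteration handles one whole block, so the
 * threads write to disjoint, block-sized ranges of items. */
static void unpack_all_blocks(unsigned long *items, const unsigned char *buffer,
                              size_t num_blocks, size_t samples_per_block)
{
    const size_t bytes_per_block = 2 * samples_per_block;
    long b; /* signed loop index, accepted by every OpenMP version */

    /* schedule(static) gives each thread one contiguous range of blocks;
     * the if() clause keeps small inputs serial. */
    #pragma omp parallel for schedule(static) if(num_blocks >= MIN_BLOCKS_FOR_THREADS)
    for (b = 0; b < (long)num_blocks; b++) {
        unpack_block_demo(items, (size_t)b * samples_per_block,
                          buffer, (size_t)b * bytes_per_block,
                          samples_per_block);
    }
}
```

Letting `parallel for` do the partitioning also covers the case where the number of blocks is not divisible by the number of threads, which a manual `num_blocks/n_threads` split would silently truncate. Compiled without OpenMP support, the pragma is ignored and the loop simply runs serially.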

Speaking of memory access, are you on a multi-socket system with different NUMA domains? If so, you have to deal with affinity issues.
