使用qsort对c语言中可变长度字符串的多维数组进行排序

我有一个软件，可以生成一个相当大的文本文件，其中包含关于目录中文件的信息。通常有几千个文件。每一个都得到一组信息条目，看起来像:

number
number
IMPORTANT NUMBER
info
info
info
info
info

这些重复。对于目录中的每个文件，文本文件将具有相同的八行。

我应该排序这个文本文件的重要数字，值出现在第3行，然后3+8行，然后3+8 *2行，等等。

目前，我正在将它们读取到多维字符数组中，看起来像:

[number][number][IMPORTANT NUMBER 1][info][info][info][info][info]
[number][number][IMPORTANT NUMBER 2][info][info][info][info][info]
[number][number][IMPORTANT NUMBER 3][info][info][info][info][info]
[number][number][IMPORTANT NUMBER 4][info][info][info][info][info]

等。

思路是按重要数字升序对每组8个条目进行排序。例如，如果我的数组看起来像这样:

[number2][number2][2][info2][info2][info2][info2][info2]
[number3][number3][3][info3][info3][info3][info3][info3]
[number1][number1][1][info1][info1][info1][info1][info1]
[number4][number4][4][info4][info4][info4][info4][info4]

排序后，看起来像:

[number1][number1][1][info1][info1][info1][info1][info1]
[number2][number2][2][info2][info2][info2][info2][info2]
[number3][number3][3][info3][info3][info3][info3][info3]
[number4][number4][4][info4][info4][info4][info4][info4]

…使用arr[2](1,2,3,4…)中的值进行排序。

问题是存储在其他列中的信息往往大小不同。arr[3]的长度可能为30个字符。arr[4]的长度可能超过5000。对于大量的数据，这样做的速度很快，我不想只是分配最大长度可能是多少的集合大小，特别是当我在大多数情况下每次只使用它的一小部分时。

我找到了很多关于使用qsort的好答案，但是很少有关于排序大型多维字符串数组的答案。我也想使用qsort，因为我不想重新发明轮子，我怀疑我写的任何东西都不会如此高效。

如果有人能告诉我如何做到这一点，我将不胜感激。

当前代码为:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#define FIELDS 8
int compare(const void *row1, const void *row2);
int main(int argc, char *argv[])
{
    // (1) - Open File
    const char fname[] = "arrayFile.txt";
    FILE *fp = fopen(fname, "r");
    printf("Opened file: %sn", fname); 
    // (2) - Count Lines
    char cr;
    size_t lines = 0;
    while (cr != EOF)
    {
        if (cr == 'n') 
        {
            lines++;
        }
        cr = getc(fp);
    } 
    rewind(fp);
    // (3) - Populate Array
    char *data[lines / FIELDS][FIELDS];
    lines = lines / FIELDS;
    size_t n;
    for (int i = 0; i < lines; i++) 
    {
        for (int j = 0; j < FIELDS; j++)
        {
            data[i][j] = NULL;
            size_t n = 0;
            getline(&data[i][j], &n, fp);
        }    
    }
    // (4) - Print Array Before
    for (int i = 0; i < lines; i++) 
    {
        for (int j = 0; j < FIELDS; j++)
        {
            printf("%s", data[i][j]);
        }
    }
    printf("nnNot sortednn");
    // (5) - Sort Array
    qsort(data, lines, sizeof(data[0]), compare);
    printf("nnsortednn");
    // (6) - Print Array After
    for (int i = 0; i < lines; i++) 
    {
        for (int j = 0; j < FIELDS; j++)
        {
            printf("%s", data[i][j]);
            free(data[i][j]);
        }
    }
    // Close File
    fclose(fp);
    printf("nnNumber of files: %ldn", lines);
    printf("nnNumber of lines: %ldn", lines * FIELDS);
    return 0;
}
int compare(const void *row1, const void *row2)
{
    const char *(*a)[8] = row1;
    const char *(*b)[8] = row2;
    return strcmp((*a)[2], (*b)[2]);
}

不幸的是(并且可以预见)，这会在排序期间产生分段错误。我估计这是由于我如何处理指针和索引，但确切的原因是逃避我。

这似乎是一件非常有用的事情，知道如何在未来做得更好，但它比我个人之前在C中尝试做的数组和指针要多一点。

提前感谢。

编辑:对于感兴趣的各方，上面的代码，虽然没有优化，至少是功能性的。对于可能的改进建议，请参阅这里的答案。

SECOND EDIT:事实证明，通过system()命令对它进行排序也是一个可行的选择，并且占用更少的内存空间。这可以通过以下命令来完成:

char cmd[50];
sprintf(cmd, "sort -o destinationFileName sourceFileName");
system(cmd);

系统可以为您做这些。

你的代码中有多个问题:

您应该测试fopen可能无法打开文件。
char cr;应该是int cr;，以处理getc()返回的所有257个可能的值，假设是8位字节。
cr在while (cr != EOF)的第一次迭代期间未初始化。你应该这样写这个循环:
```
  int cr;
  while ((cr = getc(fp)) != EOF) {
      lines += (cr == 'n');
  }
```
如chux所记录的，读取整个文件的初始传递是不必要的，您应该在读取文件时重新分配数组。
char *data[lines / FIELDS][FIELDS];可能定义了一个对于自动存储来说太大的数组，导致堆栈溢出
size_t的正确格式说明符是%zu，而不是%ld。size_t不是long，甚至可能没有相同的大小或参数传递约定。
compare函数在类型转换中使用了太多的间接。虽然除了const的正确性之外，类型转换可能还可以，但对于大多数程序员来说，它们很难掌握。您应该使用更简单的方法:
```
int compare(const void *row1, const void *row2) {
     char * const *a = row1;
     char * const *b = row2;
     return strcmp(a[2], b[2]);
}
```

注意，上面的函数将按字典顺序对重要的数字进行排序，将11放在1和2之间。您可能需要数字顺序:

int compare(const void *row1, const void *row2) {
     char * const *a = row1;
     char * const *b = row2;
     long na = strtol(a, NULL, 10);
     long nb = strtol(b, NULL, 10);
     return (na > nb) - (na < nb);
}

很多问题——我只提到一个

计算行

不计算行数。(放弃第二步。)而不是进行2次传递，使用1次传递，而不是根据需要调整data。

一些未经测试的代码给OP一个想法:

  char *(*data)[FIELDS] = NULL;
  size_t records_n = 0;  // Allocation total
  size_t records_i;      // Allocation used
  for (records_i = 0; records_i < SIZE_MAX; records_i++) {
    if (records_i == records_n) {
      size_t records_new_n = records_n * 2 + 1;  // Double the allocation
      char *(*newdata)[FIELDS] = realloc(data, sizeof data[0] * records_new_n);
      if (newdata == NULL) {
        free(data);
        fprintf(stderr, "Out of memory.n");
        exit(EXIT_FAILURE);
      }
      data = newdata;
      records_n = records_new_n;
    }
    int f;
    for (f = 0; f < FIELDS; f++) {
      data[records_i][f] = NULL;
      size_t n = 0;
      if (getline(&data[records_i][f], &n, fp) == -1) {
        if (f == 0) {
          break;
        }
        fprintf(stderr, "Record ended early.n");
        break; // Or maybe fail?
      }
      // Lop off potential 'n'
      if (n > 0 && data[records_i][f][n - 1] == 'n') {
        data[records_i][f][--n] = 0;
      }
    }
    if (f < FIELDS) {
      break;
    }
  }
  // Perhaps right-size data to records_i here?  Not shown.
  // ... Use data
  // When all done, free all lines allocated (not shown) and ...
  free(data);

相关内容

最新更新

热门标签：