大文件的桶排序/计数排序

我有一个非常大的文件，我想以粗略的方式对其进行排序(10 TB(。基本上，我对这个文件中的一个字段进行散列，取该散列的最后4位数字，并将其作为列附加。这给了我一个与每一行相关的以16为基数的4位数，这意味着每一行都可以放入65536个桶中的一个。然后，我想在65536个文件之间分发这个文件，每个文件代表其中一个bucket。

我认为GNU排序不够聪明，无法加快此操作的速度——我不能指定只有65536个可能的密钥，所以我的假设是，它将像处理任何其他排序操作一样处理此问题。

我目前的策略是打开65536个文件句柄，逐行遍历文件，将每一行写入相应的文件。这打破了单个用户的ulimit界限，我知道这是可以修改的，但我不确定这是否是一个好策略。以前有人做过这样的事吗？

现在我有一个python脚本，看起来像这样：

bucketfilemap = { ... } # 65536 open files
s = time.time()
with open(infile, 'rb') as inf:
for line in inf:
tokens = line.split(delim)
bucketkey = tokens[keyloc]
bucketfilemap[bucketkey].write(line)
e = time.time()
print("time total:", (e - s))

在我对较小文件的测试中，它比我想要的要慢，尽管它确实会随着文件的大小线性扩展，这正是我想要的。

我设计了一个C程序来代替python脚本。这是一个奇怪的发现，但最终速度要快得多。

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <fcntl.h>
#define N 65536
#define N2 4
// custom hex to integer
int hextoint(char* str) {
int t = 0;
for (int i = 0; i < N2; i++) {
char c = str[i];
if (c < 58) {
t |= ((int)(c-48))<<(i<<2); // each hex digits represents 4 bits, 16=2**4
} else {
t |= ((int)(c-87))<<(i<<2);
}
}
return t;
}
int bucketsort(char* infilename, char** outfilenames) {
FILE* buckets[N];
FILE* infile;
char lbuf[512];
int hashidx;
int cp;
int len;
for (int i = 0; i < N; i++) {
buckets[i] = fopen(outfilenames[i], "wb");
}
infile = fopen(infilename, "rb");
int j = 0;
while (1) {
cp = fgets(lbuf, sizeof(lbuf), infile);
if (cp == NULL) {
break;
}
len = strlen(lbuf);
hashidx = hextoint(lbuf); // file should be formatted such that the hex key always comes at the beginning of the line
fwrite(lbuf, 1, len, buckets[hashidx]);
j++;
}
for (int i = 0; i < N; i++) {
int r = fclose(buckets[i]);
}

return 0;
}
void main(int argc, char** argv) {
char* outfilesfilename = argv[2];
char* infilename = argv[1];
FILE* outfilefd = fopen(outfilesfilename, "r");
char lbuf[512];
char** outfilenames = malloc(sizeof(char*)*N);
for (int i = 0; i < N; i++) {
outfilenames[i] = malloc(sizeof(char)*256);
}
int i = 0;
while (fgets(lbuf, sizeof(lbuf), outfilefd)) {
int len = strlen(lbuf);
memcpy(outfilenames[i], lbuf, len-1);
i++;
}
bucketsort(infilename, outfilenames);
}

我发现fclose在C中可能非常慢，而且这个速度似乎取决于打开的文件描述符的数量。打开文件很快，写入文件也很快。但是，当我打开65536个文件时，执行65536 fcloss需要30-50秒。当我将N更改为256(将N2更改为2(时，只需要十分之一秒。

640MB file
N = 256
1.970000 seconds elapsed on write
0.110000 seconds elapsed on close
N = 65536
4.550000 seconds elapsed on write
36.869999 seconds elapsed on close

无论我写的是500KB还是640MB，关闭文件总是需要30-50秒，而python写500KB到65536个文件则需要0.24秒。这仍然更好，因为这似乎是一种恒定的成本，而不是随着文件大小而缩放的成本。例如，对于500KB的文件：

500KB file
0.440000 seconds elapsed on write
45.889999 seconds elapsed on close

相关内容

最新更新

热门标签：