C Pearson hash 8-bit implementation produces very non-uniform values



I'm implementing a Pearson hash to build a lightweight dictionary structure for a C project that needs a table pairing filenames with file data — I want the constant-time lookup property of a hash table. I'm no math expert, so I looked up good hashes for text; Pearson came up, described as efficient with good distribution. But when I test my implementation, the hash distributes very poorly no matter how I vary the table size or the maximum filename length — for example, 18 of 50 buckets end up empty. I trust that Wikipedia isn't lying, and yes, I know I could download a third-party hash-table implementation, but I'd really like to know why my version doesn't behave.

In the code below (the function that inserts a value into the table): `csString` is the filename, the string to be hashed, `cLen` is the length of the string, `pData` is a pointer to the data inserted into the table, and `pTable` is the table struct. I found experimentally that the initial condition `cHash = cLen - csString[0]` slightly improves uniformity. I should add that I'm testing the table with fully randomized strings (ASCII values generated with `rand()`) whose lengths are randomized within a range — this makes it easy to generate and test a large number of values.

typedef struct StaticStrTable {
    unsigned int nRepeats;
    unsigned char nBuckets;
    unsigned char nMaxCollisions;
    void** pBuckets;
} StaticStrTable;
static const unsigned char cPerm256[256] = { // unsigned: values 128-255 would overflow a signed char
227, 117, 238, 33, 25, 165, 107, 226, 132, 88, 84, 68, 217, 237, 228, 58, 52, 147, 46, 197, 191, 119, 211, 0, 218, 139, 196, 153, 170, 77, 175, 22, 193, 83, 66, 182, 151, 99, 11, 144, 104, 233, 166, 34, 177, 14, 194, 51, 30, 121, 102, 49,
222, 210, 199, 122, 235, 72, 13, 156, 38, 145, 137, 78, 65, 176, 94, 163, 95, 59, 92, 114, 243, 204, 224, 43, 185, 168, 244, 203, 28, 124, 248, 105, 10, 87, 115, 161, 138, 223, 108, 192, 6, 186, 101, 16, 39, 134, 123, 200, 190, 195, 178,
164, 9, 251, 245, 73, 162, 71, 7, 239, 62, 69, 209, 159, 3, 45, 247, 19, 174, 149, 61, 57, 146, 234, 189, 15, 202, 89, 111, 207, 31, 127, 215, 198, 231, 4, 181, 154, 64, 125, 24, 93, 152, 37, 116, 160, 113, 169, 255, 44, 36, 70, 225, 79,
250, 12, 229, 230, 76, 167, 118, 232, 142, 212, 98, 82, 252, 130, 23, 29, 236, 86, 240, 32, 90, 67, 126, 8, 133, 85, 20, 63, 47, 150, 135, 100, 103, 173, 184, 48, 143, 42, 54, 129, 242, 18, 187, 106, 254, 53, 120, 205, 155, 216, 219, 172,
21, 253, 5, 221, 40, 27, 2, 179, 74, 17, 55, 183, 56, 50, 110, 201, 109, 249, 128, 112, 75, 220, 214, 140, 246, 213, 136, 148, 97, 35, 241, 60, 188, 180, 206, 80, 91, 96, 157, 81, 171, 141, 131, 158, 1, 208, 26, 41
};
void InsertStaticStrTable(char* csString, unsigned char cLen, void* pData, StaticStrTable* pTable) {
    unsigned char cHash = cLen - csString[0];
    // cast avoids a negative table index when plain char is signed
    for (int i = 0; i < cLen; ++i) cHash ^= cPerm256[cHash ^ (unsigned char)csString[i]];

    unsigned short cTableIndex = cHash % pTable->nBuckets;
    long long* pBucket = (long long*)pTable->pBuckets[cTableIndex];

    // Inserts data and records how many collisions there are - it may look weird as the way in which I decided to pack the data into the table buffer is very compact and arbitrary
    // It won't affect the hash though, which is the key issue!
    for (int i = 0; i < pTable->nMaxCollisions; ++i) {
        if (i == 1) {
            pTable->nRepeats++;
        }
        long long* pSlotID = pBucket + (i << 1);
        if (pSlotID[0] == 0) {
            pSlotID[0] = (long long)csString;
            pSlotID[1] = (long long)pData;
            break;
        }
    }
}
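For comparison, here is the textbook Pearson mixing step. The one difference from the loop above is the assignment: classic Pearson *replaces* the state with the table lookup (`h = perm[h ^ c]`) rather than XOR-ing the lookup back into the state (`cHash ^= ...`). This is only a sketch — `PearsonHash` is my own name, and it assumes any permutation of 0..255 as the table (e.g. `cPerm256`):

```c
#include <stddef.h>

/* Textbook Pearson hash: the state is REPLACED by the table lookup
   each step, rather than XOR-accumulated as in the question's loop.
   perm must be a permutation of 0..255 (e.g. cPerm256 above). */
static unsigned char PearsonHash(const unsigned char* s, size_t len,
                                 const unsigned char perm[256]) {
    unsigned char h = 0;
    for (size_t i = 0; i < len; ++i)
        h = perm[h ^ s[i]];
    return h;
}
```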

FYI (this is not an answer, I just needed the formatting): these are just single runs of a simulator, YMMV.

Randomly distributing 50 elements over 50 bins:


kalender_size=50 nperson = 50
E/cell| Ncell | frac   |  Nelem   |  frac  |h/cell|  hops  | Cumhops
----+---------+--------+----------+--------+------+--------+--------
0:       18 (0.360000)        0 (0.000000)     0        0        0
1:       18 (0.360000)       18 (0.360000)     1       18       18
2:       10 (0.200000)       20 (0.400000)     3       30       48
3:        4 (0.080000)       12 (0.240000)     6       24       72
----+---------+--------+----------+--------+------+--------+--------
4:       50                  50                1.440000         72

Similarly: distributing 365 persons over a birthday calendar (ignoring leap days ...):


kalender_size=356 nperson = 356
E/cell| Ncell | frac   |  Nelem   |  frac  |h/cell|  hops  | Cumhops
----+---------+--------+----------+--------+------+--------+--------
0:      129 (0.362360)        0 (0.000000)     0        0        0
1:      132 (0.370787)      132 (0.370787)     1      132      132
2:       69 (0.193820)      138 (0.387640)     3      207      339
3:       19 (0.053371)       57 (0.160112)     6      114      453
4:        6 (0.016854)       24 (0.067416)    10       60      513
5:        1 (0.002809)        5 (0.014045)    15       15      528
----+---------+--------+----------+--------+------+--------+--------
6:      356                 356                1.483146        528

For N items over N slots, the expectation of the `number of empty slots` and the `number of slots with exactly one item in them` are equal. The expected density of both is 1/e.

The final number (1.483146) is the number of `->next` pointer traversals per element found (when a chained hash table is used). Any near-optimal hash function will reach approximately 1.5.
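The 1.5 figure also drops out of the Poisson model: at load factor 1, bucket occupancy is approximately Poisson(1), and a bucket holding k elements costs 1 + 2 + ... + k = k(k+1)/2 traversals in total (matching the `h/cell` column above), so the expected cost per element is E[k(k+1)/2] / E[k] = 3/2 exactly. A numeric check of that sum, assuming the Poisson approximation (function name is my own):

```c
#include <math.h>

/* Expected chained-lookup cost per element at load factor 1.
   Bucket occupancy ~ Poisson(1): P(k) = e^-1 / k!.
   A bucket with k elements costs k*(k+1)/2 traversals in total. */
static double ExpectedHopsPerElement(void) {
    double p = exp(-1.0);       /* P(0) */
    double hops = 0.0, elems = 0.0;
    for (int k = 1; k <= 60; ++k) {
        p /= k;                 /* P(k) = P(k-1) / k for lambda = 1 */
        hops  += p * k * (k + 1) / 2.0;
        elems += p * k;
    }
    return hops / elems;        /* converges to 1.5 */
}
```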
