Hadoop load balancing

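I have many distinct keys generated in the following format:

"71 1 2", "69 2 3", "68 5 6", and so on.

However, I find that most of these keys end up on the same reducer.

Even though I implemented a custom partitioner, the getNumPartitioner method, where we compute hash_val % numReducers, mostly returns values belonging to just a few reducers. Those reducers get all the load while the others stay idle. As I understand it, we can use WritableComparator to control how keys are sorted, but not which reducer a key goes to.

Is there a way to improve the load balancing? Please help.

I am attaching some code below to make my explanation clearer: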

String a = "71 1 2";
String b = "72 1 1";
String c = "70 1 3";
int hash_a = a.hashCode();
int hash_b = b.hashCode();
int hash_c = c.hashCode();
int part_a = hash_a % 10;
int part_b = hash_b % 10;
int part_c = hash_c % 10;
System.out.println("hash a: "+hash_a+" part_a: "+part_a);
System.out.println("hash b: "+hash_b+" part_b: "+part_b);
System.out.println("hash c: "+hash_c+" part_c: "+part_c);

Output:

hash a: 1620857277 part_a: 7
hash b: 1621780797 part_b: 7
hash c: 1619933757 part_c: 7

As we can see, different keys tend to map to the same reducer.

Please help! Thanks.

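First, you can't simply use Java's modulus operator, because a hash code can be negative and there is no such thing as a negative partition. So you would probably take the absolute value.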
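One caveat on the absolute-value idea: Math.abs(Integer.MIN_VALUE) is still negative, so abs alone does not cover every case. Here is a minimal sketch of two alternatives. The class and method names are made up for illustration. The masking variant follows the same idea as Hadoop's default HashPartitioner, which clears the sign bit before taking the remainder.

```java
final class PartitionIndex {
    // Clear the sign bit, then take the remainder (the masking approach).
    static int byMask(int hash, int numPartitions) {
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    // floorMod returns a value in [0, numPartitions) for any positive divisor.
    static int byFloorMod(int hash, int numPartitions) {
        return Math.floorMod(hash, numPartitions);
    }

    public static void main(String[] args) {
        int worstCase = Integer.MIN_VALUE;
        // Math.abs cannot represent +2^31, so the result here stays negative.
        System.out.println("abs then %: " + (Math.abs(worstCase) % 10));
        System.out.println("mask then %: " + byMask(worstCase, 10));
        System.out.println("floorMod:    " + byFloorMod(worstCase, 10));
    }
}
```

The two helpers can return different indices for the same negative hash, which is fine as long as the job uses one of them consistently.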

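Second, here is a strong hash function I found online. It produces a 64-bit long instead of the usual 32-bit int. It suffers from the same negative-partition problem, but you can correct that yourself.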

import java.nio.charset.StandardCharsets;

// The class name is arbitrary; it is only here so the snippet compiles.
public class StrongHash {
    private static final long HSTART = 0xBB40E64DA205B064L;
    private static final long HMULT = 7664345821815920749L;
    // Built once at class load instead of on every call to hash().
    private static final long[] BYTE_TABLE = createLookupTable();

    // Fills a 256-entry table of pseudo-random 64-bit values, one per byte value.
    private static long[] createLookupTable() {
        long[] table = new long[256];
        long h = 0x544B2FBACAAF1684L;
        for (int i = 0; i < 256; i++) {
            for (int j = 0; j < 31; j++) {
                h = (h >>> 7) ^ h;
                h = (h << 11) ^ h;
                h = (h >>> 10) ^ h;
            }
            table[i] = h;
        }
        return table;
    }

    // Mixes every byte of the string into a 64-bit hash.
    public static long hash(String s) {
        // An explicit charset keeps the result independent of the platform default.
        byte[] data = s.getBytes(StandardCharsets.UTF_8);
        long h = HSTART;
        for (byte b : data) {
            h = (h * HMULT) ^ BYTE_TABLE[b & 0xff];
        }
        return h;
    }

    public static void main(String[] args) {
        String a = "71 1 2";
        String b = "72 1 1";
        String c = "70 1 3";
        long hash_a = hash(a);
        long hash_b = hash(b);
        long hash_c = hash(c);
        // % can yield a negative value for a negative hash;
        // Math.floorMod(hash, 10) would keep the result in 0..9.
        long part_a = hash_a % 10;
        long part_b = hash_b % 10;
        long part_c = hash_c % 10;
        System.out.println("hash a: " + hash_a + " part_a: " + part_a);
        System.out.println("hash b: " + hash_b + " part_b: " + part_b);
        System.out.println("hash c: " + hash_c + " part_c: " + part_c);
    }
}

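It looks like you have a data skew problem, and you will need to be a bit smarter in your partitioner. A few things you could try:

Hadoop ships with a MurmurHash implementation. You could try using it in your partitioner in place of hashCode(); that may give you more even partitions.
Maybe you need to look beyond hashing. Is there anything about how the keys are generated that you can exploit to get a more even distribution? For example, for the key "71 1 2", could you split on the spaces and take the first number (e.g. 71) modulo the number of partitions?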
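The second suggestion could look roughly like the sketch below. It assumes every key starts with an integer field followed by whitespace, which matches the samples in the question but is not confirmed for the full data set. The class and method names are hypothetical; in an actual job the same logic would go inside your Partitioner's getPartition(key, value, numPartitions).

```java
final class FirstFieldPartition {
    // Hypothetical helper: partition on the leading numeric field of keys like "71 1 2".
    static int partitionFor(String key, int numPartitions) {
        String firstField = key.trim().split("\\s+")[0];
        int id = Integer.parseInt(firstField);
        // floorMod keeps the index non-negative even if the field is negative.
        return Math.floorMod(id, numPartitions);
    }

    public static void main(String[] args) {
        String[] keys = {"71 1 2", "72 1 1", "70 1 3"};
        for (String key : keys) {
            System.out.println(key + " -> partition " + partitionFor(key, 10));
        }
    }
}
```

With 10 partitions the three sample keys land on three different partitions. Whether this balances the real job depends on how evenly the first field is spread, since all keys sharing a first field go to the same reducer.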

You didn't mention whether some keys in your data are actually repeated. If they are, a custom combiner might help.

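I'm not sure a "better" hash function will help, because the uneven distribution may be caused by the nature of the data you are processing. A hash function always gives the same output for the same input.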
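One data-dependent effect can be checked directly for the sample keys. String.hashCode() is a polynomial in 31, and 31 leaves remainder 1 when divided by 10. So as long as the hash does not overflow an int (true for these six-character keys), hashCode() % 10 equals the sum of the character codes modulo 10. The three sample keys have the same character sum, so they share a partition when there are 10 reducers. A small sketch to verify this, with made-up class and method names:

```java
final class HashModCheck {
    // Sum of the UTF-16 code units of s, reduced modulo m.
    static int charSumMod(String s, int m) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);
        }
        return sum % m;
    }

    public static void main(String[] args) {
        String[] keys = {"71 1 2", "72 1 1", "70 1 3"};
        for (String key : keys) {
            // Both columns agree for these keys because their hashes do not overflow.
            System.out.println(key
                    + " hashCode % 10 = " + (key.hashCode() % 10)
                    + ", charSum % 10 = " + charSumMod(key, 10));
        }
    }
}
```

If many of the real keys share a character sum like this, the clustering comes from the combination of hashCode() and a reducer count of 10, and a different reducer count or a better-mixing hash may spread them out. If the skew instead comes from a few keys carrying most of the records, no hash function will fix it.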
