MapReduce input from HTable

I have a MapReduce job whose input comes from an HTable. In Java MapReduce code, how do I set the Job's input format to HBase's TableInputFormat?

Is there something like a JDBC connection for connecting to an HTable database?

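If your client and HBase run on the same machine, you don't need to configure anything for the client to talk to HBase. Just create an HBaseConfiguration instance and connect to the HTable: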

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "TABLE_NAME");

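If your client runs on a remote machine, however, it relies on ZooKeeper to talk to your HBase cluster, so the client needs the location of the ZooKeeper ensemble before it can proceed. To let a client connect to an HBase cluster, we usually configure it like this: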

Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "ZK_MACHINE_IP/HOSTNAME");
conf.set("hbase.zookeeper.property.clientPort","2181");
HTable table = new HTable(conf, "TABLE_NAME");
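The snippet above names a single ZooKeeper machine, but `hbase.zookeeper.quorum` takes a comma-separated list of hosts, so an ensemble with several members is listed in one value. A minimal sketch of assembling those two settings, assuming a hypothetical helper class `ZkClientSettings` and placeholder host names (neither is part of HBase):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of HBase: collects the two client-side
// ZooKeeper settings as plain key/value pairs.
final class ZkClientSettings {
    private ZkClientSettings() {}

    static Map<String, String> forEnsemble(List<String> hosts, int clientPort) {
        if (hosts == null || hosts.isEmpty()) {
            throw new IllegalArgumentException("at least one ZooKeeper host is required");
        }
        if (clientPort < 1 || clientPort > 65535) {
            throw new IllegalArgumentException("invalid client port: " + clientPort);
        }
        // Join the hosts with commas, trimming stray whitespace around each name.
        StringBuilder quorum = new StringBuilder();
        for (String host : hosts) {
            String trimmed = (host == null) ? "" : host.trim();
            if (trimmed.isEmpty()) {
                throw new IllegalArgumentException("blank ZooKeeper host name");
            }
            if (quorum.length() > 0) {
                quorum.append(',');
            }
            quorum.append(trimmed);
        }
        // LinkedHashMap keeps the insertion order of the two keys.
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("hbase.zookeeper.quorum", quorum.toString());
        settings.put("hbase.zookeeper.property.clientPort", Integer.toString(clientPort));
        return settings;
    }
}
```

Each entry of the returned map can then be applied with `conf.set(key, value)` on the Configuration before the table is opened.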

That is how it is done through the Java API. HBase also supports several other APIs; you can find more information here.

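Back to your first question: if you need TableInputFormat as the InputFormat in your MR job, set it on the Job object. Note that TableInputFormat reads the table name from the job configuration key TableInputFormat.INPUT_TABLE, so that key must be set as well. Setting the input format looks like this: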

job.setInputFormatClass(TableInputFormat.class);

Hope this answers your question.

HBase provides a TableMapReduceUtil class that makes it easy to set up map/reduce jobs. Here is the first example from the manual:

Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MyReadJob.class);     // class that contains mapper
Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs
...
TableMapReduceUtil.initTableMapperJob(
  tableName,        // input HBase table name
  scan,             // Scan instance to control CF and attribute selection
  MyMapper.class,   // mapper
  null,             // mapper output key
  null,             // mapper output value
  job);
job.setOutputFormatClass(NullOutputFormat.class);   // because we aren't emitting anything from mapper
boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}
