Finding word length frequency with MapReduce



I'm new to MapReduce and I'd like to ask whether anyone can give me an idea of how to compute word length frequency with MapReduce. I already have the word count code, but I want to use word lengths instead; this is what I have so far.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }
}

Thanks...

For word length frequency, tokenizer.nextToken() should not be emitted as the key. What actually matters is the length of that string. So your code will work with just the following change:

word.set( String.valueOf( tokenizer.nextToken().length() ));  

Now, if you dig a little deeper, you'll realize that the Mapper output key should no longer be Text, even though that would still work. It's better to use an IntWritable key:

public static class Map extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private IntWritable wordLength = new IntWritable();
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            wordLength.set(tokenizer.nextToken().length());
            context.write(wordLength, one);
        }
    }
}
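The answer only shows the mapper, but once the map output key becomes IntWritable, the reducer and driver types have to match. A minimal summing reducer along the lines of the standard word-count one (a sketch based on that assumption, since your existing reducer isn't shown) might look like this:

    // requires: import org.apache.hadoop.mapreduce.Reducer;
    public static class Reduce extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        private final IntWritable total = new IntWritable();

        @Override
        public void reduce(IntWritable length, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            // Sum the 1s emitted by the mapper for this word length
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            total.set(sum);
            context.write(length, total);
        }
    }

In the driver, job.setOutputKeyClass(Text.class) would likewise need to become job.setOutputKeyClass(IntWritable.class).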

Although most MapReduce examples use StringTokenizer, using the String.split method is cleaner and preferable, so make that change accordingly.
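As a rough sketch of that change (one way to apply it, not part of the original answer), the mapper's tokenizing loop could split each line on runs of whitespace; the empty-token check guards against a leading separator, which StringTokenizer would have skipped for you:

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line on runs of whitespace instead of using StringTokenizer
        String[] tokens = value.toString().split("\\s+");
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if the line starts with whitespace
            }
            wordLength.set(token.length());
            context.write(wordLength, one);
        }
    }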
