I'm new to MapReduce, and I'd like to ask whether anyone can give me an idea of how to compute word-length frequencies with MapReduce. I already have the word-count code working, but I want to count word lengths instead. This is what I have so far:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCount {
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }
}
Thanks...
For word-length frequency, tokenizer.nextToken() should not be emitted as the key; what you actually want is the length of that string. So your code works with just the following change:
word.set( String.valueOf( tokenizer.nextToken().length() ));
Now, if you dig a little deeper, you'll realize that the Mapper output key should no longer be Text, even though Text would still work. It's better to use an IntWritable key:
public static class Map extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private IntWritable wordLength = new IntWritable();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            wordLength.set(tokenizer.nextToken().length());
            context.write(wordLength, one);
        }
    }
}
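The snippet above shows only the map side; the reduce side is the same summing logic as in plain word count: for each length key, add up the ones. Here is a plain-Java simulation of what a Reducer<IntWritable, IntWritable, IntWritable, IntWritable> would do for one key (the class name ReduceSketch is mine, just for illustration; it is not Hadoop code):

```java
import java.util.Arrays;

public class ReduceSketch {
    // Simulates reduce(key, values): sum the counts emitted for one length key.
    public static int reduce(Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Three words of length 5 were each emitted as (5, 1) by the mapper,
        // so the reducer receives (5, [1, 1, 1]) and outputs (5, 3).
        System.out.println(reduce(Arrays.asList(1, 1, 1))); // prints 3
    }
}
```

In the actual job you would write this as a summing Reducer and register it with job.setReducerClass (it can also serve as a combiner, since summing is associative).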
Although most MapReduce examples use StringTokenizer, the String.split method is cleaner and preferable, so consider making that change as well.
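To illustrate the String.split suggestion, here is the length-frequency logic in plain Java end to end (a sketch; the regex "\\s+" splits on runs of whitespace, roughly matching StringTokenizer's default delimiters, and the class and method names are mine):

```java
import java.util.HashMap;
import java.util.Map;

public class WordLengthFrequency {
    // Computes length -> count for one line, the same result the
    // full MapReduce job produces over its input.
    public static Map<Integer, Integer> lengthFrequencies(String line) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (String token : line.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // guard: split on an all-whitespace line yields one empty token
            }
            // The "reduce" side collapsed into one step: sum the ones per length.
            counts.merge(token.length(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // "the" and "fox" have length 3; "quick" and "brown" have length 5.
        System.out.println(lengthFrequencies("the quick brown fox"));
    }
}
```

Inside the mapper, the equivalent change is to replace the StringTokenizer loop with a for-each over value.toString().trim().split("\\s+").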