How to manage a join in Hadoop - MultipleInputPath



After the map-side join, the data I get in the Reducer looks like this:

key------ book
values
    6
    eraser=>book 2
    pen=>book 4
    pencil=>book 5

What I need to compute is:

eraser=>book = 2/6
pen=>book = 4/6
pencil=>book = 5/6

What I did initially was:

public void reduce(Text key,Iterable<Text> values , Context context) throws IOException, InterruptedException{
        System.out.println("key------ "+key);
        System.out.println("Values");
        // declared outside the loop so the denominator survives across values
        double BsupportCnt = 0;
        double UsupportCnt = 0;
        double res = 0;
        for(Text value : values){
            System.out.println("\t"+value.toString());
            String v = value.toString();
            if(!v.contains("=>")){
                // a plain number: the denominator (support count of the key itself)
                BsupportCnt = Double.parseDouble(v);
            }
            else{
                // "x=>key n": take n as the numerator for this rule
                String parts[] = v.split(" ");
                UsupportCnt = Double.parseDouble(parts[1]);
            }
            // calculate here
            res = UsupportCnt/BsupportCnt;
        }
}

This works fine if the incoming data arrives in the order shown above.

But if the incoming data from the mapper is:

key------ book
values
    eraser=>book 2
    pen=>book 4
    pencil=>book 5
    6

then it does not work. Otherwise I would have to store all the => entries in a list (and if the incoming data is large, the list could exhaust heap space) and do the calculation only once I get the plain number.

Since Vefthym suggested secondary-sorting the values before they reach the reducer, I did the same using htuple. I referred to this link.

In Mapper1, eraser=>book 2 is the value, so:

public class AprioriItemMapper1 extends Mapper<Text, Text, Text, Tuple>{
    public void map(Text key,Text value,Context context) throws IOException, InterruptedException{
        //Configuration and other stuff
        //allWords is an ArrayList
        if(allWords.size()<=2)
        {
            Tuple outputKey = new Tuple();
            String LHS1 = allWords.get(1);
            String RHS1 = allWords.get(0)+"=>"+allWords.get(1)+" "+value.toString();
            outputKey.set(TupleFields.ALPHA, RHS1);
            context.write(new Text(LHS1), outputKey);
        }
//other stuff

Mapper2 emits the numbers as values:

public class AprioriItemMapper2 extends Mapper<Text, Text, Text, Tuple>{
    Text valEmit = new Text(); 
    public void map(Text key,Text value,Context context) throws IOException, InterruptedException{
        //Configuration and other stuff
        if(cnt != supCnt && cnt < supCnt){
            System.out.println("emit");
            Tuple outputKey = new Tuple();
            outputKey.set(TupleFields.NUMBER, value);
            System.out.println("v---"+value);
            System.out.println("outputKey.toString()---"+outputKey.toString());
            context.write(key, outputKey);
        }

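For context, here is a minimal sketch of how the two mappers might be wired to a single reduce phase with MultipleInputs. The driver class name, input paths, KeyValueTextInputFormat and reducer class are my assumptions, not taken from the original job; the htuple shuffle configuration that registers TupleMapReducePartitioner is set elsewhere and omitted here.

// Hypothetical driver sketch: both mappers feed one reducer via MultipleInputs.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.htuple.Tuple;

public class AprioriDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "apriori-join");
        job.setJarByClass(AprioriDriver.class);

        // one input path per mapper class (paths here are assumptions)
        MultipleInputs.addInputPath(job, new Path(args[0]),
                KeyValueTextInputFormat.class, AprioriItemMapper1.class);
        MultipleInputs.addInputPath(job, new Path(args[1]),
                KeyValueTextInputFormat.class, AprioriItemMapper2.class);

        // both mappers must emit the same key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Tuple.class);

        job.setReducerClass(AprioriItemReducer.class);   // hypothetical reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
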
In the Reducer I simply try to print the keys and values.
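
A minimal sketch of such a print-only reducer, assuming the values arrive as org.htuple.Tuple (the class name matches the one assumed in the driver sketch above):

// Print-only reducer sketch: just echoes every value it receives per key.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.htuple.Tuple;

public class AprioriItemReducer extends Reducer<Text, Tuple, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Tuple> values, Context context)
            throws IOException, InterruptedException {
        System.out.println("key------ " + key);
        for (Tuple value : values) {
            System.out.println("\t" + value.toString());
            context.write(key, new Text(value.toString()));
        }
    }
}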

But this throws the following error:

Mapper 2: 
line book
Support Count: 2
count--- 1
emit
v---6
outputKey.toString()---[0]='6, 
14/08/07 13:54:19 INFO mapred.LocalJobRunner: Map task executor complete.
14/08/07 13:54:19 WARN mapred.LocalJobRunner: job_local626380383_0003
java.lang.Exception: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.htuple.Tuple
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:406)
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.htuple.Tuple
    at org.htuple.TupleMapReducePartitioner.getPartition(TupleMapReducePartitioner.java:28)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:601)
    at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:106)
    at edu.am.bigdata.apriori.AprioriItemMapper1.map(AprioriItemMapper1.java:49)
    at edu.am.bigdata.apriori.AprioriItemMapper1.map(AprioriItemMapper1.java:1)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
    at org.apache.hadoop.mapreduce.lib.input.DelegatingMapper.run(DelegatingMapper.java:51)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)

The error points to AprioriItemMapper1.java:49, which is context.write(new Text(LHS1), outputKey);, yet the printed details above come from Mapper 2.

Is there a better way to do this? Please suggest.

I would suggest using secondary sorting, which would guarantee that the first value (sorted lexicographically) is a number, assuming that no words start with a digit.
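
One generic way to get that ordering is the standard composite-key pattern, sketched below. This is only an illustration, not the htuple setup from the question: the '#' tag scheme, the assumption of Text values, and the class names are mine. The mappers would emit tagged keys such as book#0 for the plain count and book#1 for the rule lines, and partitioning and grouping use only the natural key before the '#':

// Secondary-sort sketch: partition and group on the natural key ("book"),
// while the default Text sort of the full key puts "book#0" (the count)
// before any "book#1" rule line within the group. All names are hypothetical.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

public class NaturalKeyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // send "book#0" and "book#1" to the same reducer
        String naturalKey = key.toString().split("#")[0];
        return (naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// registered with job.setGroupingComparatorClass(...)
class NaturalKeyGroupingComparator extends WritableComparator {
    protected NaturalKeyGroupingComparator() {
        super(Text.class, true);
    }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // group on the natural key only, so one reduce() call sees all values for "book"
        String left = ((Text) a).toString().split("#")[0];
        String right = ((Text) b).toString().split("#")[0];
        return left.compareTo(right);
    }
}

With job.setPartitionerClass(NaturalKeyPartitioner.class) and job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class), a single reduce() call receives all values for book, and the count tagged #0 arrives before any rule tagged #1.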

If that does not work, then, accepting the scalability limitation you mention, I would store the reducer's values in a HashMap<String,Double> buffer, with the left-hand side of "=>" as keys and their numeric counts as values. You only need to buffer values until you get the denominator BsupportCnt. From then on you can emit the whole buffer with the correct fractions, and emit all remaining values one by one as they arrive, without using the buffer again (since you now know the denominator). Like this:

public void reduce(Text key,Iterable<Text> values , Context context) throws IOException, InterruptedException{
    Map<String,Double> buffer = new HashMap<>();
    double BsupportCnt = 0;
    double UsupportCnt;
    double res;
    for(Text value : values){
        String v = value.toString();
        if(!v.contains("=>")){
            BsupportCnt = Double.parseDouble(v);
        } else {
            String parts[] = v.split(" ");
            UsupportCnt = Double.parseDouble(parts[1]);
            if (BsupportCnt != 0) { //no need to add things to the buffer any more
               res = UsupportCnt/BsupportCnt;
               context.write(new Text(v), new DoubleWritable(res));
            } else {
               buffer.put(parts[0], UsupportCnt);
            }
        }
    }

    //now emit the buffer's contents
    for (Map.Entry<String, Double> entry : buffer.entrySet()) {
        context.write(new Text(entry.getKey()), new DoubleWritable(entry.getValue()/BsupportCnt));
    }
}

You can save even more space by storing only the left-hand side of "=>" as the HashMap key, since the right-hand side is always the reducer's input key.
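
As a sketch of that tweak (reusing the variable names from the reducer above; lhs and rule are names I introduce here):

            // buffer only the antecedent ("eraser"), not the full "eraser=>book"
            String lhs = parts[0].split("=>")[0];
            buffer.put(lhs, UsupportCnt);

    // ...and when draining the buffer, rebuild the rule from the reducer's input key
    for (Map.Entry<String, Double> entry : buffer.entrySet()) {
        String rule = entry.getKey() + "=>" + key.toString();
        context.write(new Text(rule), new DoubleWritable(entry.getValue() / BsupportCnt));
    }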
