Building a fuzzy join index with the Hamming distance algorithm using MapReduce



I am new to Java and MapReduce, and I am trying to write a MapReduce program that reads a list of words called the "dictionary" and, using the Hamming distance algorithm, generates all the words in the list that are at distance 1 from each word (for example, "cat" and "bat"). I can generate the output, but the problem is that it seems very inefficient: I have to load the entire list into an ArrayList, and for every word I call the Hamming distance routine inside the map method, so I read the whole list twice and run the Hamming distance algorithm n*n times, where n is the number of words in the list.

Can you suggest a more efficient approach?

Here is the code. There is no reducer so far.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
public class MapJoin {

    public static class MyMapper extends Mapper<LongWritable,Text, Text, Text> {

        private List<String> Lst = new ArrayList<String>();

        // Runs once per mapper: load the dictionary file from the distributed cache into memory.
        protected void setup(Context context) throws java.io.IOException, InterruptedException{
            Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());

            for (Path p : files) {
                if (p.getName().equals("dictionary.txt")) {
                    BufferedReader reader = new BufferedReader(new FileReader(p.toString()));
                    String line;
                    while ((line = reader.readLine()) != null) {
                        Lst.add(line);   // line is already a String; no conversion needed
                    }
                    reader.close();
                }
            }
            if (Lst.isEmpty()) {
                throw new IOException("Unable to load dictionary data.");
            }
        }

        // For every token in the input, emit the token together with its close dictionary matches.
        public void map(LongWritable key, Text val, Context con)
                throws IOException, InterruptedException {
            String line1 = val.toString();
            StringTokenizer itr = new StringTokenizer(line1.toLowerCase());
            while (itr.hasMoreTokens()) {
                String key1 = itr.nextToken();
                String fnlstr = HammingDist(key1);
                con.write(new Text(key1), new Text(fnlstr));
            }
        }

        // Return a comma-prefixed list of dictionary words within Hamming distance 1 of ky.
        private String HammingDist(String ky)
        {
            String result = "";
            for (String x : Lst)
            {
                char[] s1 = ky.toCharArray();
                char[] s2 = x.toCharArray();
                int shorter = Math.min(s1.length, s2.length);
                int longer = Math.max(s1.length, s2.length);
                int distance = 0;
                for (int i = 0; i < shorter; i++) {
                    if (s1[i] != s2[i]) distance++;
                }
                // Positions past the end of the shorter word all count as mismatches.
                distance += longer - shorter;
                if (distance < 2)
                {
                    result = result + "," + x;
                }
            }
            // result is initialized to "", so it can never be null here;
            // an empty string simply means no match was found.
            return result;
        }
    }
  public static void main(String[] args) 
                  throws IOException, ClassNotFoundException, InterruptedException {
    Job job = new Job();
    job.setJarByClass(MapJoin.class);
    job.setJobName("MapJoin");
    job.setNumReduceTasks(0);    // map-only job: there is no reducer yet
    try {
        // Ship dictionary.txt to every mapper via the distributed cache.
        DistributedCache.addCacheFile(new URI("/Input/dictionary.txt"), job.getConfiguration());
    } catch (Exception e) {
        System.out.println(e);
    }
    job.setMapperClass(MyMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);

  }
}

For each token you find in map, you call HammingDist, which iterates over List<String> Lst and converts every item to a char[]. It would be better to replace List<String> Lst with a List<char[]> and add each element already converted, instead of converting the same words over and over; see the sketch below.
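
A minimal sketch of that change, assuming the surrounding MyMapper class and its imports stay as posted (the field name dict and the lower-cased hammingDist are just illustrative renames, not names from the original code): the dictionary words are converted to char[] once in setup(), and the distance routine works directly on the cached arrays.

    // Cache the dictionary as char arrays so map() never re-converts the same word.
    private List<char[]> dict = new ArrayList<char[]>();

    protected void setup(Context context) throws IOException, InterruptedException {
        Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        for (Path p : files) {
            if (p.getName().equals("dictionary.txt")) {
                BufferedReader reader = new BufferedReader(new FileReader(p.toString()));
                String line;
                while ((line = reader.readLine()) != null) {
                    dict.add(line.toCharArray());   // convert each word exactly once
                }
                reader.close();
            }
        }
    }

    private String hammingDist(String ky) {
        char[] s1 = ky.toCharArray();               // convert the query once per call
        StringBuilder result = new StringBuilder();
        for (char[] s2 : dict) {
            int shorter = Math.min(s1.length, s2.length);
            // The length difference already contributes that many mismatches.
            int distance = Math.abs(s1.length - s2.length);
            for (int i = 0; i < shorter && distance < 2; i++) {
                if (s1[i] != s2[i]) distance++;     // stop early once distance reaches 2
            }
            if (distance < 2) {
                result.append(',').append(s2);      // append(char[]) appends the word's characters
            }
        }
        return result.toString();
    }

This keeps the join itself O(n*n), but it removes the repeated String-to-char[] conversions, and the early exit in the inner loop stops comparing a pair as soon as its distance can no longer be below 2.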
