I am trying to write the following with Hadoop MapReduce. I have a log file that contains IP addresses followed by the URLs opened by the corresponding IP, like this:
192.168.72.224 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.facebook.com
192.168.198.92 www.google.com
192.168.198.92 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.198.92 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.m4maths.com
192.168.72.224 www.indiabix.com
192.168.198.92 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.facebook.com
192.168.198.92 www.indiabix.com
192.168.72.177 www.indiabix.com
192.168.72.224 www.google.com
192.168.198.92 www.askubuntu.com
192.168.198.92 www.askubuntu.com
192.168.198.92 www.facebook.com
192.168.198.92 www.gmail.com
192.168.72.177 www.facebook.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.72.224 www.yahoo.com
192.168.72.177 www.google.com
192.168.72.177 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.facebook.com
192.168.198.92 www.google.com
192.168.198.92 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.198.92 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.m4maths.com
192.168.72.224 www.indiabix.com
Now, I need to organize the results of this file so that it lists the distinct IP addresses and URLs, each followed by the number of times that IP opened that particular URL.
For example, if 192.168.72.224 opened www.yahoo.com 15 times across the whole log file, the output must contain:
192.168.72.224 www.yahoo.com 15
This should be done for all IPs in the file, and the final output should look like this:
192.168.72.224 www.yahoo.com 15
www.m4maths.com 11
192.168.72.177 www.yahoo.com 6
www.gmail.com 19
....
...
..
.
The code I have tried is:
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }
I know this code has serious flaws; please suggest an idea for moving forward.
Thank you.
I would propose this design:
- The mapper takes a line from the file and outputs the IP as the key and a pair of (website, 1) as the value.
- The combiner and the reducer take an IP as the key and a sequence of (website, count) pairs, aggregate them by website (using a HashMap), and output the IP, the website, and the count.
Implementing this will require you to implement a custom Writable to handle the pair.
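A minimal sketch of such a pair class, assuming we mirror the Writable contract (`write`/`readFields` over `DataOutput`/`DataInput`) but use only `java.io` so the sketch stays self-contained; in a real job the class would additionally declare `implements org.apache.hadoop.io.Writable`:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical (website, count) pair for the mapper/reducer value.
class UrlCountPair {
    private String url;
    private int count;

    UrlCountPair() {}                        // Hadoop needs a no-arg constructor for deserialization

    UrlCountPair(String url, int count) {
        this.url = url;
        this.count = count;
    }

    // Serialize the fields in a fixed order...
    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeInt(count);
    }

    // ...and read them back in the same order.
    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        count = in.readInt();
    }

    public String getUrl() { return url; }
    public int getCount() { return count; }
}
```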
Personally, I would do this with Spark, unless you care a lot about performance. With PySpark it would be as simple as:
    # Note: Python 2 syntax (print statements and tuple-unpacking lambda
    # parameters, which Python 3 removed).
    rdd = sc.textFile('/sparkdemo/log.txt')
    counts = rdd.map(lambda line: line.split()) \
                .map(lambda line: ((line[0], line[1]), 1)) \
                .reduceByKey(lambda x, y: x + y)
    result = counts.map(lambda ((ip, url), cnt): (ip, (url, cnt))).groupByKey().collect()
    for x in result:
        print 'IP: %s' % x[0]
        for w in x[1]:
            print '    website: %s count: %d' % (w[0], w[1])
The output for your sample is:
    IP: 192.168.72.224
        website: www.facebook.com count: 2
        website: www.m4maths.com count: 2
        website: www.google.com count: 5
        website: www.gmail.com count: 4
        website: www.indiabix.com count: 8
        website: www.yahoo.com count: 3
    IP: 192.168.72.177
        website: www.yahoo.com count: 14
        website: www.google.com count: 3
        website: www.facebook.com count: 3
        website: www.m4maths.com count: 3
        website: www.indiabix.com count: 1
    IP: 192.168.198.92
        website: www.facebook.com count: 4
        website: www.m4maths.com count: 3
        website: www.yahoo.com count: 3
        website: www.askubuntu.com count: 2
        website: www.indiabix.com count: 1
        website: www.google.com count: 5
        website: www.gmail.com count: 1
I wrote the same logic in Java:
    public class UrlHitMapper extends Mapper<Object, Text, Text, Text> {

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer st = new StringTokenizer(value.toString());
            if (st.countTokens() >= 2)   // make sure both tokens exist before consuming them
                context.write(new Text(st.nextToken()), new Text(st.nextToken()));
        }
    }
    public class UrlHitReducer extends Reducer<Text, Text, Text, Text> {

        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            HashMap<String, Integer> urlCount = new HashMap<>();

            // Count how many times each URL appears for this IP.
            for (Text value : values) {
                String url = value.toString();
                urlCount.put(url, urlCount.getOrDefault(url, 0) + 1);
            }

            for (Entry<String, Integer> e : urlCount.entrySet())
                context.write(key, new Text(e.getKey() + " " + e.getValue()));
        }
    }
    public class UrlHitCount extends Configured implements Tool {

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(), new UrlHitCount(), args));
        }

        public int run(String[] args) throws Exception {
            Job job = Job.getInstance(getConf());
            job.setJobName("url-hit-count");

            job.setJarByClass(UrlHitCount.class);   // was WordCount.class, which belongs to a different job
            job.setMapperClass(UrlHitMapper.class);
            job.setReducerClass(UrlHitReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            FileInputFormat.setInputPaths(job, new Path("input/urls"));
            FileOutputFormat.setOutputPath(job, new Path("url_output" + System.currentTimeMillis()));

            // Wait for completion instead of fire-and-forget submit(), and
            // return 0 on success as Tool expects.
            return job.waitForCompletion(true) ? 0 : 1;
        }
    }
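An alternative that avoids the custom pair Writable entirely: have the mapper emit a composite key of IP and URL (joined with a tab) with a count of 1, and let a plain sum reducer, which can then also serve as the combiner, do the counting. Here is the aggregation logic sketched in plain Java with no Hadoop classes, so you can test the idea locally (class and method names are illustrative, not from the code above):

```java
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class CompositeKeyCountSketch {
    // Count (ip, url) occurrences the way a sum reducer over composite keys would.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();   // sorted, like reducer input
        for (String line : lines) {
            StringTokenizer st = new StringTokenizer(line);
            if (st.countTokens() < 2)
                continue;                                 // skip malformed lines
            String key = st.nextToken() + "\t" + st.nextToken(); // composite key: ip \t url
            counts.merge(key, 1, Integer::sum);           // the "sum reducer" step
        }
        return counts;
    }
}
```

Because addition is associative, the same summing step can be registered as the combiner, which cuts down the data shuffled between map and reduce.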