I have a log file like the following:
Begin ... 12-07-2008 02:00:05 ----> record1
incidentID: inc001
description: blah blah blah
owner: abc
status: resolved
end .... 13-07-2008 02:00:05
Begin ... 12-07-2008 03:00:05 ----> record2
incidentID: inc002
description: blah blah blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah
owner: abc
status: resolved
end .... 13-07-2008 03:00:05
I want to process this with MapReduce, extracting the incident ID, the status, and the time each incident took.
How do I handle records like these two, given that they have variable lengths, and what happens if an input split falls before a record's end?
You will need to write your own input format and record reader to ensure proper file splitting around your record delimiters.
Basically, your record reader needs to seek to its split's byte offset, then scan forward (reading lines) until it finds one of the following:

- a `Begin ...` line: read up to the next `end ...` line and provide the lines between the begin and the end as the input for the next record
- it scans past the end of the split, or reaches EOF
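The grouping step above can be sketched in plain Java. This is a minimal sketch of the scanning logic only; in a real `RecordReader` the stream would first be positioned at the split's byte offset, and the class and method names here are illustrative, not part of any Hadoop API:

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class RecordScanner {
    // Group the lines between a "Begin ..." marker and the next
    // "end ..." marker into one record. Reaching EOF stops the scan
    // (a real RecordReader would also stop past the split boundary).
    public static List<String> readRecords(BufferedReader in) throws Exception {
        List<String> records = new ArrayList<>();
        StringBuilder current = null;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("Begin")) {
                current = new StringBuilder();          // start of a new record
            } else if (line.startsWith("end")) {
                if (current != null) {                  // end of the open record
                    records.add(current.toString().trim());
                    current = null;
                }
            } else if (current != null) {
                current.append(line).append('\n');      // body line between Begin and end
            }
        }
        return records;
    }

    public static void main(String[] args) throws Exception {
        String log = "Begin ... 12-07-2008 02:00:05 ----> record1\n"
                   + "incidentID: inc001\n"
                   + "status: resolved\n"
                   + "end .... 13-07-2008 02:00:05\n"
                   + "Begin ... 12-07-2008 03:00:05 ----> record2\n"
                   + "incidentID: inc002\n"
                   + "status: resolved\n"
                   + "end .... 13-07-2008 03:00:05\n";
        List<String> records = readRecords(new BufferedReader(new StringReader(log)));
        System.out.println(records.size());  // 2
    }
}
```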
This is algorithmically similar to how Mahout's XmlInputFormat handles multi-line XML as input; in fact, you may be able to amend that source code directly to handle your situation.
As mentioned in @irW's answer, NLineInputFormat is another option if your records have a fixed number of lines each, but it is very inefficient for larger files, as it has to open and read the entire file to discover the line offsets in the input format's getSplits() method.
In your example, every record has the same number of lines. If that is the case, you could use NLineInputFormat; if the number of lines per record cannot be known, it will be more difficult. (More on NLineInputFormat: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html)
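Whichever input format delivers the records, extracting the three fields the question asks for (incident ID, status, elapsed time) is then plain string work in the mapper. A hedged plain-Java sketch; the `dd-MM-yyyy HH:mm:ss` timestamp format and the field labels are assumptions read off the sample log, and `IncidentParser` is an illustrative name, not a Hadoop class:

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IncidentParser {
    // Timestamps in the sample look like "12-07-2008 02:00:05".
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("dd-MM-yyyy HH:mm:ss");
    static final Pattern TS = Pattern.compile("\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}:\\d{2}");

    // Pull the first timestamp out of a Begin/end line, or null if absent.
    static LocalDateTime timestamp(String line) {
        Matcher m = TS.matcher(line);
        return m.find() ? LocalDateTime.parse(m.group(), FMT) : null;
    }

    // Turn one whole record (all lines from "Begin ..." through "end ...")
    // into a tab-separated "id<TAB>status<TAB>elapsedSeconds" value.
    public static String parse(String record) {
        String id = null, status = null;
        LocalDateTime begin = null, end = null;
        for (String line : record.split("\n")) {
            if (line.startsWith("Begin")) begin = timestamp(line);
            else if (line.startsWith("end")) end = timestamp(line);
            else if (line.startsWith("incidentID:")) id = line.substring("incidentID:".length()).trim();
            else if (line.startsWith("status:")) status = line.substring("status:".length()).trim();
        }
        long seconds = Duration.between(begin, end).getSeconds();
        return id + "\t" + status + "\t" + seconds;
    }

    public static void main(String[] args) {
        String record = "Begin ... 12-07-2008 02:00:05 ----> record1\n"
                      + "incidentID: inc001\n"
                      + "description: blah blah blah\n"
                      + "owner: abc\n"
                      + "status: resolved\n"
                      + "end .... 13-07-2008 02:00:05";
        System.out.println(parse(record)); // inc001	resolved	86400
    }
}
```

In a real job this logic would sit in the mapper's map() method, emitting the incident ID as the key.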