How to process PDF files with the help of MapReduce

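The task is to parse many PDFs using Hadoop MapReduce. I think the whole process should happen in the mapper only. Where do I start? What would the mapper look like?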

I agree: the parsing can be done in the Mapper, and the Reducer only outputs the results without any aggregation.

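Taking Hadoop, the widely used MapReduce framework, as an example, you need to define your own data type using Writable. Suppose it is named MyPdfFile: each MyPdfFile instance represents one PDF file and holds the contents of the input PDF, possibly along with other information. MyPdfFile should include a method, getConvertedText, that converts the PDF contents to text. For handling PDF files in Java, try Apache PDFBox.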
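As a rough illustration of the serialization half of such a type, here is a minimal sketch. It is only an assumption about how MyPdfFile might be laid out: the fileName field and the toBytes/fromBytes helpers are hypothetical, and the `implements org.apache.hadoop.io.Writable` clause is deliberately left off so the sketch compiles with the JDK alone. The write and readFields methods use the same signatures that Writable requires.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch of the MyPdfFile value type. In a real Hadoop job this class would
// also declare "implements org.apache.hadoop.io.Writable"; the clause is
// omitted here so the example needs nothing beyond the JDK.
class MyPdfFile {
  private String fileName = "";          // hypothetical extra field
  private byte[] content = new byte[0];  // raw bytes of the PDF

  // Hadoop creates values reflectively, so a no-arg constructor is needed.
  MyPdfFile() {}

  MyPdfFile(String fileName, byte[] content) {
    this.fileName = fileName;
    this.content = content.clone();
  }

  // Same signature as Writable.write: name first, then length-prefixed bytes.
  public void write(DataOutput out) throws IOException {
    out.writeUTF(fileName);
    out.writeInt(content.length);
    out.write(content);
  }

  // Same signature as Writable.readFields: reads fields in the order written.
  public void readFields(DataInput in) throws IOException {
    fileName = in.readUTF();
    content = new byte[in.readInt()];
    in.readFully(content);
  }

  String getFileName() { return fileName; }

  byte[] getContent() { return content.clone(); }

  // Hypothetical helper: serialize to a byte array outside of Hadoop.
  byte[] toBytes() {
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      write(new DataOutputStream(buffer));
      return buffer.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Hypothetical helper: rebuild an instance from serialized bytes.
  static MyPdfFile fromBytes(byte[] bytes) {
    try {
      MyPdfFile file = new MyPdfFile();
      file.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
      return file;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The getConvertedText method would also live on this class and hand the content bytes to PDFBox for text extraction. It is not shown here because it depends on the PDFBox version in use.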

The Mapper might then look like this:

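import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Converts each PDF to text on the map side; the input key is passed through unchanged.
class PdfToTxtMapper extends Mapper<Text, MyPdfFile, Text, Text> {
  @Override
  public void map(Text inputKey, MyPdfFile inputValue, Context context) throws IOException, InterruptedException {
    Text outputKey = new Text(inputKey);
    // getConvertedText() is the conversion method defined on MyPdfFile.
    Text outputVal = inputValue.getConvertedText();
    context.write(outputKey, outputVal);
  }
}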

Hope this helps.
