Hadoop error when processing files with brackets in their names



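I have many files of different types: *.doc, *.pdf, and so on. I want to process them with MapReduce.

I put them into HDFS and then launch a Java MapReduce program from Hue.

Everything works as long as the files are well-formed and their names contain no brackets such as "(){}[]".

But if there is a file named OPN_last_[age.PDF

I get this error: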

    Failing Oozie Launcher, Main class [distr.fors.ru.Index], main() threw exception, Illegal file pattern: Unclosed character class near index 17
    OPN_last_[age.PDF
    ^
    java.io.IOException: Illegal file pattern: Unclosed character class near index 17
    OPN_last_[age.PDF
    ^
    at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:70)
    at org.apache.hadoop.fs.GlobFilter.<init>(GlobFilter.java:49)
    at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1670)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1627)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:211)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1063)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1080)
    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:992)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:945)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:945)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:566)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:596)
    at distr.fors.ru.Index.run(Index.java:78)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at distr.fors.ru.Index.main(Index.java:39)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:495)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
    Caused by: java.util.regex.PatternSyntaxException: Unclosed character class near index 17
    OPN_last_[age.PDF
    ^
    at org.apache.hadoop.fs.GlobPattern.error(GlobPattern.java:167)
    at org.apache.hadoop.fs.GlobPattern.set(GlobPattern.java:151)
    at org.apache.hadoop.fs.GlobPattern.<init>(GlobPattern.java:42)
    at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:66)
    ... 32 more

If there is a file like this: {2011-01-27} (3769330).pdf

I get this error:

    Input Pattern hdfs://fd-bigdata.distr.fors.ru:8020/{2011-01-27} (3769330).pdf matches 0 files 
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231) 
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1063) 
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1080) 
    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174) 
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:992) 
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:945) 
    at java.security.AccessController.doPrivileged(Native Method) 
    at javax.security.auth.Subject.doAs(Subject.java:396) 
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) 
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:945) 
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:566) 
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:596) 
    at distr.fors.ru.Index.run(Index.java:76) 
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) 
    at distr.fors.ru.Index.main(Index.java:37) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) 
    at java.lang.reflect.Method.invoke(Method.java:597) 
    at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:495) 
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) 
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417) 
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332) 
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268) 
    at java.security.AccessController.doPrivileged(Native Method) 
    at javax.security.auth.Subject.doAs(Subject.java:396) 
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) 
    at org.apache.hadoop.mapred.Child.main(Child.java:262)

I really need to process files like these. What can I do to solve this problem?

P.S. I am using the latest CDH 4.4.0.

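Hadoop treats the input path as a glob pattern, so characters such as '[' and '{' are interpreted as pattern syntax. To have them taken literally, escape them with a backslash. In a Java string literal the backslash itself must be doubled, so you write "\\[" in source code to produce \[ in the path:

'[' => '\['
'}' => '\}'

This worked for me in Java, Pig, and Oozie. I hope it solves your problem too.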
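
As a minimal sketch of that escaping, the helper below prefixes a backslash to each character it treats as a glob metacharacter. The class name `GlobEscaper`, the method `escapeGlob`, and the exact character set are assumptions made for illustration; they are not part of Hadoop, and I have not verified that this alone is sufficient on CDH 4.4.0.

```java
public class GlobEscaper {
    // Characters this sketch treats as glob metacharacters (an assumption;
    // adjust the set if your Hadoop version interprets other characters).
    private static final String GLOB_SPECIAL = "\\*?[]{}";

    /** Returns the name with a backslash in front of every glob metacharacter. */
    public static String escapeGlob(String name) {
        StringBuilder sb = new StringBuilder(name.length() + 8);
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (GLOB_SPECIAL.indexOf(c) >= 0) {
                sb.append('\\'); // a single backslash character in the output
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints OPN_last_\[age.PDF
        System.out.println(escapeGlob("OPN_last_[age.PDF"));
        // prints \{2011-01-27\} (3769330).pdf
        System.out.println(escapeGlob("{2011-01-27} (3769330).pdf"));
    }
}
```

You would apply this to the file name before building the input path for the job. Parentheses and spaces are left untouched here because the stack traces above only show the glob parser rejecting '[' and misreading '{...}'.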
