(How) can I use Apache Tika to search .doc, .pdf, .java (etc.) files for a phrase?



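Windows 7 search rarely works for me, even when the drive I'm searching is indexed.

Ever since I discovered that Windows 7 doesn't have XP's "search dog", and then found its search nearly impossible to use and almost completely unreliable (i.e., since 2010), I've been frustrated, so I wrote my own search program in Java, Searchy.

But while it allows complex filename pattern matching (.DOC*, .PDF, .XL*, .TXT, .XML are legal inputs), Searchy can't search the CONTENTS of files for words and phrases such as private protected.

I found Apache Tika, downloaded a .jar file, and imported it into NetBeans 8.0.2 so that the sample program tika-example, given below, compiles (somewhat surprisingly).

The blurb at this link makes me think Apache Tika is what I should be using in Searchy: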

The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.

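I don't know how to use it intelligently, but if I can figure out how to process one file to see whether it contains a particular String, I think I'll be well on my way to making that process work in Searchy, as a set of methods in a class I create.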

tika-example

package org.apache.tika.example;
import java.io.File;
import org.apache.commons.io.FileUtils;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.language.LanguageProfile;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MimeTypes;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
/**
 * Demonstrates how to call the different components within Tika: its
 * {@link Detector} framework (aka MIME identification and repository), its
 * {@link Parser} interface, its {@link LanguageIdentifier} and other goodies.
 */
public class MyFirstTika {
    public static void main(String[] args) throws Exception {
        String filename = "Test.Docx";//args[0];
        MimeTypes mimeRegistry = TikaConfig.getDefaultConfig()
                .getMimeRepository();
        System.out.println("Examining: [" + filename + "]");
        System.out.println("The MIME type (based on filename) is: ["
                + mimeRegistry.getMimeType(filename) + "]");
        System.out.println("The MIME type (based on MAGIC) is: ["
                + mimeRegistry.getMimeType(new File(filename)) + "]");
        Detector mimeDetector = (Detector) mimeRegistry;
        System.out
                .println("The MIME type (based on the Detector interface) is: ["
                        + mimeDetector.detect(new File(filename).toURI().toURL()
                                .openStream(), new Metadata()) + "]");
        LanguageIdentifier lang = new LanguageIdentifier(new LanguageProfile(
                FileUtils.readFileToString(new File(filename))));
        System.out.println("The language of this content is: ["
                + lang.getLanguage() + "]");
        Parser parser = TikaConfig.getDefaultConfig().getParser(
                MediaType.parse(mimeRegistry.getMimeType(filename).getName()));
        Metadata parsedMet = new Metadata();
        ContentHandler handler = new BodyContentHandler();
        parser.parse(new File(filename).toURI().toURL().openStream(), handler,
                parsedMet, new ParseContext());
        System.out.println("Parsed Metadata: ");
        System.out.println(parsedMet);
        System.out.println("Parsed Text: ");
        System.out.println(handler.toString());
    }
}

Although it compiles, I wasn't surprised to get a runtime error:

run:
Examining: [Test.Docx]
The MIME type (based on filename) is: [application/vnd.openxmlformats-officedocument.wordprocessingml.document]
The MIME type (based on MAGIC) is: [application/vnd.openxmlformats-officedocument.wordprocessingml.document]
The MIME type (based on the Detector interface) is: [application/octet-stream]
The language of this content is: [lt]
Exception in thread "main" org.apache.tika.exception.TikaException: Error creating OOXML extractor
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:123)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
    at org.apache.tika.example.MyFirstTika.main(MyFirstTika.java:56)
Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: Package should contain a content type part [M1.13]
    at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:203)
    at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:684)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:275)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:73)
    ... 2 more
Java Result: 1
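
A quick way to narrow down an error like this one is to look at the file's first bytes. A .docx is a ZIP package, whereas a legacy .doc is an OLE2 compound document, and the example above picks its parser from the filename alone, so a file whose extension and contents disagree would be handed to the wrong parser. The sketch below is only a diagnostic idea and an assumption about what might be wrong with Test.Docx, not a confirmed cause; `ContainerSniffer` and `sniff` are hypothetical names, and only the JDK is used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

class ContainerSniffer {
    // ZIP local-file-header signature "PK\003\004"; OOXML files (.docx, .xlsx) are ZIP packages.
    private static final int[] ZIP = {0x50, 0x4B, 0x03, 0x04};
    // OLE2 compound-document signature used by legacy .doc and .xls files.
    private static final int[] OLE2 = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

    /** Classifies the leading bytes of a file as "zip", "ole2", or "other". */
    static String sniff(byte[] head) {
        if (startsWith(head, ZIP)) {
            return "zip";
        }
        if (startsWith(head, OLE2)) {
            return "ole2";
        }
        return "other";
    }

    /** Reads up to 8 bytes from the file and classifies them. */
    static String sniff(Path file) throws IOException {
        byte[] head = new byte[8];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int r;
            // A single read may return fewer bytes than requested, so loop until full or EOF.
            while (n < head.length && (r = in.read(head, n, head.length - n)) > 0) {
                n += r;
            }
        }
        return sniff(Arrays.copyOf(head, n));
    }

    private static boolean startsWith(byte[] data, int[] signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ContainerSniffer <file>");
            return;
        }
        System.out.println(args[0] + " looks like: " + sniff(Paths.get(args[0])));
    }
}
```

If a file named .docx does not report "zip" here, the filename-based MIME lookup and the actual content disagree, and choosing the parser from the content (or saving the file in the format its name claims) would be the next thing to try.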

Because I was getting the error below, I supplied the file for it to open: Test.doc, which contains 3 lines that say "Testing".

Exception in thread "main" java.io.FileNotFoundException: C:\Users\Dov\Google Drive\NetBeansProjects\tika-example\tikaExample\Test.Doc (The system cannot find the file specified)

I found spring.xml and pom.xml in the folder C:\Users\Dov\Downloads\tika-1.9-src\tika-1.9\tika-example, but I don't know what to do with them, if anything.

spring.xml:

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
                           http://www.springframework.org/schema/beans/spring-beans-3.0.xsd">
<!--<start id="spring"/>-->
  <bean id="tika" class="org.apache.tika.parser.AutoDetectParser">
    <constructor-arg>
        <list>
           <ref bean="txt"/>
           <ref bean="pdf"/>
        </list>
    </constructor-arg>
  </bean>
  <bean id="txt" class="org.apache.tika.parser.txt.TXTParser"/>
  <bean id="pdf" class="org.apache.tika.parser.pdf.PDFParser"/>
<!--<end id="spring"/>-->
</beans>

pom.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <parent>
        <artifactId>tika-parent</artifactId>
        <groupId>org.apache.tika</groupId>
        <version>1.9</version>
        <relativePath>../tika-parent/pom.xml</relativePath>
      </parent>
      <modelVersion>4.0.0</modelVersion>
      <artifactId>tika-example</artifactId>
      <name>Apache Tika examples</name>
      <url>http://tika.apache.org/</url>
      <description>This module contains examples of how to use Apache Tika.</description>
      <organization>
        <name>The Apache Software Foundation</name>
        <url>http://www.apache.org</url>
      </organization>
      <scm>
        <url>http://svn.apache.org/viewvc/tika/tags/1.9-rc2/tika-example</url>
        <connection>scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.9-rc2/tika-example</connection>
        <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.9-rc2/tika-example</developerConnection>
      </scm>
      <issueManagement>
        <system>JIRA</system>
        <url>https://issues.apache.org/jira/browse/TIKA</url>
      </issueManagement>
      <ciManagement>
        <system>Jenkins</system>
        <url>https://builds.apache.org/job/Tika-trunk/</url>
      </ciManagement>
      <!-- List of dependencies that we depend on for the examples. See the full list of Tika
           modules and how to use them at http://mvnrepository.com/artifact/org.apache.tika.-->
      <dependencies>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-app</artifactId>
            <version>${project.version}</version>
            <exclusions>
              <exclusion>
                <artifactId>tika-parsers</artifactId>
                <groupId>org.apache.tika</groupId>
              </exclusion>
            </exclusions>
        </dependency>  
        <dependency>
          <groupId>org.apache.tika</groupId>
          <artifactId>tika-parsers</artifactId>
          <version>${project.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.tika</groupId>
          <artifactId>tika-serialization</artifactId>
          <version>${project.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.tika</groupId>
          <artifactId>tika-translate</artifactId>
          <version>${project.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.tika</groupId>
          <artifactId>tika-parsers</artifactId>
          <version>${project.version}</version>
          <type>test-jar</type>
          <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>javax.jcr</groupId>
            <artifactId>jcr</artifactId>
            <version>2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.jackrabbit</groupId>
            <artifactId>jackrabbit-jcr-server</artifactId>
            <version>2.3.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.jackrabbit</groupId>
            <artifactId>jackrabbit-core</artifactId>
            <version>2.3.6</version>
        </dependency>       
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>3.5.0</version>
        </dependency>   
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.4</version>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-context</artifactId>
            <version>3.0.2.RELEASE</version>
        </dependency>
        <dependency>
          <groupId>junit</groupId>
          <artifactId>junit</artifactId>
          <scope>test</scope>
        </dependency>
      </dependencies>
    </project>

Any help with the error, or with what to do with the XML files in NetBeans to make the tika-example program work, would be appreciated.

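I figured out how to use it intelligently. I got it to give correct output for whether .doc, XLSX, and .pdf files contain a given string, so evidently the two XML files aren't needed. (This uses the imports from the original question.)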

    // In addition to the imports from the original question:
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.mime.MimeTypeException;
    import org.xml.sax.SAXException;

    public class MyFirstTika {
      // Returns true if the text Tika extracts from the file contains s, ignoring case.
      public static boolean contains(File file, String s) throws
         IOException, MimeTypeException, SAXException, TikaException{
        ContentHandler handler = new BodyContentHandler();
        MimeTypes mimeRegistry = TikaConfig.getDefaultConfig().getMimeRepository();
        // Choose the parser from the MIME type detected for the file.
        Parser parser = TikaConfig.getDefaultConfig().getParser(MediaType.parse(mimeRegistry.getMimeType(file).getName()));
        Metadata parsedMet = new Metadata();
        // try-with-resources closes the stream even if parsing fails.
        try (InputStream in = new FileInputStream(file)) {
          parser.parse(in, handler, parsedMet, new ParseContext());
        }
        System.out.println("Handler:\n\n******" + handler + "\n\n*****" );
        return handler.toString().toLowerCase().contains(s.toLowerCase());
      }
      public static void main(String[] args) throws Exception
      {
        String searchString = "champion";
        String filename = "schedule.pdf"; //test.docx";//"meds.xlsx";//Test2.Doc";
        File file = new File(filename);
        System.out.println(file + " contains " + searchString + ": "
                + contains(file, searchString));
      }
    }
Sample output:

    Handler:
    ******
    DUBLIN YOUTH ATHLETICS
    Game Schedule  2014-2015
    Girls 6th-8th Grade League
    Dream
    Game Day Date Gym Time Home (White) Visitor (Green)
    1 Sunday 12/7/2014 Sells 4:00 PM Dream Sparks
    7 Sunday 12/14/2014 Sells 2:00 PM Fever Dream
    13 Sunday 1/4/2015 Sells 6:00 PM Stars Dream
    Championship 3/8/2015
    *****
    schedule.pdf contains champion: true
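
To carry this over to a folder search such as Searchy, the per-file check can sit behind a small interface so that the directory walk does not care where the text comes from. The following is a minimal sketch under assumptions, not how Searchy actually works: `ContentSearch`, `TextExtractor`, `containsPhrase`, and `search` are hypothetical names, and the demo reads plain text with the JDK where a Tika-based extractor would be plugged in. It also collapses whitespace before comparing, because extracted text can break a phrase across lines.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ContentSearch {

    /** Hypothetical hook that turns a file into plain text (a Tika call would go here). */
    interface TextExtractor {
        String extract(Path file) throws Exception;
    }

    /** Case-insensitive phrase test that treats any run of whitespace as a single space. */
    static boolean containsPhrase(String text, String phrase) {
        return normalize(text).contains(normalize(phrase));
    }

    private static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim().toLowerCase(Locale.ROOT);
    }

    /** Returns the files under root whose name matches the glob and whose text contains the phrase. */
    static List<Path> search(Path root, String glob, String phrase,
                             TextExtractor extractor) throws IOException {
        // Note: whether glob matching ignores case depends on the platform's file system.
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<Path> candidates;
        try (Stream<Path> paths = Files.walk(root)) {
            candidates = paths
                    .filter(Files::isRegularFile)
                    .filter(p -> matcher.matches(p.getFileName()))
                    .collect(Collectors.toList());
        }
        List<Path> hits = new ArrayList<>();
        for (Path p : candidates) {
            try {
                if (containsPhrase(extractor.extract(p), phrase)) {
                    hits.add(p);
                }
            } catch (Exception e) {
                // One unreadable file should not abort the whole search.
                System.err.println("Skipping " + p + ": " + e);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("searchy-demo");
        Files.write(root.resolve("schedule.txt"),
                "Game Schedule\n2014-2015\nChampionship 3/8/2015".getBytes(StandardCharsets.UTF_8));
        Files.write(root.resolve("other.txt"),
                "nothing relevant".getBytes(StandardCharsets.UTF_8));
        // Plain-text stand-in for the extractor; assumes UTF-8 files.
        TextExtractor plainText = p -> new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
        for (Path hit : search(root, "*.txt", "schedule 2014", plainText)) {
            System.out.println("match: " + hit);
        }
    }
}
```

With Tika on the classpath, the `contains` method above (or any other way of getting the extracted text) could stand in for the plain-text extractor; the walk and the phrase test stay the same.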
