Solr/Lucene experts, please help me design a simple keyword search over an index of PDFs

I have tried Solr, but I could not find an approach that fits my needs.

What I have:

A bunch of PDF files. A set of keywords.

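What I want to achieve:

Index the PDF files (Solr Cell, done).
Search for the keywords (works fine).
Customize the output so that it shows the name of the PDF file and the excerpt in which the keyword occurs (no idea how).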

I tried tweaking the ResponseHandler, schema.xml, and solrconfig.xml, without success.

Lucene/Solr experts, do you think what I am trying to achieve is possible?

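I have put my existing code on GitHub at https://github.com/ThinkCode/solr_search. It is mostly the default Solr example with minor changes to the fields (all content is stored in a single content field).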

The notable changes to schema.xml are as follows:

Schema.xml:

<solrQueryParser defaultOperator="AND"/>
   <field name="id" type="string" indexed="true" stored="true" required="true" />
   <field name="content" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
   <dynamicField name="*" type="string"    indexed="true"  stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
<solrQueryParser defaultOperator="AND"/>
<copyField source="*" dest="content"/>

Current output:

(Query) http://localhost:8983/solr/select/?q=Java+Servlet&version=2.2&start=0&rows=10&indent=on

<response><lst name="responseHeader"><int name="status">0</int><int name="QTime">13</int><lst name="params"><str name="indent">on</str><str name="start">0</str><str name="q">Java Servlet</str><str name="version">2.2</str><str name="rows">10</str></lst></lst>
<result name="response" numFound="1" start="0"><doc><arr name="content_type"><str>application/pdf</str></arr><str name="id">tutorial.pdf</str><str name="subject">Solr</str><arr name="title"><str>Solr tutorial</str></arr></doc></result></response>

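What I am looking for is the "extracted snippet (line) in which the keyword was found".

In the query above, I search for 'Java Servlet' and it returns the document. I would like the surrounding context, "Solr can run in any Java Servlet Container of your choice", to be returned in the output XML.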

To get the text snippets surrounding the matched keywords, use Solr's highlighting feature; see http://wiki.apache.org/solr/HighlightingParameters
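
As a rough illustration of the highlighting approach, the Python sketch below builds a query URL with highlighting switched on for the `content` field and reads the snippets out of the `highlighting` section of an XML response. The helper names (`build_highlight_url`, `parse_highlights`), the chosen `hl.snippets` and `hl.fragsize` values, and the sample response are assumptions made for the example; the sample is hand-written to show the expected shape and is not real output from this index.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_highlight_url(base, query, field="content", snippets=3, fragsize=100):
    """Return a select URL that asks Solr for highlighted snippets."""
    params = {
        "q": query,
        "fl": "id",               # only return the id for each hit
        "hl": "true",             # turn highlighting on
        "hl.fl": field,           # field to take snippets from (must be stored)
        "hl.snippets": snippets,  # maximum snippets per document (assumed value)
        "hl.fragsize": fragsize,  # approximate snippet length in characters (assumed value)
    }
    return base + "?" + urlencode(params)


def parse_highlights(xml_text, field="content"):
    """Map each document id to the list of snippets found for `field`."""
    root = ET.fromstring(xml_text)
    result = {}
    for doc in root.findall("./lst[@name='highlighting']/lst"):
        strs = doc.findall("./arr[@name='%s']/str" % field)
        result[doc.get("name")] = [s.text or "" for s in strs]
    return result


# Hand-written sample showing the shape of a highlighted XML response.
SAMPLE_RESPONSE = """<response>
<result name="response" numFound="1" start="0">
  <doc><str name="id">tutorial.pdf</str></doc>
</result>
<lst name="highlighting">
  <lst name="tutorial.pdf">
    <arr name="content">
      <str>Solr can run in any &lt;em&gt;Java&lt;/em&gt; &lt;em&gt;Servlet&lt;/em&gt; Container of your choice</str>
    </arr>
  </lst>
</lst>
</response>"""

if __name__ == "__main__":
    print(build_highlight_url("http://localhost:8983/solr/select", "Java Servlet"))
    for doc_id, snippets in parse_highlights(SAMPLE_RESPONSE).items():
        for snippet in snippets:
            print(doc_id, "->", snippet)
```

Each key of the `highlighting` section is a document id, so with the file name used as the id (as in the response shown in the question) the file name and its excerpts come back together.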

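To return the file name of the indexed PDF as part of the response, simply add a field that holds this information (it should be a string field, stored but not indexed). Of course, you have to populate this new field at index time.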
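
One possible way to populate such a field at index time is Solr Cell's `literal.<fieldname>` request parameters, which attach a fixed value to the document being extracted. The sketch below only builds the request URL; the field name `filename` and the helper `build_extract_url` are hypothetical, and using the file name as the `id` simply mirrors the response shown in the question.

```python
import os
from urllib.parse import urlencode


def build_extract_url(base, pdf_path, filename_field="filename"):
    """Return a Solr Cell URL that stores the PDF's file name in its own field."""
    name = os.path.basename(pdf_path)
    params = {
        "literal.id": name,               # unique key; here simply the file name
        "literal." + filename_field: name,  # hypothetical stored string field
        "commit": "true",                 # make the document searchable right away
    }
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_extract_url("http://localhost:8983/solr/update/extract",
                            "docs/tutorial.pdf"))
```

The PDF itself would then be sent as the body of a POST to this URL, and the stored field can be requested back with the `fl` parameter at query time.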

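A standalone solution using PDFBox and Apache Lucene is available at https://github.com/WolfgangFahl/pdfindexer. It creates an HTML file with links to the corresponding pages of the PDF files where the keywords were found.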
