How to ignore certain characters in a Lucene query (Hibernate Search)



I have indexed this entity:

@Entity
@Indexed
public class MyBean {
    @Id
    private Long id;
    @Field
    private String foo;
    @Field
    private String bar;
    @Field
    private String baz;
}

with this data:

+----+-------------+-------------+-------------+
| id |     foo     |     bar     |     baz     |
+----+-------------+-------------+-------------+
| 11 | an example  | ignore this | ignore this |
| 12 | ignore this | an e.x.a.m. | ignore this |
| 13 | not this    | not this    | not this    |
+----+-------------+-------------+-------------+ 

I need to find both 11 and 12 by searching for exam.

I tried this:

FullTextEntityManager fullTextEntityManager = 
    Search.getFullTextEntityManager(this.entityManager);
QueryBuilder queryBuilder = fullTextEntityManager.getSearchFactory()
    .buildQueryBuilder().forEntity(MyBean.class).get();
Query textQuery = queryBuilder.keyword()
    .onFields("foo", "bar", "baz").matching("exam").createQuery();
fullTextEntityManager.createFullTextQuery(textQuery, MyBean.class).getResultList();

But this only finds entity 11; I also need 12. Is that possible?

Adding a WordDelimiterFilter with the CATENATE_ALL flag to your analysis chain would be one possible solution: it splits tokens on delimiters (such as the dots in "e.x.a.m.") and, with that flag, also emits the concatenation of all the sub-parts as a single token, so "e.x.a.m." can match a search for "exam".
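To make the effect concrete, here is an illustrative plain-Java sketch (not Lucene code, and a deliberate simplification: the real filter also splits on case changes, letter/digit transitions, etc.) of what the CATENATE_ALL concatenation produces for a delimited token:

    // Simplified illustration of WordDelimiterFilter's CATENATE_ALL behavior:
    // the sub-parts of a token split on delimiter characters are joined back
    // into one concatenated token. The class and method names are hypothetical.
    public class CatenateAllDemo {

        // Keep only the letters/digits, dropping delimiter characters.
        static String catenateAll(String token) {
            StringBuilder sb = new StringBuilder();
            for (char c : token.toCharArray()) {
                if (Character.isLetterOrDigit(c)) {
                    sb.append(c);
                }
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            System.out.println(catenateAll("e.x.a.m.")); // exam
            System.out.println(catenateAll("wi-fi"));    // wifi
        }
    }

With the catenated token "exam" in the index, the keyword query for "exam" in the question would match row 12 as well.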

So an analyzer implementation modeled on StandardAnalyzer could look like this:

// Imports shown for Lucene 5.x; package locations differ in later versions.
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;

public class StandardWithWordDelim extends StopwordAnalyzerBase {
    public static final CharArraySet STOP_WORDS_SET = StopAnalyzer.ENGLISH_STOP_WORDS_SET;

    public StandardWithWordDelim() {
        // Pass the stop words up so the inherited "stopwords" field is populated.
        super(STOP_WORDS_SET);
    }

    @Override
    protected TokenStreamComponents createComponents(final String fieldName) {
        StandardTokenizer src = new StandardTokenizer();
        src.setMaxTokenLength(255);
        TokenStream filter = new StandardFilter(src);
        filter = new LowerCaseFilter(filter);
        filter = new StopFilter(filter, stopwords);
        //I'm inclined to add it here, so the abbreviation "t.h.e." doesn't get whacked by the StopFilter.
        filter = new WordDelimiterFilter(filter, WordDelimiterFilter.CATENATE_ALL, null);
        return new TokenStreamComponents(src, filter);
    }
}

It doesn't look like you are using the standard analyzer (NGrams, perhaps?), but you should be able to incorporate this filter into your analysis chain somewhere.
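Whatever chain you end up with, the custom analyzer still has to be attached to the indexed fields. A minimal sketch, assuming the Hibernate Search 5 annotations (@Field's analyzer attribute and @Analyzer from org.hibernate.search.annotations):

    // Sketch only: attaches the custom analyzer to the fields that need it.
    @Entity
    @Indexed
    public class MyBean {
        @Id
        private Long id;
        @Field(analyzer = @Analyzer(impl = StandardWithWordDelim.class))
        private String foo;
        @Field(analyzer = @Analyzer(impl = StandardWithWordDelim.class))
        private String bar;
        @Field(analyzer = @Analyzer(impl = StandardWithWordDelim.class))
        private String baz;
    }

Note that after changing an analyzer you need to rebuild the index (e.g. with the mass indexer) for existing rows to pick up the new tokenization.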
