Deep NLP pipeline with Whoosh



I am new to NLP and IR programming. I am trying to implement a deep NLP pipeline, i.e., adding lemmatizing and dependency-parsing features to the indexing of sentences. Below are my schema and searcher.

my_analyzer = RegexTokenizer() | StopFilter() | LowercaseFilter() | StemFilter() | Lemmatizer()
pos_analyser = RegexTokenizer() | StopFilter() | LowercaseFilter() | PosTagger()
schema = Schema(id=ID(stored=True, unique=True), stem_text=TEXT(stored=True, analyzer=my_analyzer), pos_tag=pos_analyser)

for sentence in sent_tokenize_list1:
    writer.add_document(stem_text=sentence, pos_tag=sentence)
for sentence in sent_tokenize_list2:
    writer.add_document(stem_text=sentence, pos_tag=sentence)
writer.commit()

with ix.searcher() as searcher:
    og = qparser.OrGroup.factory(0.9)
    query_text = MultifieldParser(["stem_text", "pos_tag"], schema=ix.schema, group=og).parse(
        "who is controlling the threat of locusts?")
    results = searcher.search(query_text, sortedby=scores, limit=10)

Here is the custom analyzer.

import itertools

from nltk import pos_tag
from whoosh.analysis import Filter


class PosTagger(Filter):
    def __init__(self):
        self.cache = {}

    def __eq__(self, other):
        return (other
                and self.__class__ is other.__class__
                and self.__dict__ == other.__dict__)

    def __ne__(self, other):
        return not self == other

    def __call__(self, tokens):
        assert hasattr(tokens, "__iter__")
        # Collect the token texts so nltk can tag the whole sentence at once.
        tokens1, tokens2 = itertools.tee(tokens)
        words = [t.text for t in tokens1]
        tags = pos_tag(words)
        # Rewrite each token's text as "word TAG" before passing it on.
        for i, t in enumerate(tokens2):
            t.text = tags[i][0] + " " + tags[i][1]
            yield t
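
For reference, the custom filter can be exercised on its own by calling the analyzer chain directly on a string (a minimal sketch, assuming nltk's pos_tag and its averaged_perceptron_tagger model are installed):

from whoosh.analysis import RegexTokenizer, StopFilter, LowercaseFilter

# Build the same chain as above and run it over a sample sentence.
pos_analyser = RegexTokenizer() | StopFilter() | LowercaseFilter() | PosTagger()
print([t.text for t in pos_analyser("who is controlling the threat of locusts?")])
# Each surviving token's text should come out as "word TAG", e.g. "locusts NNS".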

I am getting the following error.

whoosh.fields.FieldConfigurationError: CompositeAnalyzer(RegexTokenizer(expression=re.compile('\\w+(\\.?\\w+)*'), gaps=False), StopFilter(stops=frozenset({'for', 'will', 'tbd', 'with', 'and', 'if', 'it', 'by', 'is', 'as', 'we', 'or', 'from', 'you', 'can', 'be', 'to', 'on', 'a', 'an', 'your', 'at', 'in', ...}), min=2, max=None, renumber=True), LowercaseFilter(), PosTagger(cache={})) is not a FieldType object

Am I doing something wrong? Is this the correct way to add an NLP pipeline to a search engine?

The pos_tag field should be assigned a TEXT(stored=True, analyzer=pos_analyser) field type, not the bare analyzer itself.

So, in the schema, you should have:

schema = Schema(id=ID(stored=True, unique=True), stem_text=TEXT(stored=True, analyzer=my_analyzer), pos_tag=TEXT(stored=True, analyzer=pos_analyser))
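
For completeness, here is a minimal end-to-end sketch with the corrected schema (the index directory name and the sample sentence are just placeholders, and your custom Lemmatizer is left out for brevity):

import os
from whoosh import index, qparser
from whoosh.analysis import RegexTokenizer, StopFilter, LowercaseFilter, StemFilter
from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import MultifieldParser

# Analyzer chains; PosTagger is the custom filter defined above.
my_analyzer = RegexTokenizer() | StopFilter() | LowercaseFilter() | StemFilter()
pos_analyser = RegexTokenizer() | StopFilter() | LowercaseFilter() | PosTagger()

schema = Schema(id=ID(stored=True, unique=True),
                stem_text=TEXT(stored=True, analyzer=my_analyzer),
                pos_tag=TEXT(stored=True, analyzer=pos_analyser))

if not os.path.exists("indexdir"):   # placeholder directory name
    os.mkdir("indexdir")
ix = index.create_in("indexdir", schema)

writer = ix.writer()
sample = "The FAO is coordinating the response to the locust threat."  # placeholder sentence
writer.add_document(id="1", stem_text=sample, pos_tag=sample)
writer.commit()

with ix.searcher() as searcher:
    og = qparser.OrGroup.factory(0.9)
    query = MultifieldParser(["stem_text", "pos_tag"], schema=ix.schema, group=og).parse(
        "who is controlling the threat of locusts?")
    results = searcher.search(query, limit=10)  # results are sorted by score by default
    for hit in results:
        print(hit["id"], hit["stem_text"])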
