I am getting the following error:
"AttributeError: 'list' object has no attribute 'similarity'"
when trying to run my code. I'm running an NLP Q&A pipeline with Haystack. I recently tried to integrate the Pinecone vector database, and that's when the error appeared.
The pipeline up to the point of the error is basically: initialize the Pinecone document store -> pass the data in to convert it into Haystack-compatible documents -> preprocess the documents -> pass them to Haystack's DensePassageRetriever.
To simplify things, I've gathered the various modules and put all the code into a single executable Python file, shared below:
import logging
import os
from haystack.document_stores import InMemoryDocumentStore
from haystack.pipelines.standard_pipelines import TextIndexingPipeline
import asyncio
import time
#from kai.pinecone_system import initiate_pinecone
from haystack import Pipeline
from haystack.document_stores import PineconeDocumentStore
####REMOVE
def initiate_pinecone():
    print("Testing Pinecone")
    ENV = "eu-west1-gcp"
    API = "fake-api-key"
    document_store = PineconeDocumentStore(
        api_key=API,
        index='esmo',
        environment=ENV,
    )
    return document_store
####REMOVE
# LOGGING
logging.basicConfig(
    format="%(levelname)s - %(name)s - %(message)s", level=logging.WARNING
)
logging.getLogger("haystack").setLevel(logging.INFO)
# DOC STORE
document_store = initiate_pinecone()
from haystack.nodes import TextConverter, PDFToTextConverter, DocxToTextConverter, PreProcessor
from haystack.utils import convert_files_to_docs
# DATA to DOCS
doc_dir = "data/esmo"
#converter = TextConverter(remove_numeric_tables=True, valid_languages=["en"])
#doc_txt = converter.convert(file_path="data/esmo", meta=None)[0]
all_docs = convert_files_to_docs(dir_path=doc_dir)
# PRE-PROCESSOR
from haystack.nodes import PreProcessor
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=150,
    split_respect_sentence_boundary=True,
    split_overlap=0
)
processed_esmo_docs = preprocessor.process(all_docs)
print(f"n_files_input: {len(all_docs)}\nn_docs_output: {len(processed_esmo_docs)}")
print(processed_esmo_docs[0])
# write document objects into document store
import os
from haystack.pipelines.standard_pipelines import TextIndexingPipeline
files_to_index = [doc_dir + "/" + f for f in os.listdir(doc_dir)]
indexing_pipeline = TextIndexingPipeline(document_store)
indexing_pipeline.run_batch(file_paths=files_to_index)
from haystack.nodes import DensePassageRetriever
retriever = DensePassageRetriever(
    document_store=processed_esmo_docs,
    #document_store=all_docs,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    max_seq_len_query=64,
    max_seq_len_passage=256,
    batch_size=2,
    use_gpu=True,
    embed_title=True,
    use_fast_tokenizers=True)
##INITIALIZE READER
from haystack.nodes import FARMReader
reader = FARMReader(model_name_or_path="michiyasunaga/BioLinkBERT-large", use_gpu=True)
##GET PIPELINE UP (RETRIEVER / READER)
from haystack.pipelines import ExtractiveQAPipeline
pipe = ExtractiveQAPipeline(reader, retriever)
prediction = ""
Thanks in advance for any advice.
I've tried switching the Pinecone vector DB between cosine and dot product similarity. I've also changed the preprocessing, and removed it entirely, with no effect. I understand the document store is expected to have an attribute called similarity, but I'm not sure what exactly that is.
I edited your question for security reasons.
Anyway, I believe your instantiation is incorrect.
As you can see in the docs, DensePassageRetriever.__init__
expects a document_store
parameter, which should be the document store to query; instead, you mistakenly passed in the preprocessed documents.
You should try initializing the retriever as follows:
retriever = DensePassageRetriever(
    document_store=document_store,
    ...)
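To see why the traceback specifically mentions `similarity`: the retriever reads a `similarity` attribute from whatever object you pass as document_store, and a plain list of documents has no such attribute. Here is a minimal stand-alone sketch of that mechanism (the class and function names are illustrative, not Haystack's actual internals):

```python
class FakeDocumentStore:
    """Stand-in for a Haystack document store; real stores expose `similarity`."""
    similarity = "dot_product"


def init_retriever(document_store):
    # Simplified: the real DensePassageRetriever.__init__ likewise consults
    # document_store.similarity, so a list blows up at this exact point.
    return {"similarity": document_store.similarity}


# Correct: pass the store object itself
print(init_retriever(FakeDocumentStore())["similarity"])  # dot_product

# Buggy: pass a list of documents, as in your code
try:
    init_retriever([{"content": "a doc"}])
except AttributeError as e:
    print(e)  # 'list' object has no attribute 'similarity'
```

So the fix is just to pass the PineconeDocumentStore instance (`document_store` in your script) to the retriever; the preprocessed documents go into the store via the indexing pipeline, not into the retriever's constructor.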