I have the following lines of code:
from haystack.document_stores import InMemoryDocumentStore, SQLDocumentStore
from haystack.nodes import TextConverter, PDFToTextConverter, PreProcessor
from haystack.utils import clean_wiki_text, convert_files_to_docs, fetch_archive_from_http, print_answers

# Raw string, so the backslashes in the Windows path are not read as escape sequences
doc_dir = r"C:\Users\abcd\Downloads\PDF Files"
docs = convert_files_to_docs(dir_path=doc_dir, clean_func=None, split_paragraphs=True)
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="passage",
    split_length=2)
doc = preprocessor.process(docs)
When I try to run it, I get the following error message:
NotImplementedError Traceback (most recent call last)
c:\Users\abcd\Downloads\solr9.ipynb Cell 27 in <cell line: 23>()
16 print(type(docs))
17 preprocessor = PreProcessor(
18 clean_empty_lines=True,
19 clean_whitespace=True,
20 clean_header_footer=True,
21 split_by="passage",
22 split_length=2)
---> 23 doc = preprocessor.process(docs)
File ~\AppData\Roaming\Python\Python39\site-packages\haystack\nodes\preprocessor\preprocessor.py:167, in PreProcessor.process(self, documents, clean_whitespace, clean_header_footer, clean_empty_lines, remove_substrings, split_by, split_length, split_overlap, split_respect_sentence_boundary, id_hash_keys)
165 ret = self._process_single(document=documents, id_hash_keys=id_hash_keys, **kwargs) # type: ignore
166 elif isinstance(documents, list):
--> 167 ret = self._process_batch(documents=list(documents), id_hash_keys=id_hash_keys, **kwargs)
168 else:
169 raise Exception("documents provided to PreProcessor.prepreprocess() is not of type list nor Document")
File ~\AppData\Roaming\Python\Python39\site-packages\haystack\nodes\preprocessor\preprocessor.py:225, in PreProcessor._process_batch(self, documents, id_hash_keys, **kwargs)
222 def _process_batch(
223 self, documents: List[Union[dict, Document]], id_hash_keys: Optional[List[str]] = None, **kwargs
224 ) -> List[Document]:
--> 225 nested_docs = [
226 self._process_single(d, id_hash_keys=id_hash_keys, **kwargs)
...
--> 324 raise NotImplementedError("'split_respect_sentence_boundary=True' is only compatible with split_by='word'.")
326 if type(document.content) is not str:
327 logger.error("Document content is not of type str. Nothing to split.")
NotImplementedError: 'split_respect_sentence_boundary=True' is only compatible with split_by='word'.
I never passed split_respect_sentence_boundary=True as an argument, and I set split_by="passage", not split_by='word'.
If I change it to split_by="sentence", I get the same error.
Please let me know if I am missing something here.
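As the PreProcessor API documentation shows, the default value of split_respect_sentence_boundary is True, so the check applies even when you do not pass the parameter yourself.
To make your code work, you should explicitly set split_respect_sentence_boundary=False: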
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="passage",
    split_length=2,
    split_respect_sentence_boundary=False)
I agree that this behavior is not intuitive. The node is currently undergoing a major refactoring.
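To see why the error appears even though the parameter was never passed, here is a minimal sketch of the default-argument pitfall. The functions check_split_options and try_options are hypothetical stand-ins written for illustration. They are not Haystack's code and only mimic the check implied by the error message in the traceback.

```python
# Hypothetical stand-in for illustration only; this is NOT Haystack's source code.
# It mimics the validation implied by the error message in the traceback.
def check_split_options(split_by: str, split_respect_sentence_boundary: bool = True) -> None:
    """Raise if the sentence-boundary option is combined with a non-word split."""
    if split_respect_sentence_boundary and split_by != "word":
        raise NotImplementedError(
            "'split_respect_sentence_boundary=True' is only compatible with split_by='word'."
        )


def try_options(**kwargs) -> str:
    """Return 'ok' or the error text, so the outcomes are easy to compare."""
    try:
        check_split_options(**kwargs)
        return "ok"
    except NotImplementedError as exc:
        return f"error: {exc}"


# The default (True) is applied silently, so this fails without the flag being passed.
print(try_options(split_by="passage"))
# Overriding the default explicitly lets the passage split through.
print(try_options(split_by="passage", split_respect_sentence_boundary=False))
# Keeping the default works only with a word-based split.
print(try_options(split_by="word"))
```

Under these assumptions, the first call fails and the other two succeed, which matches what the question describes: changing split_by to "sentence" does not help, while setting split_respect_sentence_boundary=False does.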