How to ingest a PDF into Elasticsearch



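I added the ingest attachment processor plugin to Elasticsearch.

Then I created a very simple PDF file.

I tried to ingest this file (its content) into Elasticsearch (see the commands below).

But searching for a word from the file fails (see the third response near the end of the commands).

What went wrong?

Do I need to add some pipeline?

Is the PUT for the PDF correct, and do I need to put the PDF content into the content field of the PUT command?

Console commands...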

Console 1:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
      }
    }
  ]
}

Response 1:

{
  "acknowledged" : true
}

Console 2:

PUT my_index/_doc/001?pipeline=attachment
{
  "filename": "C:\\ELK-Stack\\Test.pdf",
  "data": "VGVzdA0KVGVzdCBEb2t1bWVudCB1bWdld2FuZGVsdCB2b24gd28NCkhpZXIgd2lyZCBnZXRlc3RldC4gRGFzIGlzdCBkZXIgVGVzdA==",
  "attachment": {
    "content_type": "application/rtf",
    "language": "ro",
    "content": "Test Test Dokument umgewandelt von word zu pdf. Hier wird getestet. Das ist der Test."
  },
  "title": "Quick"
}

Response 2:

{
  "_index" : "my_index",
  "_id" : "001",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}
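
On the question of whether the PDF content has to go into the request by hand: as far as I understand the attachment processor, it reads only the base64-encoded file bytes from the `data` field and writes the extracted text to `attachment.content` itself, so an `attachment` object supplied in the request is replaced (the `content_type` and `language` stored in the index differ from the ones sent). Below is a minimal Python sketch of preparing such a body. The helper name `build_document` is made up for illustration. The sketch also decodes the `data` value sent above, which turns out to be plain text rather than the bytes of a PDF file.

```python
import base64
import json


def build_document(file_bytes: bytes, filename: str) -> dict:
    """Build a body for PUT my_index/_doc/<id>?pipeline=attachment (hypothetical helper)."""
    return {
        "filename": filename,
        # The attachment processor expects the raw file bytes, base64-encoded.
        "data": base64.b64encode(file_bytes).decode("ascii"),
    }


# The "data" value from the request above, decoded to see what was actually sent.
SENT_DATA = (
    "VGVzdA0KVGVzdCBEb2t1bWVudCB1bWdld2FuZGVsdCB2b24gd28NCkhpZXIgd2lyZCBn"
    "ZXRlc3RldC4gRGFzIGlzdCBkZXIgVGVzdA=="
)
decoded = base64.b64decode(SENT_DATA)

# PDF files start with the marker "%PDF"; the decoded payload does not.
is_pdf = decoded.startswith(b"%PDF")

if __name__ == "__main__":
    print(decoded.decode("ascii"))
    print("looks like a PDF:", is_pdf)
    # For a real file, read it in binary mode and pass the bytes instead,
    # e.g. build_document(open(path, "rb").read(), path).
    print(json.dumps(build_document(b"%PDF-1.4 placeholder bytes", "Test.pdf")))
```

To index a real PDF, the bytes read from the file in binary mode would take the place of the placeholder bytes, and the resulting dictionary would be sent as the JSON body of the PUT request.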

Console 3:

GET /my_index/_search
{
  "query": {
    "match": {
      "content": "Test"
    }
  }
}

Response 3:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

Console 4:

GET /_search
{
  "query": {
    "match_all": {}
  }
}

Response 4:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_index",
        "_id" : "001",
        "_score" : 1.0,
        "_source" : {
          "filename" : """C:\ELK-Stack\Test.pdf""",
          "data" : "VGVzdA0KVGVzdCBEb2t1bWVudCB1bWdld2FuZGVsdCB2b24gd28NCkhpZXIgd2lyZCBnZXRlc3RldC4gRGFzIGlzdCBkZXIgVGVzdA==",
          "attachment" : {
            "content_type" : "text/plain; charset=windows-1252",
            "language" : "et",
            "content" : """Test
Test Dokument umgewandelt von wo
Hier wird getestet. Das ist der Test""",
            "content_length" : 77
          },
          "title" : "Quick"
        }
      }
    ]
  }
}

Thanks to LeBigCat, I found the solution.

I needed to give the full path to the field,

using: "attachment.content": "Test"

(instead of "content": "Test")

GET /my_index/_search
{
  "query": {
    "match": {
      "attachment.content": "Test"
    }
  }
}
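
Why the short field name finds nothing can be pictured with a small Python sketch. It is not how Elasticsearch resolves fields internally; it only mimics the idea that a field nested inside an object is addressed by its full dotted path. The helper `get_field` and the trimmed `source` dictionary are illustrative only.

```python
from typing import Any, Optional


def get_field(source: dict, path: str) -> Optional[Any]:
    """Walk a dotted path such as "attachment.content" through nested dicts."""
    current: Any = source
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None  # the path does not exist in this document
        current = current[key]
    return current


# Trimmed copy of the _source returned by the match_all search.
source = {
    "attachment": {
        "content": "Test\nTest Dokument umgewandelt von wo\nHier wird getestet. Das ist der Test",
        "content_length": 77,
    },
    "title": "Quick",
}

if __name__ == "__main__":
    # There is no top-level "content" field, so the short name resolves to nothing.
    print(get_field(source, "content"))
    # The extracted text lives under the "attachment" object.
    print(get_field(source, "attachment.content"))
```

In the same way, a `match` query on `content` has no field to search in this document, while `attachment.content` points at the text the processor extracted.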
