Getting data from Elasticsearch JSON documents into Python



How can I get query results as columns that preserve the hierarchy? Columns like these:

type|postDate|discussionTitle|courses|subjectKeywords|SentiStrength|SentiWordNet|universities|universityKeywords|
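One possible way to get such columns, sketched under the assumption that each search hit's `_source` is a nested dict shaped like the mapping in this question: flatten it into dotted column names (for example `CourseInfo.courses`), and optionally strip the parent prefix to get the bare names listed above. The function names and the sample document are made up for illustration.

```python
def flatten_source(source, sep="."):
    """Flatten a nested _source dict into {dotted_column_name: value}."""
    flat = {}
    for key, value in source.items():
        if isinstance(value, dict):
            # Recurse into sub-objects and prefix their keys with the parent name.
            for sub_key, sub_value in flatten_source(value, sep).items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


def leaf_columns(flat_row, sep="."):
    """Drop the parent prefix so columns carry only the leaf field names."""
    return {name.rsplit(sep, 1)[-1]: value for name, value in flat_row.items()}


# Hypothetical document, shaped like the mapping in the question.
example_source = {
    "postDate": "2016-01-01",
    "CourseInfo": {"courses": ["nursing"], "subjectKeywords": ["biology"]},
    "SentimentInfo": {"SentiStrength": 1.0, "SentiWordNet": 0.25},
}
row = flatten_source(example_source)
```

Keeping the dotted names preserves the hierarchy in the column header; `leaf_columns(row)` trades that for shorter names and is only safe when leaf names are unique across parents.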

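I have an Elasticsearch index with about 1,000,000 JSON documents, and I want to use this dataset with Python for natural language processing (NLP). Can someone help me understand how to get the data from Elasticsearch into Python and write the results back to Elasticsearch? I would really appreciate it: I'm stuck, because I can't run any NLP on the dataset until I can reach it from Python.

I also want to add a new entry to the hierarchy, alongside "UniversityInfo", called "ProcessInfo". It would be populated from a set of keywords I supply: each JSON document should store the labels whose keywords it contains. In other words, I want to tag each document with up to four labels or categories (Applications, Offers, Enrollment, Requirements), based on the keywords found in the document's post text.

This is what the Elasticsearch index structure looks like:

    {
       "educationforumsenriched2": {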
          "mappings": {
             "whirlpool": {
                "properties": {
                   "CourseInfo": {
                      "properties": {
                         "courses": {
                            "type": "string",
                            "index": "not_analyzed"
                         },
                         "subjectKeywords": {
                            "type": "string",
                            "index": "not_analyzed"
                         }
                      }
                   },
                   "SentimentInfo": {
                      "properties": {
                         "SentiStrength": {
                            "type": "float"
                         },
                         "SentiWordNet": {
                            "type": "float"
                         }
                      }
                   },
                   "UniversityInfo": {
                      "properties": {
                         "universities": {
                            "type": "string",
                            "index": "not_analyzed"
                         },
                         "universityKeywords": {
                            "type": "string",
                            "index": "not_analyzed"
                         }
                      }
                   },
                   "postDate": {
                      "type": "date",
                      "format": "strict_date_optional_time||epoch_millis"
                   },
                   "postID": {
                      "type": "integer"
                   },
                   "postText": {
                      "type": "string"
                   },
                   "references": {
                      "type": "string"
                   },
                   "threadID": {
                      "type": "integer"
                   },
                   "threadTitle": {
                      "type": "string"
                   }
                }
             },
             "atarnotes": {
                "properties": {
                   "CourseInfo": {
                      "properties": {
                         "courses": {
                            "type": "string",
                            "index": "not_analyzed"
                         },
                         "subjectKeywords": {
                            "type": "string",
                            "index": "not_analyzed"
                         }
                      }
                   },
                   "SentimentInfo": {
                      "properties": {
                         "SentiStrength": {
                            "type": "float"
                         },
                         "SentiWordNet": {
                            "type": "float"
                         }
                      }
                   },
                   "UniversityInfo": {
                      "properties": {
                         "universities": {
                            "type": "string",
                            "index": "not_analyzed"
                         },
                         "universityKeywords": {
                            "type": "string",
                            "index": "not_analyzed"
                         }
                      }
                   },
                   "discussionTitle": {
                      "type": "string"
                   },
                   "postDate": {
                      "type": "date",
                      "format": "strict_date_optional_time||epoch_millis"
                   },
                   "postID": {
                      "type": "integer"
                   },
                   "postText": {
                      "type": "string"
                   },
                   "query": {
                      "properties": {
                         "match_all": {
                            "type": "object"
                         }
                      }
                   },
                   "threadID": {
                      "type": "integer"
                   },
                   "threadTitle": {
                      "type": "string"
                   }
                }
             }
          }
       }
    }

This is the code I used in Java to create the process-information labels. I want to do the same thing in Python:

processMap.put("Applications", new ArrayList<>(Arrays.asList("apply", "applied", "applicant", "applying", "application", "applications")));
processMap.put("Offers", new ArrayList<>(Arrays.asList("offers", "offer", "offered", "offering")));
processMap.put("Enrollment", new ArrayList<>(Arrays.asList("enrolling", "enroled", "enroll", "enrolment", "enrollment", "enrol", "enrolled")));
processMap.put("Requirements", new ArrayList<>(Arrays.asList("requirement", "requirements", "require")));
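A rough Python counterpart of that map might look like the sketch below. The keyword lists are copied from the Java snippet; the matching rule (case-insensitive, whole-word) and the names `PROCESS_KEYWORDS` and `process_labels` are assumptions, since the Java code that actually applies the map is not shown.

```python
import re

# Same label -> keyword lists as the Java processMap.
PROCESS_KEYWORDS = {
    "Applications": ["apply", "applied", "applicant", "applying", "application", "applications"],
    "Offers": ["offers", "offer", "offered", "offering"],
    "Enrollment": ["enrolling", "enroled", "enroll", "enrolment", "enrollment", "enrol", "enrolled"],
    "Requirements": ["requirement", "requirements", "require"],
}


def process_labels(text):
    """Return the labels whose keywords occur as whole words in text.

    Assumed rule: lowercase the text, split it into alphabetic words,
    and keep a label if any of its keywords is among those words.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [
        label
        for label, keywords in PROCESS_KEYWORDS.items()
        if words.intersection(keywords)
    ]


labels = process_labels("I applied last week and got an offer, but what do they require?")
```

Whole-word matching avoids false hits such as "offer" inside "coffers"; if substring or stemmed matching is what the Java version did, the set intersection would need to be replaced accordingly.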

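With the Elasticsearch Python client, once a connection has been established, you only need to supply the DSL query and the index to search in order to retrieve the information you want. For example, if you have the query: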

GET educationforumsenriched2/_search
{
    "query": {
        "match" : {
            "CourseInfo.subjectKeywords" : "foo"
        }
    }
}

the equivalent in Python is:

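from elasticsearch import Elasticsearch

# hosts is a list of nodes; many other settings are available if using https and so on
es = Elasticsearch([{"host": "localhost", "port": 9200}])
query = {
    "query": {
        "match": {
            "CourseInfo.subjectKeywords": "foo"
        }
    }
}
res = es.search(index="educationforumsenriched2", body=query)
hits = res["hits"]["hits"]  # each hit holds the stored document under "_source"
# do some processing on the hits and build new_doc_after_processing (a dict)
# create a new document in ES; index() generates an id automatically,
# whereas create() requires an explicit id
# (clusters that still use mapping types, like the one above, also need doc_type=...)
es.index(index="educationforumsenriched2", body=new_doc_after_processing)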
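A single `search` call only returns one page of hits, so for around a million documents the client's `helpers.scan` and `helpers.bulk` are the usual tools. The following is a minimal sketch, not a tested pipeline: the field name `ProcessInfo.processes`, the functions `build_update_action` and `tag_all`, and the `label_fn` parameter (any function that maps post text to a list of labels) are assumptions made for illustration.

```python
def build_update_action(hit, labels):
    """Build one bulk 'update' action that adds labels to an existing document."""
    action = {
        "_op_type": "update",
        "_index": hit["_index"],
        "_id": hit["_id"],
        # Partial document merged into the stored one; the field name is hypothetical.
        "doc": {"ProcessInfo": {"processes": labels}},
    }
    if "_type" in hit:
        # Older clusters with mapping types (e.g. "whirlpool") need the type echoed back.
        action["_type"] = hit["_type"]
    return action


def tag_all(es, index, label_fn):
    """Stream every document with scan() and write labels back with bulk()."""
    # Imported here so build_update_action stays usable without the client installed.
    from elasticsearch import helpers

    hits = helpers.scan(es, index=index, query={"query": {"match_all": {}}})
    actions = (
        build_update_action(hit, label_fn(hit["_source"].get("postText", "")))
        for hit in hits
    )
    # Returns a (number_of_successes, errors) tuple.
    return helpers.bulk(es, actions)
```

Because `actions` is a generator, documents are read and written in batches rather than loaded into memory at once. A call would look like `tag_all(es, "educationforumsenriched2", my_label_function)`.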

Edit: just a thought, but if your processing isn't too complex, you could also consider building an Ingest Pipeline.
