How can I get query results as columns while preserving the hierarchy? Columns like these:
type|postDate|discussionTitle|courses|subjectKeywords|SentiStrength|SentiWordNet|universities|universityKeywords|
I have an Elasticsearch cluster holding about 1,000,000 JSON documents. I want to use this dataset with Python for natural language processing (NLP). Could someone help me understand how to read the data from Elasticsearch into Python and write the results back to Elasticsearch? I would really appreciate it, because I am stuck: I cannot run any NLP on the dataset until I can reach it from Python.
I want to add a new entry to the hierarchy, similar to "UniversityInfo", called "ProcessInfo". It would tag the dataset based on a set of keywords I provide, so that each JSON document stores the labels whose keywords it contains. There are four labels or categories, namely Applications, Offers, Enrollment and Requirements, and each one is assigned based on the keywords found in the JSON document and in the post text. This is what the structure of the Elasticsearch index looks like:
{
  "educationforumsenriched2": {
    "mappings": {
      "whirlpool": {
        "properties": {
          "CourseInfo": {
            "properties": {
              "courses": {
                "type": "string",
                "index": "not_analyzed"
              },
              "subjectKeywords": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          },
          "SentimentInfo": {
            "properties": {
              "SentiStrength": {
                "type": "float"
              },
              "SentiWordNet": {
                "type": "float"
              }
            }
          },
          "UniversityInfo": {
            "properties": {
              "universities": {
                "type": "string",
                "index": "not_analyzed"
              },
              "universityKeywords": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          },
          "postDate": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          },
          "postID": {
            "type": "integer"
          },
          "postText": {
            "type": "string"
          },
          "references": {
            "type": "string"
          },
          "threadID": {
            "type": "integer"
          },
          "threadTitle": {
            "type": "string"
          }
        }
      },
      "atarnotes": {
        "properties": {
          "CourseInfo": {
            "properties": {
              "courses": {
                "type": "string",
                "index": "not_analyzed"
              },
              "subjectKeywords": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          },
          "SentimentInfo": {
            "properties": {
              "SentiStrength": {
                "type": "float"
              },
              "SentiWordNet": {
                "type": "float"
              }
            }
          },
          "UniversityInfo": {
            "properties": {
              "universities": {
                "type": "string",
                "index": "not_analyzed"
              },
              "universityKeywords": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          },
          "discussionTitle": {
            "type": "string"
          },
          "postDate": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          },
          "postID": {
            "type": "integer"
          },
          "postText": {
            "type": "string"
          },
          "query": {
            "properties": {
              "match_all": {
                "type": "object"
              }
            }
          },
          "threadID": {
            "type": "integer"
          },
          "threadTitle": {
            "type": "string"
          }
        }
      }
    }
  }
}
This is the code I used in Java to create the ProcessInfo labels. I want to do the same thing in Python:
processMap.put("Applications", new ArrayList<>(Arrays.asList("apply", "applied", "applicant", "applying", "application", "applications")));
processMap.put("Offers", new ArrayList<>(Arrays.asList("offers", "offer", "offered", "offering")));
processMap.put("Enrollment", new ArrayList<>(Arrays.asList("enrolling","enroled","enroll", "enrolment", "enrollment","enrol","enrolled")));
processMap.put("Requirements", new ArrayList<>(Arrays.asList("requirement","requirements", "require")));
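A rough Python counterpart of the Java map above could look like the sketch below. The keyword lists are copied from the Java snippet; the function name `tag_process_info`, the whole-word matching rule, and the `ProcessInfo` / `processes` field names are assumptions made for illustration, since the Java code does not show how matching or storage was done.

```python
import re

# Same keyword lists as the Java processMap above.
PROCESS_KEYWORDS = {
    "Applications": ["apply", "applied", "applicant", "applying",
                     "application", "applications"],
    "Offers": ["offers", "offer", "offered", "offering"],
    "Enrollment": ["enrolling", "enroled", "enroll", "enrolment",
                   "enrollment", "enrol", "enrolled"],
    "Requirements": ["requirement", "requirements", "require"],
}


def tag_process_info(text):
    """Return the sorted process labels whose keywords occur in `text`.

    Matching is on whole lowercase words (an assumption; the Java snippet
    does not show the original matching rule).
    """
    words = set(re.findall(r"[a-z]+", (text or "").lower()))
    return sorted(
        label
        for label, keywords in PROCESS_KEYWORDS.items()
        if words.intersection(keywords)
    )


# Made-up post text, only to show the call.
post_text = "I applied last week and got an offer before enrolment opened."

# Hypothetical field layout, mirroring CourseInfo / UniversityInfo.
process_info = {"ProcessInfo": {"processes": tag_process_info(post_text)}}
print(process_info)
```

Whole-word matching keeps "offer" from firing on unrelated words that merely contain it; if substring matching is wanted instead, the `words.intersection` test would need to be replaced.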
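With the Elasticsearch Python client, once a connection has been established, you only need to supply the DSL query and the index to search in order to retrieve the information you want. For example, if you have this query: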
GET educationforumsenriched2/_search
{
  "query": {
    "match": {
      "CourseInfo.subjectKeywords": "foo"
    }
  }
}
The equivalent in Python is:
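from elasticsearch import Elasticsearch

# hosts must be passed as a list; many other settings are available (HTTPS, auth, and so on)
es = Elasticsearch([{"host": "localhost", "port": 9200}])

query = {
    "query": {
        "match": {
            "CourseInfo.subjectKeywords": "foo"
        }
    }
}
res = es.search(index="educationforumsenriched2", body=query)

# do some processing on res["hits"]["hits"] to build new_doc_after_processing

# index the new document in ES; clusters that still use mapping types
# (as in the mapping above) also need doc_type, e.g. "whirlpool" or "atarnotes"
es.index(index="educationforumsenriched2", doc_type="whirlpool", body=new_doc_after_processing)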
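For the pipe-separated columns mentioned in the question, the processing step could flatten each hit's nested `_source` into one row. This is only a sketch: `flatten_hit` and `to_line` are hypothetical helper names, the sample hit is made up, and it assumes that the `type` column refers to the hit's `_type` (whirlpool or atarnotes).

```python
COLUMNS = [
    "type", "postDate", "discussionTitle", "courses", "subjectKeywords",
    "SentiStrength", "SentiWordNet", "universities", "universityKeywords",
]


def flatten_hit(hit):
    """Flatten one search hit into a dict keyed by the leaf field names."""
    row = {"type": hit.get("_type")}  # assumption: "type" means the mapping type

    def walk(node):
        for key, value in node.items():
            if isinstance(value, dict):
                walk(value)  # descend into CourseInfo, SentimentInfo, ...
            else:
                row[key] = value

    walk(hit.get("_source", {}))
    return row


def to_line(row, columns=COLUMNS, sep="|"):
    """Render a row as one delimited line; list values are joined with commas."""
    cells = []
    for name in columns:
        value = row.get(name, "")
        if isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        cells.append("" if value is None else str(value))
    return sep.join(cells)


# Made-up hit shaped like the mapping in the question.
sample_hit = {
    "_type": "atarnotes",
    "_source": {
        "postDate": "2016-01-05",
        "discussionTitle": "VTAC offers",
        "CourseInfo": {"courses": ["Arts"], "subjectKeywords": ["history"]},
        "SentimentInfo": {"SentiStrength": 1.0, "SentiWordNet": 0.25},
        "UniversityInfo": {"universities": ["Monash"],
                           "universityKeywords": ["clayton"]},
    },
}
print(to_line(flatten_hit(sample_hit)))
```

With a real response, the same two calls would be applied to every element of `res["hits"]["hits"]`. Columns that a document lacks (for example `discussionTitle` on whirlpool posts) come out as empty cells.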
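Edit: just a thought, but if your processing is not too complex, you could also consider building an ingest pipeline (ingest pipelines are available from Elasticsearch 5.0 onwards).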
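If the processing stays in Python, keep in mind that a single search call returns only the first page of hits (10 by default), so for around a million documents the client's `helpers.scan` and `helpers.bulk` are a better fit. The sketch below is written under assumptions: `build_update_actions` and `retag_index` are hypothetical names, `ProcessInfo` / `processes` is an illustrative field layout, the URL is a local default, and `tagger` is whatever function turns the post text into a list of labels.

```python
def build_update_actions(hits, tagger):
    """Yield one partial-update bulk action per hit.

    `tagger` is any callable mapping the post text to a list of labels.
    "ProcessInfo" / "processes" are hypothetical field names.
    """
    for hit in hits:
        labels = tagger(hit["_source"].get("postText", ""))
        action = {
            "_op_type": "update",
            "_index": hit["_index"],
            "_id": hit["_id"],
            "doc": {"ProcessInfo": {"processes": labels}},
        }
        if "_type" in hit:  # mapping types are only present on older clusters
            action["_type"] = hit["_type"]
        yield action


def retag_index(index, tagger, url="http://localhost:9200"):
    """Scroll through every document in `index` and write the labels back."""
    # Imported here so build_update_actions can be used and tested
    # without the client library installed.
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch([url])
    hits = helpers.scan(es, index=index, query={"query": {"match_all": {}}})
    # helpers.bulk consumes the generator lazily and sends the actions in chunks.
    return helpers.bulk(es, build_update_actions(hits, tagger))


# Example call (needs a running cluster, so it is left commented out):
# retag_index("educationforumsenriched2",
#             lambda text: ["Offers"] if "offer" in text.lower() else [])
```

Using partial updates keeps the existing fields of each document untouched and only adds the new labels; because both helpers work on generators, the full dataset never has to fit in memory.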