Python 从 Elasticsearch 结果创建数据帧



我有来自 elasticsearch 的查询结果,格式如下:

[
{
"_index": "product",
"_type": "_doc",
"_id": "23234sdf",
"_score": 2.2295187,
"_source": {
"SERP_KEY": "",
"r_variant_info": "",
"s_asin": "",
"pid": "394",
"r_gtin": "00838128000547",        
"additional_attributes_remarks": "publisher:0|size:0",            
"s_gtin": "",            
"r_category": "",
"confidence_score": "2.4545",      
"title_match": "45.45"
}
},
{
"_index": "product",
"_type": "_doc",
"_id": "23234sdf",
"_score": 2.2295187,
"_source": {
"SERP_KEY": "",
"r_variant_info": "",
"s_asin": "",
"pid": "394",
"r_gtin": "00838128000547",        
"additional_attributes_remarks": "publisher:0|size:0",            
"s_gtin": "",            
"r_category": "",
"confidence_score": "2.4545",      
"title_match": "45.45"
}
},
]

我正在尝试将_source字段以及_id也加载到数据框中。

我试过这个:

def fetch_records_from_elasticsearch_index(index, filter_json):
search_param = prepare_es_body(filter_json_dict=filter_json)
response = settings.ES.search(index=index, body=search_param, size=10)
if len(response['hits']['hits']) > 0:
import pandas as pd
all_hits = response['hits']['hits']
# return all_hits
# export es hits to pandas dataframe
df = pd.concat(map(pd.DataFrame.from_dict, all_hits), axis=1)['_source'].T
return df
else:
return 0

df只包含_source字段,但我也想_id字段添加到其中。

以下是 df 输出格式:

{
"AdminEdit": [
"False",
"False",
"False",
"False",        
],
"Group": [
"Grp2",
"Grp2",
"Grp2",
"Grp2"       
],
}

如何向其添加_id

有两种方法可以解决这个问题:

  1. 直接代码

    import pandas as pd
    df = pd.json_normalize(all_hits)
    
  2. 代码改进

    import json
    import pandas as pd
    df = pd.concat(map(pd.DataFrame.from_dict, all_hits), axis=1)['_source'].T
    df["_id"] = [i["_id"] for i in all_hits]
    

使用的 JSON 是:

all_hits = [
{
"_index": "product",
"_type": "_doc",
"_id": "23234sdg",
"_score": 2.2295187,
"_source": {
"SERP_KEY": "",
"r_variant_info": "",
"s_asin": "",
"pid": "394",
"r_gtin": "00838128000547",        
"additional_attributes_remarks": "publisher:0|size:0",            
"s_gtin": "",            
"r_category": "",
"confidence_score": "2.4545",      
"title_match": "45.45"
}
},
{
"_index": "product",
"_type": "_doc",
"_id": "23234sdf",
"_score": 2.2295187,
"_source": {
"SERP_KEY": "",
"r_variant_info": "",
"s_asin": "",
"pid": "394",
"r_gtin": "00838128000547",        
"additional_attributes_remarks": "publisher:0|size:0",            
"s_gtin": "",            
"r_category": "",
"confidence_score": "2.4545",      
"title_match": "45.45"
}
},
]

我试过这个:

response = '''
[
{
"_index": "product",
"_type": "_doc",
"_id": "23234sdf",
"_score": 2.2295187,
"_source": {
"SERP_KEY": "",
"r_variant_info": "",
"s_asin": "",
"pid": "394",
"r_gtin": "00838128000547",        
"additional_attributes_remarks": "publisher:0|size:0",            
"s_gtin": "",            
"r_category": "",
"confidence_score": "2.4545",      
"title_match": "45.45"
}
},
{
"_index": "product",
"_type": "_doc",
"_id": "23234sdf",
"_score": 2.2295187,
"_source": {
"SERP_KEY": "",
"r_variant_info": "",
"s_asin": "",
"pid": "394",
"r_gtin": "00838128000547",        
"additional_attributes_remarks": "publisher:0|size:0",            
"s_gtin": "",            
"r_category": "",
"confidence_score": "2.4545",      
"title_match": "45.45"
}
}
]
'''
from pandas.io import json as js
import json
data = json.loads(response)
df = js.json_normalize(data)
print(df.columns)

以下是您在最终数据帧中获得的列:

Index(['_id', '_index', '_score', '_source.SERP_KEY',
'_source.additional_attributes_remarks', '_source.confidence_score',
'_source.pid', '_source.r_category', '_source.r_gtin',
'_source.r_variant_info', '_source.s_asin', '_source.s_gtin',
'_source.title_match', '_type'],
dtype='object')

最新更新