Getting an exact count of distinct values for a combination of 2 fields in Elasticsearch

I have around 40 million records in my Elasticsearch index. I want to count the distinct values of the combination of two fields.

Example for a given set of documents:

[
{
"JobId" : 2,
"DesigId" : 12
},
{
"JobId" : 2,
"DesigId" : 4
},
{
"JobId" : 3,
"DesigId" : 5
},
{
"JobId" : 2,
"DesigId" : 4
},
{
"JobId" : 3,
"DesigId" : 5
}
]

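For the example above, I should get count = 3, because only 3 distinct combinations exist: [(2, 12), (2, 4), (3, 5)].

I tried the cardinality aggregation, but it returns an approximate count. I need the exact count.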
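To make the expected result concrete, here is a minimal Python sketch that computes the exact distinct count for the sample documents by collecting (JobId, DesigId) pairs in a set. It runs entirely in memory and is only meant to pin down what "exact count" means here; the function name `distinct_pair_count` is made up for illustration.

```python
# Sample documents copied from the question.
docs = [
    {"JobId": 2, "DesigId": 12},
    {"JobId": 2, "DesigId": 4},
    {"JobId": 3, "DesigId": 5},
    {"JobId": 2, "DesigId": 4},
    {"JobId": 3, "DesigId": 5},
]


def distinct_pair_count(documents):
    """Count distinct (JobId, DesigId) pairs exactly by collecting them in a set."""
    return len({(d["JobId"], d["DesigId"]) for d in documents})


print(distinct_pair_count(docs))  # prints 3
```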

Here is the query I used with the cardinality aggregation:

"aggs": {
"counts": {
"cardinality": {
"script": "doc['JobId'].value + ',' + doc['DesigId'].value",
"precision_threshold": 40000
}
}
}

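I also tried a composite aggregation over the two fields, paging with the after key and adding up the number of buckets, but that process was really time-consuming and my query timed out.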
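For reference, the composite-aggregation approach mentioned above can be sketched roughly as follows. This is an illustration, not the asker's actual code: `search` is a hypothetical callable that takes a request body and returns the parsed response, the source names `job` and `desig` and the aggregation name `pairs` are arbitrary, and `fake_search` is an in-memory stand-in used only so the loop can be tried without a cluster.

```python
def count_composite_buckets(search, page_size=1000):
    """Page through a composite aggregation and count every bucket.

    `search` is a hypothetical callable: request body (dict) -> response (dict).
    """
    total = 0
    after = None
    while True:
        composite = {
            "size": page_size,
            "sources": [
                {"job": {"terms": {"field": "JobId"}}},
                {"desig": {"terms": {"field": "DesigId"}}},
            ],
        }
        if after is not None:
            composite["after"] = after  # resume after the last key of the previous page
        body = {"size": 0, "aggs": {"pairs": {"composite": composite}}}
        agg = search(body)["aggregations"]["pairs"]
        total += len(agg["buckets"])
        after = agg.get("after_key")
        if not agg["buckets"] or after is None:
            return total


def fake_search(body):
    """In-memory stand-in for a real search call (assumption, for local testing only)."""
    pairs = sorted({(2, 4), (2, 12), (3, 5)})
    comp = body["aggs"]["pairs"]["composite"]
    after = comp.get("after")
    if after is not None:
        pairs = [p for p in pairs if p > (after["job"], after["desig"])]
    page = pairs[: comp["size"]]
    # doc_count is a placeholder here; only the number of buckets matters.
    agg = {"buckets": [{"key": {"job": j, "desig": d}, "doc_count": 1} for j, d in page]}
    if page:
        agg["after_key"] = agg["buckets"][-1]["key"]
    return {"aggregations": {"pairs": agg}}


print(count_composite_buckets(fake_search, page_size=2))  # prints 3
```

Each page costs one round trip, so with many distinct combinations the total time grows with the number of pages, which is consistent with the timeouts described above.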

Is there an optimal way to achieve this?

Scripts should be avoided because they hurt performance. For your use case, there are 3 ways to get the result you need:

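Use a composite aggregation (which you have already tried).
Use a multi_terms aggregation, although this is not a memory-efficient solution.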

Search Query:

{
"size": 0,
"aggs": {
"jobId_and_DesigId": {
"multi_terms": {
"terms": [
{
"field": "JobId"
},
{
"field": "DesigId"
}
]
}
}
}
}

Search Result:

"aggregations": {
"jobId_and_DesigId": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": [
2,
4
],
"key_as_string": "2|4",
"doc_count": 2
},
{
"key": [
3,
5
],
"key_as_string": "3|5",
"doc_count": 2
},
{
"key": [
2,
12
],
"key_as_string": "2|12",
"doc_count": 1
}
]
}
}
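With this approach the distinct count is the number of returned buckets. A small Python sketch of that step follows, using the response above as input. Note the assumption baked into it: multi_terms returns only the top buckets (its `size` parameter defaults to 10, as far as I know), so the bucket count is treated as exact only when `sum_other_doc_count` is 0. The helper name `exact_count_from_buckets` is hypothetical.

```python
def exact_count_from_buckets(agg):
    """Return the number of buckets, or None if the response looks truncated."""
    if agg.get("sum_other_doc_count", 0) > 0:
        return None  # some documents fell outside the returned buckets
    return len(agg["buckets"])


# The "jobId_and_DesigId" part of the search result shown above.
response_agg = {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
        {"key": [2, 4], "key_as_string": "2|4", "doc_count": 2},
        {"key": [3, 5], "key_as_string": "3|5", "doc_count": 2},
        {"key": [2, 12], "key_as_string": "2|12", "doc_count": 1},
    ],
}

print(exact_count_from_buckets(response_agg))  # prints 3
```

On a large index the aggregation's `size` would have to be at least the number of distinct combinations for this to work, which is where the memory cost mentioned above comes from.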

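Third, store the combined field value (i.e. the combination of "JobId" and "DesigId") at index time; this is the best approach. It can be done with a set processor in an ingest pipeline.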

PUT /_ingest/pipeline/concat
{
"processors": [
{
"set": {
"field": "combined_field",
"value": "{{JobId}} {{DesigId}}"
}
}
]
}
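To show what this pipeline is expected to do to each incoming document, here is a small Python imitation of the set processor. It is not Elasticsearch code: `apply_concat` is a made-up name, and the sketch assumes both fields are always present.

```python
def apply_concat(doc):
    """Mimic the set processor above: add combined_field = "<JobId> <DesigId>"."""
    enriched = dict(doc)  # leave the caller's document untouched
    enriched["combined_field"] = f"{doc['JobId']} {doc['DesigId']}"
    return enriched


print(apply_concat({"JobId": 2, "DesigId": 12})["combined_field"])  # prints 2 12
```

The space separator matters: without it, the pairs (2, 12) and (21, 2) would both collapse to "212" and be counted as one value.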

Index API:

When indexing documents, you need to add the pipeline=concat query parameter to every index request. An index API call would look like this:

POST _doc/1?pipeline=concat
{
"JobId": 2,
"DesigId": 12
}

Search Query:

{
"size": 0,
"aggs": {
"jobId_and_DesigId": {
"terms": {
"field": "combined_field.keyword"
}
}
}
}

Search Result:

"aggregations": {
"jobId_and_DesigId": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "2 4",
"doc_count": 2
},
{
"key": "3 5",
"doc_count": 2
},
{
"key": "2 12",
"doc_count": 1
}
]
}
}

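The cardinality aggregation only gives an approximate count. Raising precision_threshold will not help either: its maximum supported value is 40000, and with far more than 40K documents the number of distinct combinations can exceed that limit.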

You can use a scripted metric aggregation instead. It gives an exact count, but it is much slower than the cardinality aggregation.

{
"aggs": {
"Distinct_Count": {
"scripted_metric": {
"init_script": "state.list = []",
"map_script": """
state.list.add(doc['JobId'].value+'-'+doc['DesigId'].value);
""",
"combine_script": "return state.list;",
"reduce_script":"""
Map uniqueValueMap = new HashMap(); 
int count = 0;
for(shardList in states) {
if(shardList != null) { 
for(key in shardList) {
if(!uniqueValueMap.containsKey(key)) {
count +=1;
uniqueValueMap.put(key, key);
}
}
}
} 
return count;
"""
}
}
}
}
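The same map and reduce logic can be mirrored in plain Python to see how the pieces fit together. This is only an illustration of the flow, not something Elasticsearch runs: the split of the sample documents into `shard_a` and `shard_b` is invented, and a set stands in for the HashMap used in the reduce script.

```python
def map_shard(docs):
    """Per-shard step: one "JobId-DesigId" key per document (map + combine)."""
    return [f"{d['JobId']}-{d['DesigId']}" for d in docs]


def reduce_states(states):
    """Coordinating step: count the distinct keys across all shard lists."""
    seen = set()
    for shard_list in states:
        if shard_list is not None:  # a shard may return no state
            seen.update(shard_list)
    return len(seen)


# Hypothetical split of the sample documents across two shards.
shard_a = [{"JobId": 2, "DesigId": 12}, {"JobId": 2, "DesigId": 4}, {"JobId": 3, "DesigId": 5}]
shard_b = [{"JobId": 2, "DesigId": 4}, {"JobId": 3, "DesigId": 5}]

print(reduce_states([map_shard(shard_a), map_shard(shard_b)]))  # prints 3
```

Because every document contributes one key to its shard's list, the lists sent to the reduce step grow with the document count; that is a likely reason this approach is slow and memory-hungry on an index with 40 million records.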
