Out-of-memory error when using aggregate with $group



I have a collection named "allvoice" with the following structure:

{
"_id" : ObjectId("612599bb1cff80e6fc5cbf38"),
"subscriber_id" : "e3365edb9c781a561107242a81c1a92b4269ef9a",
"callednumber" : "559198e6f8814773551a457e53a614d603f9deab",
"originaltimestamp" : "20200113033803",
"duration" : "13",
"maincost" : "255.6",
"type" : "Internal",
"type_network" : "local_network",
"month" : "202001"
}

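The "originaltimestamp" field holds the time of each transaction within a day. To make the large data set easier to query by day, I used "aggregate" with "$group" to group the subscribers' transactions by date (daily transactions), and I write the result to a collection named "dailyvoice". From then on, whenever I need transaction data for a given date, I query the "dailyvoice" collection directly. This is my query: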

db.getCollection('allvoice').aggregate(
[ 
{ 
"$project": {
"date": { "$toDate": "$originaltimestamp" },
"subscriber_id":1,
"callednumber":1,
"originaltimestamp":1,
"duration": 1,
"maincost": 1,
"type": 1,
"type_network": 1,
"month":1
}},

{ "$group": {
"_id": { "$dateToString": { "format": "%Y-%m-%d", "date": "$date" } },
"data": { $push: "$$ROOT" } ,
"count": { "$sum": 1 }
}},
{
'$out': 'dailyvoice' 
}
],  { allowDiskUse: true }
)

The resulting "dailyvoice" collection looks like this:

{
"_id" : "2020-01-13",
"data" : [ 
{
"_id" : ObjectId("612599bb1cff80e6fc5cbf38"),
"subscriber_id" : "e3365edb9c781a561107242a81c1a92b4269ef9a",
"callednumber" : "559198e6f8814773551a457e53a614d603f9deab",
"originaltimestamp" : "20200113033803",
"duration" : "13",
"maincost" : "255.6",
"type" : "trong nuoc",
"type_network" : "local_network",
"month" : "202001",
"date" : ISODate("2020-01-13T03:38:03.000Z")
}, 
{
"_id" : ObjectId("612599bb1cff80e6fc5cbf39"),
"subscriber_id" : "6cf5d711bfa12160eefe62b8bc9c914370eebd70",
"callednumber" : "0241052d42e5491b0529733716fb6fb04804248f",
"originaltimestamp" : "20200113041608",
"duration" : "28",
"maincost" : "644.0",
"type" : "trong nuoc",
"type_network" : "global_network",
"month" : "202001",
"date" : ISODate("2020-01-13T04:16:08.000Z")
}, 
{
"_id" : ObjectId("612599bb1cff80e6fc5cbf3a"),
"subscriber_id" : "3e554a5a920c469da9faf7375c5265c5cf6fb696",
"callednumber" : "307219a71c028931a4b74f8f5f014ffa16005ee9",
"originaltimestamp" : "20200113051416",
"duration" : "202",
"maincost" : "2961.4",
"type" : "trong nuoc",
"type_network" : "local_network",
"month" : "202001",
"date" : ISODate("2020-01-13T05:14:16.000Z")
}
],
"count" : 3.0
}

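The problem is that the query works well when "allvoice" holds a small data set, but when it holds a large one, about 114,513,872 records (documents), the query fails with a memory overflow ("PlanExecutor error during aggregation"). Is there a better solution than increasing the server's resources? Please help me find a way to optimize my query. Thank you very much!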

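After changing the query to run one month at a time, I still get the error "PlanExecutor error during aggregation :: caused by :: BSONObj size". This is the updated query: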

db.getCollection('allvoice').aggregate(
[
{ $match: { month: "202001" } },
{
"$group": {
"_id": {
"$dateToString": {
"format": "%Y-%m-%d", "date": { "$toDate": "$originaltimestamp" }
}
},
"data": {
$push: {
"subscriber_id": "$subscriber_id",
"type": "$type",
// "originaltimestamp":"$originaltimestamp"
"date": { "$toDate": "$originaltimestamp" },
"month": "$month"
}
},
"count": { "$sum": 1 }
}
},

{
'$out': 'dailyvoice_202001'
}
], { allowDiskUse: true }
)

Some thoughts:

You don't need the first $project stage in your query. Instead, you can put { "$toDate": "$originaltimestamp" } directly inside the _id of the $group stage, like this:

"_id": { 
"$dateToString": { 
"format": "%Y-%m-%d", "date": { "$toDate": "$originaltimestamp" } 
} 
}

About $push: "$$ROOT": instead of pushing $$ROOT, capture only the fields you need most (the important ones). This reduces memory usage. For example:

"data": { 
$push: { 
"subscriber_id": "$subscriber_id",
"type": "$type",
// other required fields...
} 
}
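If the per-day array is still too large even with fewer fields, note that MongoDB limits a single BSON document to 16 MB, and with $push every call of a day ends up in one output document. The "BSONObj size" error in the question may point to that limit. One possible way around it is to skip $push entirely and write one flat document per call with a precomputed day field. This is only a sketch: it assumes daily lookups can be served by an indexed field, and the names `day` and `dailyvoice_flat` are hypothetical, not taken from the question.

```javascript
// Sketch: one output document per call instead of one large array per day.
// `day` and `dailyvoice_flat` are illustrative names, not from the question.
const flatDailyPipeline = [
  {
    $project: {
      subscriber_id: 1,
      type: 1,
      month: 1,
      // Same conversion the question already uses.
      date: { $toDate: "$originaltimestamp" },
      // Precomputed day string, e.g. "2020-01-13", used for daily lookups.
      day: {
        $dateToString: {
          format: "%Y-%m-%d",
          date: { $toDate: "$originaltimestamp" }
        }
      }
    }
  },
  { $out: "dailyvoice_flat" }
];

// In the mongo shell (not executed here):
// db.getCollection("allvoice").aggregate(flatDailyPipeline, { allowDiskUse: true });
// db.getCollection("dailyvoice_flat").createIndex({ day: 1 });
// db.getCollection("dailyvoice_flat").find({ day: "2020-01-13" });
```

With this layout no single document grows with the number of calls per day, at the cost of reading many small documents instead of one when fetching a day.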

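Finally, you can consider restricting the query to a set of dates. This means running the query several times, once per date range, but I think it may work better. For example, match one month at a time on the month field. You can also create an index on the month field for performance. This requires adding a $match stage at the start of the pipeline (as the first stage), for example: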

{ $match: { month: "202001" } }

This queries only the data for January 2020.
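The month-by-month idea can be sketched as a small helper that builds one pipeline per month. This is a rough illustration, not a tested solution: `buildMonthlyPipeline` and the `months` list are hypothetical, and the output collection name follows the `dailyvoice_202001` pattern from the question.

```javascript
// Hypothetical helper: builds the pipeline for a single month.
function buildMonthlyPipeline(month) {
  return [
    // First stage: restrict to one month so an index on `month` can be used.
    { $match: { month: month } },
    {
      $group: {
        _id: {
          $dateToString: {
            format: "%Y-%m-%d",
            date: { $toDate: "$originaltimestamp" }
          }
        },
        // Push only the fields that are really needed, not $$ROOT.
        data: { $push: { subscriber_id: "$subscriber_id", type: "$type" } },
        count: { $sum: 1 }
      }
    },
    { $out: "dailyvoice_" + month }
  ];
}

// Assumed list of months; adjust it to the months present in the data.
const months = ["202001", "202002", "202003"];
const pipelines = months.map(buildMonthlyPipeline);

// In the mongo shell (not executed here):
// db.getCollection("allvoice").createIndex({ month: 1 });
// months.forEach(function (m) {
//   db.getCollection("allvoice").aggregate(buildMonthlyPipeline(m), { allowDiskUse: true });
// });
```

Each run then writes its own collection, so a failure in one month does not affect the results already produced for the others.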
