我有一个名为"allvoice"它的结构如下:
{
"_id" : ObjectId("612599bb1cff80e6fc5cbf38"),
"subscriber_id" : "e3365edb9c781a561107242a81c1a92b4269ef9a",
"callednumber" : "559198e6f8814773551a457e53a614d603f9deab",
"originaltimestamp" : "20200113033803",
"duration" : "13",
"maincost" : "255.6",
"type" : "Internal",
"type_network" : "local_network",
"month" : "202001"
}
带有字段"originaltimestamp"由于是一天中数据的交易时段,为了便于对白天的大数据进行查询,我使用了查询& aggregate&;和";group"按日期获取订阅者按天的交易(每日交易),然后我写一个名为"dailyvoice"的集合。从现在开始按日期查询交易数据,我会直接在collection "dailyvoice"中查询。这是我的查询。
db.getCollection('allvoice').aggregate(
[
{
"$project": {
"date": { "$toDate": "$originaltimestamp" },
"subscriber_id":1,
"callednumber":1,
"originaltimestamp":1,
"duration": 1,
"maincost": 1,
"type": 1,
"type_network": 1,
"month":1
}},
{ "$group": {
"_id": { "$dateToString": { "format": "%Y-%m-%d", "date": "$date" } },
"data": { $push: "$$ROOT" } ,
"count": { "$sum": 1 }
}},
{
'$out': 'dailyvoice'
}
], { allowDiskUse: true }
)
和"dailyvoice"集合如下:
{
"_id" : "2020-01-13",
"data" : [
{
"_id" : ObjectId("612599bb1cff80e6fc5cbf38"),
"subscriber_id" : "e3365edb9c781a561107242a81c1a92b4269ef9a",
"callednumber" : "559198e6f8814773551a457e53a614d603f9deab",
"originaltimestamp" : "20200113033803",
"duration" : "13",
"maincost" : "255.6",
"type" : "trong nuoc",
"type_network" : "local_network",
"month" : "202001",
"date" : ISODate("2020-01-13T03:38:03.000Z")
},
{
"_id" : ObjectId("612599bb1cff80e6fc5cbf39"),
"subscriber_id" : "6cf5d711bfa12160eefe62b8bc9c914370eebd70",
"callednumber" : "0241052d42e5491b0529733716fb6fb04804248f",
"originaltimestamp" : "20200113041608",
"duration" : "28",
"maincost" : "644.0",
"type" : "trong nuoc",
"type_network" : "global_network",
"month" : "202001",
"date" : ISODate("2020-01-13T04:16:08.000Z")
},
{
"_id" : ObjectId("612599bb1cff80e6fc5cbf3a"),
"subscriber_id" : "3e554a5a920c469da9faf7375c5265c5cf6fb696",
"callednumber" : "307219a71c028931a4b74f8f5f014ffa16005ee9",
"originaltimestamp" : "20200113051416",
"duration" : "202",
"maincost" : "2961.4",
"type" : "trong nuoc",
"type_network" : "local_network",
"month" : "202001",
"date" : ISODate("2020-01-13T05:14:16.000Z")
}
],
"count" : 3.0
}
这里的问题是,如果集合"allvoice"有一个小的数据集,查询语句工作良好,但当集合"allvoice";有一个很大的数据集,大约有114513872条记录(文档),查询遇到了语句溢出("planexecutor error during aggreation")。有没有比增加服务器配置更好的解决方案?请看看找到一种方法来优化我的查询!非常感谢
我将查询优化为按月查询后,结果是我仍然得到错误:" planexecutor错误在聚合期间::由::BSONObj size. ">
db.getCollection('allvoice').aggregate(
[
{ $match: { month: "202001" } },
{
"$group": {
"_id": {
"$dateToString": {
"format": "%Y-%m-%d", "date": { "$toDate": "$originaltimestamp" }
}
},
"data": {
$push: {
"subscriber_id": "$subscriber_id",
"type": "$type",
// "originaltimestamp":"$originaltimestamp"
"date": { "$toDate": "$originaltimestamp" },
"month": "$month"
}
},
"count": { "$sum": 1 }
}
},
{
'$out': 'dailyvoice_202001'
}
], { allowDiskUse: true }
)
一些想法:
您不需要查询中的第一个$project
阶段。并且,您可以将{ "$toDate": "$originaltimestamp" }
包含在$group
阶段的_id
中,如下所示:
"_id": {
"$dateToString": {
"format": "%Y-%m-%d", "date": { "$toDate": "$originaltimestamp" }
}
}
关于$push: "$$ROOT"
-而不是$$ROOT
,只捕获您最需要(或重要)的字段。这是为了减少内存的使用。例如:
"data": {
$push: {
"subscriber_id": "$subscriber_id",
"type": "$type",
// other required fields...
}
}
最后,您可以考虑将查询限制为一组日期。这将需要在不同的日期范围内多次运行查询,但我认为它可能会更好。例如,匹配month
字段的月份。并且,可以对这个month
进行性能索引。这将需要在查询的开始(第一阶段)包含$match
阶段,例如:
{ $match: { month: "202001" } }
并且,这将查询2020年1月的数据。