How to find groups of documents with similar values in a specific field in Mongoose/MongoDB



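I want to check whether my collection contains duplicate documents so that I can delete or merge the similar records.

Assuming no target value is given, only a target field, all I need to do is find every document that has a similar value in that field.

For example, my collection persons contains the following documents: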

{
    _id: 1,
    email: "foo@bar.com",
    name: "tom",
    phone: 320513218,
    company: {
        name: "Bar",
        department: "Marketing"
    }
},{
    _id: 2,
    email: "foo@bar.com",
    name: "alex c",
    phone: 7320320813,
    company: {
        name: "Bar",
        department: "Development"
    }
},{
    _id: 3,
    email: "not_foo@not_bar.com",
    name: "alex w",
    phone: 895120981,
    company: {
        name: "Not Bar",
        department: "Development"
    }
},{
    _id: 4,
    email: "not_foo@not_bar.com",
    name: "emily",
    phone: 895120981,
    company: {
        name: "Another Company",
        department: "Marketing"
    }
},{
    _id: 5,
    email: "foo@bar.com",
    name: "emily",
    phone: 7320320813,
    company: {
        name: "Another Company",
        department: "Marketing"
    }
},...
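  1. First, I want to find duplicate documents by email. The result should be [{_id: 1, count: 3}, {_id: 2, count: 3}, {_id: 5, count: 3}, {_id: 3, count: 2}, {_id: 4, count: 2}]. (The order of the array does not matter.)

  2. Then I want to find duplicate documents by phone. The result should be [{_id: 2, count: 2}, {_id: 5, count: 2}, {_id: 3, count: 2}, {_id: 4, count: 2}]. (The order of the array does not matter.)

  3. Then I want to find duplicate documents by name. The result should be [{_id: 2, count: 2}, {_id: 3, count: 2}, {_id: 4, count: 2}, {_id: 5, count: 2}]

  4. Finally, I want to find duplicate documents by email and phone. The result should be [{_id: 2, count: 2}, {_id: 5, count: 2}]

(count should be the number of duplicate records, including the document itself.)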

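I have tried the mapReduce and aggregate methods provided by mongo/mongoose, but they did not give me what I expected.

I want something like "group by multiple (similar) fields and count".

If you need more information, such as my current sample code, please let me know.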

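Each duplicate search needs its own aggregation. In every case, just group on the (possibly compound) key that defines a duplicate, then push the _id values into an array and count the documents in each group: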

db.test.aggregate([
    { "$group" : { "_id" : KEY, "ids" : { "$push" : "$_id" }, "count" : { "$sum" : 1 } } }
])

For example, for phone:

db.test.aggregate([
    { "$group" : { "_id" : "$phone", "ids" : { "$push" : "$_id" }, "count" : { "$sum" : 1 } } }
])

For email and phone:

db.test.aggregate([
    { "$group" : { "_id" : { "phone" : "$phone", "email" : "$email" }, "ids" : { "$push" : "$_id" }, "count" : { "$sum" : 1 } } }
])

This gives output in a different shape from the one you asked for. For example, for your sample documents and phone you get:

{ "_id" : 895120981, "ids" : [3, 4], "count" : 2 },
{ "_id" : 7320320813, "ids" : [2, 5], "count" : 2 },
{ "_id" : 320513218, "ids" : [1], "count" : 1 }

But it carries the same information, and it is a simpler (and faster) aggregation.
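If the per-document shape from the question is still needed, the grouped output can be flattened on the client. A minimal sketch in plain JavaScript, assuming the results have already been fetched into an array; toPerDocumentCounts is a hypothetical helper name, not part of MongoDB or Mongoose:

```javascript
// Hypothetical helper: turn grouped results into one { _id, count } entry
// per document, keeping only groups with more than one member.
function toPerDocumentCounts(groups) {
  return groups
    .filter((g) => g.count > 1)
    .flatMap((g) => g.ids.map((id) => ({ _id: id, count: g.count })));
}

// Grouped output for phone, copied from the example above.
const byPhone = [
  { _id: 895120981, ids: [3, 4], count: 2 },
  { _id: 7320320813, ids: [2, 5], count: 2 },
  { _id: 320513218, ids: [1], count: 1 },
];

console.log(toPerDocumentCounts(byPhone));
```

The group for 320513218 is dropped because it has a single member, which matches the phone result listed in the question apart from ordering.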

To filter out the unique values, append a $match stage:

{ "$match" : { "count" : { "$gt" : 1 } } }
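Putting the two stages together, a small helper can build the pipeline for any set of fields. This is only a sketch: buildDuplicatesPipeline is a hypothetical name, and it assumes top-level field names (a dotted path such as company.name would not work as a key inside the compound _id).

```javascript
// Hypothetical helper: build a $group + $match pipeline that reports
// every value (or combination of values) shared by more than one document.
function buildDuplicatesPipeline(fields) {
  if (fields.length === 0) {
    throw new Error("at least one field is required");
  }
  // A single field is grouped on directly; several fields form a compound key.
  const key =
    fields.length === 1
      ? "$" + fields[0]
      : Object.fromEntries(fields.map((f) => [f, "$" + f]));
  return [
    { $group: { _id: key, ids: { $push: "$_id" }, count: { $sum: 1 } } },
    { $match: { count: { $gt: 1 } } },
  ];
}

const pipeline = buildDuplicatesPipeline(["phone", "email"]);
console.log(JSON.stringify(pipeline, null, 2));
// In the mongo shell the result could then be passed on as:
// db.test.aggregate(pipeline)
```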

Solution for question 1:

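db.test.aggregate([
    // Group by email, collecting the _ids and counting the members of each group.
    { $group: { _id: { email: "$email" }, id: { $push: "$_id" }, count: { $sum: 1 } } },
    // Emit one document per collected _id.
    { $unwind: "$id" },
    // Regroup so that each result is keyed by { _id, count }.
    { $group: { _id: { _id: "$id", count: "$count" } } }
])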
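The final $group above leaves each result shaped as { _id: { _id: ..., count: ... } }. If flat { _id, count } documents are preferred, a $project stage after $unwind is one alternative. A sketch under the same assumptions (a test collection, a single top-level field); the perDocumentCounts name is made up for illustration:

```javascript
// Hypothetical helper: pipeline that yields one { _id, count } document
// per duplicated record for a single top-level field.
function perDocumentCounts(field) {
  return [
    // Collect the _ids that share the same value of `field`.
    { $group: { _id: "$" + field, ids: { $push: "$_id" }, count: { $sum: 1 } } },
    // Drop values that occur only once.
    { $match: { count: { $gt: 1 } } },
    // Emit one document per collected _id.
    { $unwind: "$ids" },
    // Flatten to the shape requested in the question.
    { $project: { _id: "$ids", count: 1 } },
  ];
}

console.log(JSON.stringify(perDocumentCounts("email"), null, 2));
// In the mongo shell: db.test.aggregate(perDocumentCounts("email"))
```

For the five sample documents and email, this should report _id 1, 2 and 5 with count 3, and _id 3 and 4 with count 2.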
