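Python + MongoDB: cursor iteration too slow (unsolved)

I have a database find query that returns 150k documents, each containing three integer fields and one datetime field. The code below tries to create a list from the cursor object. Iterating over the cursor is extremely slow: about 80 seconds! The same operation through the C++ driver is orders of magnitude faster, so this must be a PyMongo problem?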

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.taq
collection_str = "mycollection"
db_collection = db[collection_str]
mylist = list(db_collection.find())

This has been discussed before, and I tried the suggestions. One is to change the default batch size, so I tried the following:

cursor = db_collection.find()
cursor.batch_size(10000)
mylist = list(cursor)

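However, this made no difference. The second suggestion was to check whether the C extensions are installed. I have them installed, so that is not the problem. The MongoDB server runs on the same machine, so it is not a network issue either, and the query runs fine from C++. The problem is querying from PyMongo.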
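For reference, a minimal sketch of how the C extension check can be done from Python. It relies on `pymongo.has_c()` and `bson.has_c()`; the wrapper function name `c_extension_status` is made up for this example, and it returns `None` when PyMongo is not importable at all.

```python
def c_extension_status():
    """Report whether PyMongo's optional C extensions are loaded.

    Returns None if PyMongo is not installed in this interpreter.
    """
    try:
        import bson
        import pymongo
    except ImportError:
        return None
    return {
        "pymongo": pymongo.has_c(),  # C helpers for wire-protocol messages
        "bson": bson.has_c(),        # C helpers for BSON encoding/decoding
    }


status = c_extension_status()
print(status)
```

If either value is `False`, BSON decoding falls back to pure Python, which is one commonly cited cause of slow cursor iteration.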

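Since MongoDB is advertised as being able to handle big data, surely there is a way to retrieve data quickly from Python? This question has been asked before, but I have not found a solution yet. Does anyone have a suggestion that actually works? In this case I am retrieving 150k documents, but a typical query will retrieve 1 million, so this is going to be a real problem for me.

Thanks.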

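I can't reproduce this. I am loading 150k documents and converting them to a list in roughly 0.5 to 0.8 seconds. Here are the results of the timeit test script, in seconds, for converting 150000 documents from the database into a list.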

--------------------------------------------------
Default batch size
0.530369997025
--------------------------------------------------
Batch Size 1000
0.570069074631
--------------------------------------------------
Batch Size 10000
0.686305046082

Here is my test script:

#!/usr/bin/env python
# Requires Python 3 and PyMongo 3.7+ (count_documents, insert_many)
import timeit


def main():
    """
    Testing loading 150k documents in pymongo
    """
    # Setup runs once per timer and is not included in the measured time
    setup = """
import datetime
from random import randint
from pymongo import MongoClient
connection = MongoClient()
db = connection.test_load
sample = db.sample
if sample.count_documents({}) < 150000:
    connection.drop_database('test_load')
    # Insert 150k sample documents, 10 per insert
    for i in range(15000):
        sample.insert_many([{"date": datetime.datetime.now(),
                             "int1": randint(0, 1000000),
                             "int2": randint(0, 1000000),
                             "int4": randint(0, 1000000)} for _ in range(10)])
"""
    stmt = """
from pymongo import MongoClient
connection = MongoClient()
db = connection.test_load
sample = db.sample
cursor = sample.find()
test = list(cursor)
assert len(test) == 150000
"""
    print("-" * 100)
    print("Default batch size")
    t = timeit.Timer(stmt=stmt, setup=setup)
    print(t.timeit(1))
    stmt = """
from pymongo import MongoClient
connection = MongoClient()
db = connection.test_load
sample = db.sample
cursor = sample.find()
cursor.batch_size(1000)
test = list(cursor)
assert len(test) == 150000
"""
    print("-" * 100)
    print("Batch Size 1000")
    t = timeit.Timer(stmt=stmt, setup=setup)
    print(t.timeit(1))
    stmt = """
from pymongo import MongoClient
connection = MongoClient()
db = connection.test_load
sample = db.sample
cursor = sample.find()
cursor.batch_size(10000)
test = list(cursor)
assert len(test) == 150000
"""
    print("-" * 100)
    print("Batch Size 10000")
    t = timeit.Timer(stmt=stmt, setup=setup)
    print(t.timeit(1))


if __name__ == "__main__":
    main()

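I'm puzzled how you are getting 80 seconds rather than 0.8! I created my sample data based on your description. How different is it from your actual data?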
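One way to narrow down a gap like this is to separate the time until the first document arrives from the time for the whole pass. The helper below is only a sketch: `time_iteration` is a name invented here, and it works on any iterable, so a PyMongo cursor can be passed in directly.

```python
import time


def time_iteration(iterable):
    """Make one pass over iterable.

    Returns (count, seconds_to_first_item, total_seconds);
    seconds_to_first_item is None when the iterable is empty.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in iterable:
        if count == 0:
            # Time until the first item was available
            first = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return count, first, total


# Stand-in data; with PyMongo this would be time_iteration(db_collection.find())
count, first, total = time_iteration(range(1000))
print(count, first, total)
```

As a rough guide, a long wait for the first item suggests the time is spent on the server or in the initial query, while a short wait followed by a long total suggests transfer or document decoding on the client.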

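Not sure whether this helps if you are returning every item in the collection (as opposed to querying on some field), but have you tried creating an index on the field?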

import pymongo
db_collection.create_index([("field_name", pymongo.ASCENDING)])
db_collection.reindex()  # PyMongo 3.x only; reindex() was removed in PyMongo 4.0

Docs: https://api.mongodb.org/python/current/api/pymongo/collection.html
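To see whether an index is actually used, the output of `Cursor.explain()` can be inspected. The sketch below assumes the plan is nested under `queryPlanner.winningPlan` with `stage` keys, which is the usual shape but can vary by server version. The helper names `plan_stages` and `uses_index` and the two sample dictionaries are hand-written for illustration, not real server output.

```python
def plan_stages(plan):
    """Collect every "stage" name found anywhere in a nested explain plan."""
    stages = []
    if isinstance(plan, dict):
        if "stage" in plan:
            stages.append(plan["stage"])
        for value in plan.values():
            stages.extend(plan_stages(value))
    elif isinstance(plan, list):
        for item in plan:
            stages.extend(plan_stages(item))
    return stages


def uses_index(explain_output):
    """True if the winning plan contains an index scan (IXSCAN) stage."""
    winning = explain_output.get("queryPlanner", {}).get("winningPlan", {})
    return "IXSCAN" in plan_stages(winning)


# Hand-written samples shaped like explain output (assumed, not captured)
indexed_query = {"queryPlanner": {"winningPlan": {
    "stage": "FETCH",
    "inputStage": {"stage": "IXSCAN", "indexName": "field_name_1"}}}}
full_scan = {"queryPlanner": {"winningPlan": {"stage": "COLLSCAN"}}}

print(uses_index(indexed_query), uses_index(full_scan))
```

With a live collection the call would look like `uses_index(db_collection.find({"field_name": 5}).explain())`. An unfiltered `find()` normally reports a collection scan regardless of which indexes exist, which is why an index is unlikely to speed up fetching every document.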
