Visualizing MongoDB data statistics with matplotlib



I want to produce visual statistics with matplotlib from data stored in MongoDB, but the way I'm doing it now is really weird.

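I query MongoDB 30 times to get the data for each day, which is already slow and dirty, especially when I fetch the results from somewhere other than the server itself. Is there a better, cleaner way to get hourly, daily, monthly and yearly statistics?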

Here is some of the code I'm using right now (to get the daily statistics):

from datetime import datetime, date, time, timedelta
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from my_conn import my_mongodb
t1 = []
t2 = []
today = datetime.combine(date.today(), time())
with my_mongodb() as m:
    for i in range(30):
        day = today - timedelta(days = i)
        t1 = [m.data.find({"time": {"$gte": day, "$lt": day + timedelta(days = 1)}}).count()] + t1
        t2 = [m.data.find({"deleted": 0, "time": {"$gte": day, "$lt": day + timedelta(days = 1)}}).count()] + t2
x = range(30)
N = len(x)
def format_date(x, pos=None):
    day = today - timedelta(days = (N - x - 1))
    return day.strftime('%m/%d')
plt.bar(range(len(t1)), t1, align='center', color="#4788d2") #All
plt.bar(range(len(t2)), t2, align='center', color="#0c3688") #Not-deleted
plt.xticks(range(len(x)), [format_date(i) for i in x], size='small', rotation=30)
plt.grid(axis = "y")
plt.show()

UPDATE:

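I fundamentally misunderstood the problem. Felix is querying mongoDB to find out how many items fall into each range, so my approach doesn't work, because I was trying to ask mongoDB for the items themselves. Felix has a lot of data, so that is completely unreasonable.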

Felix, here's an updated function that should do what you want:

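def getDataFromLast(num, quantum):
    """ Count all items and non-deleted items in each of the
        last `num` steps of size `quantum`, oldest step first."""
    total = []
    not_deleted = []
    today = datetime.combine(date.today(), time())
    with my_mongodb() as m: # used as a context manager, as in the question
        for i in reversed(range(num)): # start from oldest
            day = today - i*quantum
            time_query = {"$gte": day, "$lt": day+quantum}
            # append, not extend: count() returns a single int.
            # (Cursor.count() was removed in PyMongo 4; count_documents() replaces it.)
            total.append(m.data.find({"time": time_query}).count())
            not_deleted.append(m.data.find({"deleted": 0, "time": time_query}).count())
    return total, not_deleted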

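The quantum is the "step" used when looking back. For example, to look at the last 12 hours, I would set quantum = timedelta(hours=1) and num = 12. An updated example usage would be: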
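To make the num/quantum idea concrete, here is a minimal sketch of the time ranges such a function walks through. The helper name bucket_ranges and the fixed reference time are made up for illustration; the ranges mirror the `day = today - i*quantum` logic, with the newest range starting at the reference time.

```python
from datetime import datetime, timedelta

def bucket_ranges(num, quantum, end):
    """Return (start, stop) pairs for `num` steps of size `quantum`,
    oldest first; the newest step starts at `end`."""
    return [(end - i * quantum, end - i * quantum + quantum)
            for i in reversed(range(num))]

# Hypothetical reference time, fixed so the output is reproducible.
end = datetime(2012, 5, 1, 12, 0)
for start, stop in bucket_ranges(3, timedelta(hours=1), end):
    print(start.strftime('%H:%M'), '->', stop.strftime('%H:%M'))
```

Each pair is a half-open range, matching the `$gte` / `$lt` bounds in the query, so no item is counted twice.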

from datetime import datetime, date, time, timedelta
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from my_conn import my_mongodb
#def getDataFromLast(num, quantum) as defined above
def format_date(x, N, pos=None):
    """ This is your format_date function. It now takes N
        (I still don't really understand what it is, though)
        as an argument instead of assuming that it's a global."""
    day = date.today() - timedelta(days=N-x-1)
    return day.strftime('%m/%d')
def plotBar(data, color):
    plt.bar(range(len(data)), data, align='center', color=color)

N = 30 # define the range that we want to look at
all, valid = getDataFromLast(N, timedelta(days=1)) # get the data
plotBar(all, "#4788d2") # plot both deleted and non-deleted data
plotBar(valid, "#0c3688") # plot only the valid data
plt.xticks(range(N), [format_date(i, N) for i in range(N)], size='small', rotation=30)
plt.grid(axis="y")
plt.show()

ORIGINAL:

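OK, here's my attempt at a refactor for you. Blubber suggested learning JS and MapReduce. That isn't necessary as long as you follow his other advice: create an index on the time field and reduce the number of queries. This is my best attempt, with a bit of refactoring along the way. I have a lot of questions and comments.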

Starting with:

with my_mongodb() as m:
    for i in range(30):
        day = today - timedelta(days = i)
        t1 = [m.data.find({"time": {"$gte": day, "$lt": day + timedelta(days = 1)}}).count()] + t1
        t2 = [m.data.find({"deleted": 0, "time": {"$gte": day, "$lt": day + timedelta(days = 1)}}).count()] + t2

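You are making a mongoDB request for each of the last 30 days to find all the data for that day. Why not use a single request? And once you have all the data, why not just filter out the deleted items?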

with my_mongodb() as m:
    # midnight today as a datetime: PyMongo can encode datetime.datetime but not datetime.date
    today = datetime.combine(date.today(), time())
    start_date = today - timedelta(days=30)
    t1 = list(m.data.find({"time": {"$gte": start_date}})) # all data since start_date (30 days ago)
    t2 = [x for x in t1 if x['deleted'] == 0] # all data since start_date that isn't deleted

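I really don't know why you're making 60 requests (30 * 2: one for all the data, one for the non-deleted data). Is there a particular reason you accumulate the data day by day?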
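If everything is fetched in one request, the per-day numbers still have to be computed on the client. A minimal sketch of that step, assuming each document has a datetime "time" field and a 0/1 "deleted" field as in the question; the helper name count_per_day and the sample documents are made up. It is only reasonable when the result set is small enough to transfer.

```python
from collections import Counter
from datetime import datetime

def count_per_day(docs):
    """Return (total, not_deleted) Counters keyed by calendar date."""
    total = Counter()
    not_deleted = Counter()
    for doc in docs:
        day = doc["time"].date()
        total[day] += 1
        if doc["deleted"] == 0:
            not_deleted[day] += 1
    return total, not_deleted

# Made-up documents standing in for the result of a single find() call.
docs = [
    {"time": datetime(2012, 5, 1, 9, 30), "deleted": 0},
    {"time": datetime(2012, 5, 1, 17, 5), "deleted": 1},
    {"time": datetime(2012, 5, 2, 8, 0), "deleted": 0},
]
total, not_deleted = count_per_day(docs)
for day in sorted(total):
    print(day, total[day], not_deleted[day])
```

Days with no documents simply do not appear in the Counters, so a plot over a fixed 30-day window would need to look each day up and fall back to 0.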

Then, you have:

x = range(30)
N = len(x)

Why not:

N = 30
x = range(N)

len(range(N)) is equal to N, but costs computation time. The way you originally wrote it is a bit... odd.

Here's my attempt, with the changes I'd suggest, written as generally as possible.

from datetime import datetime, date, time, timedelta
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from my_conn import my_mongodb
def getDataFromLast(delta):
    """ Delta is a timedelta for however long ago you want to look
        back. For instance, to find everything within the last month,
        delta should = timedelta(days=30). Last hour? timedelta(hours=1).
        Returns the documents themselves, not per-day counts."""
    # midnight today as a datetime: PyMongo cannot encode a plain datetime.date
    today = datetime.combine(date.today(), time())
    start_date = today - delta
    with my_mongodb() as m: # used as a context manager, as in the question
        all_data = list(m.data.find({"time": {"$gte": start_date}}))
    valid_data = [x for x in all_data if x['deleted'] == 0] # all data that isn't deleted
    return all_data, valid_data
def format_date(x, N, pos=None):
    """ This is your format_date function. It now takes N
        (I still don't really understand what it is, though)
        as an argument instead of assuming that it's a global."""
    day = date.today() - timedelta(days=N-x-1)
    return day.strftime('%m/%d')
def plotBar(data, color):
    plt.bar(range(len(data)), data, align='center', color=color)
N = 30 # define the range that we want to look at
all, valid = getDataFromLast(timedelta(days=N))
plotBar(all, "#4788d2") # plot both deleted and non-deleted data
plotBar(valid, "#0c3688") # plot only the valid data
plt.xticks(range(N), [format_date(i, N) for i in range(N)], size='small', rotation=30)
plt.grid(axis="y")
plt.show()

Thanks to @Blubber, I've now found a better way to handle this, using Map/Reduce.

The data-fetching part was rewritten as:

from dateutil import parser
parse_time = lambda s: parser.parse(s, ignoretz = True)
func_map = """
function() {
    if (this.hasOwnProperty("time"))
        emit(this.time.getUTCFullYear() + "/" + (this.time.getUTCMonth() + 1) + "/" + this.time.getUTCDate(),
        {
            count: 1,
            not_deleted: (1 - this.deleted)
        });
}
"""
func_reduce = """
function(key, values) {
    var result = {count: 0, not_deleted: 0};
    values.forEach(function(value) {
        result.count += value.count;
        result.not_deleted += value.not_deleted;
    });
    return result;
}
"""
with my_mongodb() as m:
    result = m.data.inline_map_reduce(func_map, func_reduce)
    dataset = {parse_time(day['_id']): day['value']['not_deleted'] for day in result}
    dataset2 = {parse_time(day['_id']): day['value']['count'] for day in result}

Since I'm new to JS, there must be a better way to write those JS functions :)
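For anyone unsure what the two JS functions compute, here is a rough pure-Python emulation of the same map and reduce steps, run on made-up documents instead of a real collection. The names map_doc, reduce_values and map_reduce are hypothetical, and the timestamps are assumed to be naive datetimes already in UTC (the JS version reads the UTC components explicitly).

```python
from collections import defaultdict
from datetime import datetime

def map_doc(doc):
    """Mimic func_map: emit (day key, partial counts) for one document."""
    t = doc["time"]
    key = "%d/%d/%d" % (t.year, t.month, t.day)
    return key, {"count": 1, "not_deleted": 1 - doc["deleted"]}

def reduce_values(values):
    """Mimic func_reduce: sum the partial counts emitted for one key."""
    result = {"count": 0, "not_deleted": 0}
    for value in values:
        result["count"] += value["count"]
        result["not_deleted"] += value["not_deleted"]
    return result

def map_reduce(docs):
    """Group emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for doc in docs:
        if "time" in doc:  # same guard as hasOwnProperty("time")
            key, value = map_doc(doc)
            groups[key].append(value)
    return {key: reduce_values(values) for key, values in groups.items()}

# Made-up sample documents.
docs = [
    {"time": datetime(2012, 5, 1, 9, 30), "deleted": 0},
    {"time": datetime(2012, 5, 1, 17, 5), "deleted": 1},
    {"time": datetime(2012, 5, 2, 8, 0), "deleted": 0},
    {"deleted": 0},  # no "time" field, skipped
]
for key, value in sorted(map_reduce(docs).items()):
    print(key, value)
```

The real map/reduce runs on the server, so only one small result per day crosses the network; this emulation only illustrates the grouping logic.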
