Filling in missing data in a Django queryset



I inherited an AngularJS/Django application (Django REST Framework, Postgres DB) that is being re-platformed from AngularJS to React/Redux. One of the things we are trying to do is render time-series data with amCharts4. One problem we have run into (among many others) is rendering data over a time range that may have no entries in the database. For example, our results might look like this:

[
    {
        "date": "2020-01-16T00:00:00.000Z",
        "result": 3
    },
    {
        "date": "2020-01-18T00:00:00.000Z",
        "result": 2
    }
]

and we want them to look like this:

[
    {
        "date": "2020-01-16T00:00:00.000Z",
        "result": 3
    },
    {
        "date": "2020-01-17T00:00:00.000Z",
        "result": 0
    },
    {
        "date": "2020-01-18T00:00:00.000Z",
        "result": 2
    }
]
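This first case (one point per date) boils down to a small transformation: build the full day range, then look each day up in a dict and default to 0. A minimal pure-Python sketch, assuming every timestamp is midnight UTC (the function name is just for illustration):

```python
from datetime import datetime, timedelta

def fill_missing_days(rows):
    """Fill absent days between the min and max date with result 0."""
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
    by_date = {datetime.strptime(r["date"], fmt).date(): r["result"] for r in rows}
    start, end = min(by_date), max(by_date)
    out = []
    day = start
    while day <= end:
        out.append({
            "date": day.strftime("%Y-%m-%dT00:00:00.000Z"),
            "result": by_date.get(day, 0),
        })
        day += timedelta(days=1)
    return out

rows = [
    {"date": "2020-01-16T00:00:00.000Z", "result": 3},
    {"date": "2020-01-18T00:00:00.000Z", "result": 2},
]
filled = fill_missing_days(rows)
```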

We also have data with multiple data points per time event:

[
    {
        "date": "2020-01-13T00:00:00Z",
        "result": 1,
        "name": "Yes"
    },
    {
        "date": "2020-01-14T00:00:00Z",
        "result": 1,
        "name": "No"
    },
    {
        "date": "2020-01-16T00:00:00Z",
        "result": 1,
        "name": "No"
    }
]

and we want to fill in the data with 0 for any `name` on any date that has no result:

[
    {
        "date": "2020-01-13T00:00:00Z",
        "result": 1,
        "name": "Yes"
    },
    {
        "date": "2020-01-13T00:00:00Z",
        "result": 0,
        "name": "No"
    },
    {
        "date": "2020-01-14T00:00:00Z",
        "result": 0,
        "name": "Yes"
    },
    {
        "date": "2020-01-14T00:00:00Z",
        "result": 1,
        "name": "No"
    },
    {
        "date": "2020-01-15T00:00:00Z",
        "result": 0,
        "name": "Yes"
    },
    {
        "date": "2020-01-15T00:00:00Z",
        "result": 0,
        "name": "No"
    },
    {
        "date": "2020-01-16T00:00:00Z",
        "result": 0,
        "name": "Yes"
    },
    {
        "date": "2020-01-16T00:00:00Z",
        "result": 1,
        "name": "No"
    }
]

These results are also not necessarily bounded by the first and last dates present in the data; the range can be specified by the user. In that case, we need to fill in zero-valued results for every option on every date in the range.

I am aware of amCharts' `skipEmptyPeriods` property (amCharts4 - skipEmptyPeriods), but my front-end engineer tells me it does not work with multiple trend lines (i.e., the second case, with multiple options per date). Besides, this is not really a front-end problem, and pushing it to the front end would cause performance issues.

I also tried using Postgres's `generate_series` function together with `coalesce` (Postgresql - generate_series), but could not get it to work for the second case.

Currently I am attempting this in Pandas (which I have never used before). I have a solution for the first case (a single entry per date), but again, I am running into trouble with the second case, where each date has multiple entries:

import json
from datetime import datetime

import pandas as pd
import pytz
from django.utils.timezone import localtime, now

from_date = request.query_params.get("from_date")
to_date = request.query_params.get("to_date")
# let's do some zero plotting
filtered_queryset = list(filtered_queryset)
if from_date:
    from_date = datetime.strptime(from_date, "%Y-%m-%d").astimezone(pytz.UTC)
else:
    from_date = filtered_queryset[0]["date"]
if to_date:
    to_date = datetime.strptime(to_date, "%Y-%m-%d").astimezone(pytz.UTC)
    _now = localtime(now()).astimezone(pytz.UTC)
    to_date = min(to_date, _now)
else:
    to_date = localtime(now()).astimezone(pytz.UTC)
pandas_freq_map = {"day": "D", "week": "W-MON", "month": "MS"}
freq = pandas_freq_map.get(request.query_params.get("frequency"))
idx = pd.date_range(from_date.date(), to_date.date(), freq=freq)
df = pd.DataFrame(list(filtered_queryset))
datetime_series = pd.to_datetime(df["date"])
datetime_index = pd.DatetimeIndex(datetime_series.values)
df = df.set_index(datetime_index)
df.drop("date", axis=1, inplace=True)
df = df.asfreq(freq)
df = df.reindex(idx, fill_value=0)
df_json = json.JSONDecoder().decode(df.to_json(date_format="iso"))
# this (result or 0) tomfoolery is bc I don't understand why pandas
# sometimes reindexes with null as the fill_value
prepared_response = [
    {"date": date, "result": (result or 0)}
    for date, result in df_json["result"].items()
]
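On the `(result or 0)` puzzle at the end: `reindex`'s `fill_value` only applies to labels that are *not* already in the index, and the earlier `df.asfreq(freq)` has already inserted the missing dates with `NaN`. A small demonstration of the difference, using a hypothetical two-row frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"result": [3, 2]},
    index=pd.to_datetime(["2020-01-16", "2020-01-18"]),
)
idx = pd.date_range("2020-01-16", "2020-01-19", freq="D")

# asfreq first: 2020-01-17 is inserted as NaN, and the later reindex
# keeps it NaN because that label already exists in the index.
via_asfreq = df.asfreq("D").reindex(idx, fill_value=0)

# reindex alone: every missing label gets fill_value=0.
via_reindex = df.reindex(idx, fill_value=0)
```

So dropping the `asfreq` call (or appending `.fillna(0)` after it) should make the `(result or 0)` workaround unnecessary.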

Below is an attempt at a solution in pandas. Basically, you resample and then reindex with a date range, but with the compound index this gets a bit clumsy.

Set up the data

import pandas as pd

data = [
    {"date": "2020-01-16T00:00:00.000Z", "result": 3},
    {"date": "2020-01-18T00:00:00.000Z", "result": 2},
    {"date": "2020-01-13T00:00:00Z", "result": 1, "name": "Yes"},
    {"date": "2020-01-14T00:00:00Z", "result": 1, "name": "No"},
    {"date": "2020-01-16T00:00:00Z", "result": 1, "name": "No"},
]
# build dataframe
df = pd.DataFrame(data)
df["name"] = df["name"].fillna("No")
df["date"] = pd.to_datetime(df["date"])

Then process the data

# set up date range
idx = pd.date_range(df["date"].min(), df["date"].max(), freq="H")

# resample yes/no for name separately
df = df.set_index(["name", "date"]).sort_index()
no = df.loc["No"].resample(rule="60min").sum().reset_index()
no["Name"] = ["No"] * len(no)
no.set_index(["Name", "date"], inplace=True)
yes = df.loc["Yes"].resample(rule="60min").sum().reset_index()
yes["Name"] = ["Yes"] * len(yes)
yes.set_index(["Name", "date"], inplace=True)

# reindex with the full date range
yes = yes.reindex(
    pd.MultiIndex.from_arrays([["Yes"] * len(idx), idx], names=("Name", "date")),
    fill_value=0,
)
no = no.reindex(
    pd.MultiIndex.from_arrays([["No"] * len(idx), idx], names=("Name", "date")),
    fill_value=0,
)

# merge and create output (date format has to be adjusted)
df = pd.concat([yes, no], axis=0)
df.reset_index().to_dict("records")

Result

[{'Name': 'Yes',
  'date': Timestamp('2020-01-13 00:00:00+0000', tz='UTC'),
  'result': 1},
 {'Name': 'Yes',
  'date': Timestamp('2020-01-13 01:00:00+0000', tz='UTC'),
  'result': 0}, ....
]
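The per-name resample/reindex above can also be collapsed into a single `reindex` against `pd.MultiIndex.from_product` over all names and all dates, which avoids hand-building the Yes/No frames; running the result through `to_json` then handles the date-format adjustment noted in the comment. A sketch assuming daily frequency and the multi-name data from the question:

```python
import json

import pandas as pd

data = [
    {"date": "2020-01-13T00:00:00Z", "result": 1, "name": "Yes"},
    {"date": "2020-01-14T00:00:00Z", "result": 1, "name": "No"},
    {"date": "2020-01-16T00:00:00Z", "result": 1, "name": "No"},
]
df = pd.DataFrame(data)
df["date"] = pd.to_datetime(df["date"])

# full grid: every name crossed with every date in the observed range
idx = pd.date_range(df["date"].min(), df["date"].max(), freq="D")
full = pd.MultiIndex.from_product(
    [sorted(df["name"].unique()), idx], names=["name", "date"]
)
out = df.set_index(["name", "date"]).reindex(full, fill_value=0).reset_index()

# Timestamps are not JSON-serializable; to_json emits ISO-8601 strings
records = json.loads(out.to_json(orient="records", date_format="iso"))
```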

Continuing with the Postgres approach, I did find a query that works:

WITH
unnested_select AS (
SELECT unnest(forms_completedformfield.value_text_array) as unnested_array,
date_trunc('day', created) as created
FROM forms_completedformfield
WHERE forms_completedformfield.completed_survey_id =
ANY(
ARRAY['815251ac-3891-4206-b876-d17898b74e66'::uuid, '74aea6f5-9860-4fe5-8820-68a279726c83'::uuid, '173ea91f-0dc8-4a6c-b330-7c3cee13e1b4'::uuid]
)
GROUP BY unnested_array,
created
),
range_counts AS (
SELECT date_trunc('day', unnested_select.created) as date,
count(unnested_select.unnested_array) as ct,
unnested_select.unnested_array as ar
FROM unnested_select
WHERE unnested_select.unnested_array =
ANY(
ARRAY['2b0076f1-7be5-4e52-9879-47e4eeafe175']
) 
GROUP BY unnested_select.unnested_array,
unnested_select.created
),
range_sums AS (
SELECT date_trunc('day', unnested_select.created) as date,
count(unnested_select.unnested_array) as ct
FROM unnested_select
GROUP BY unnested_select.created
),
range_values AS ( 
SELECT date_trunc('day', min(created)) as minval,
date_trunc('day', max(created)) as maxval
FROM unnested_select
),
frequency_range AS (
SELECT generate_series(minval, maxval, '1 day'::interval) as date
FROM range_values
),
field_options AS (
SELECT
DISTINCT unnested_select.unnested_array as ar,
frequency_range.date
FROM unnested_select
CROSS JOIN frequency_range
)
SELECT  
frequency_range.date as fd,
field_options.ar as far,
range_counts.ar as rar,
range_counts.ct as ct
FROM frequency_range
LEFT OUTER JOIN field_options ON frequency_range.date = field_options.date
LEFT OUTER JOIN range_counts ON frequency_range.date = range_counts.date and field_options.ar = range_counts.ar
ORDER BY 
frequency_range.date

Obviously, the hard-coded values in the ARRAYs would be replaced.
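The shape of the final SELECT (a `generate_series` × options cross join, left-joined against the per-day counts) can be mirrored in plain Python, which may help when sanity-checking the query's output. Note that the left join leaves `ct` NULL on the filled rows unless wrapped in `COALESCE`; `counts.get(..., 0)` models that default of 0 here. The dates and option names are hypothetical stand-ins:

```python
from datetime import date, timedelta
from itertools import product

# observed counts keyed by (day, option) -- stand-ins for range_counts
counts = {
    (date(2020, 1, 13), "Yes"): 1,
    (date(2020, 1, 14), "No"): 1,
    (date(2020, 1, 16), "No"): 1,
}

# frequency_range: every day between the min and max observed date
days = sorted({d for d, _ in counts})
full_range = [days[0] + timedelta(n) for n in range((days[-1] - days[0]).days + 1)]

# field_options CROSS JOIN frequency_range, LEFT JOIN counts (0 when absent)
options = sorted({opt for _, opt in counts})
rows = [
    {"date": d, "name": opt, "result": counts.get((d, opt), 0)}
    for d, opt in product(full_range, options)
]
```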