Implementing MSSQL's PARTITION BY window clause in Pandas

I am migrating an MSSQL database to MySQL and have decided to move some stored procedures to Python rather than rewrite them in MySQL. I am using Pandas 0.23 on Python 3.5.4.

The old MSSQL database uses many window functions. So far I have successfully converted them to Pandas with pandas.DataFrame.rolling, as follows:

MSSQL

AVG([Close]) OVER (ORDER BY DateValue ROWS 13 PRECEDING) AS MA14

Python

df['MA14'] = df.Close.rolling(14).mean()

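I have been looking for a way to reproduce the PARTITION BY part of the MSSQL window function in Python. Since posting, and based on feedback, I have been working on a solution with pandas groupby...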

https://pandas.pydata.org/pandas-docs/version/0.23.0/groupby.html

For example, suppose the MSSQL is:

AVG([Close]) OVER (PARTITION BY myCol ORDER BY DateValue ROWS 13 PRECEDING) AS MA14

Here is what I have worked out so far:

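Col1 contains the categorical data that I want to group by before applying the rolling function. There is also a date column, so Col1 and the date column together identify a unique record in df.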

1. Gives the mean for each Col1 group, but aggregated to one row per group

grouped = df.groupby(['Col1']).mean()
print(grouped.tail(20))

2. Appears to apply the rolling mean within each categorical group of Col1. This is what I am after.

grouped = df.groupby(['Col1']).Close.rolling(14).mean()
print(grouped.tail(20))

3. Assign the result to df as a new column RM

df['RM'] = df.groupby(['Col1']).Close.rolling(14).mean()
print(df.tail(20))

It does not like this step; I get the error...

TypeError: incompatible index of inserted column with frame index

I have put together a simple example that might help:

How can I get the results of #2 into the df from #1, or something similar?

import numpy as np
import pandas as pd
dta = {'Colour': ['Red','Red','Blue','Blue','Red','Red','Blue','Red','Blue','Blue','Blue','Red'],
'Year': [2014,2015,2014,2015,2016,2017,2018,2018,2016,2017,2013,2013],
'Val':[87,78,863,673,74,81,756,78,694,701,804,69]}
df = pd.DataFrame(dta)
df = df.sort_values(by=['Colour','Year'], ascending=True)
print(df)
# 1: add a calculated column to the df. This averages over all of column Val
df['ValMA3'] = df.Val.rolling(3).mean().round(0)
print(df)

# 2: group by Colour. This calculates the average per group correctly.
# Where are the other columns from my original dataframe?
# What if I have multiple calculated columns to add?
gf = df.groupby(['Colour'])
gf = gf.Val.rolling(3).mean().round(0)
print(gf)

I am pretty sure the transform function can help.

df.groupby('Col1')['Val'].transform(lambda x: x.rolling(3, 2).mean())

Here, the value 3 is the size of the rolling window and 2 is the minimum number of periods.
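A minimal sketch of the same transform pattern, applied to the Colour/Year/Val sample data from the question. The choice of min_periods=1 is an assumption made here: SQL's ROWS n PRECEDING frame averages however many rows are available at the start of each partition instead of returning NULL, so min_periods=1 is the closest match. Pass a larger value if you want NaN until the window is full.

```python
import pandas as pd

df = pd.DataFrame({
    'Colour': ['Red', 'Red', 'Blue', 'Blue', 'Red', 'Red',
               'Blue', 'Red', 'Blue', 'Blue', 'Blue', 'Red'],
    'Year': [2014, 2015, 2014, 2015, 2016, 2017,
             2018, 2018, 2016, 2017, 2013, 2013],
    'Val': [87, 78, 863, 673, 74, 81, 756, 78, 694, 701, 804, 69],
})

# Sort first so each window follows ORDER BY Year inside its partition.
df = df.sort_values(by=['Colour', 'Year'])

# transform returns a Series aligned to df's own index, so it can be
# assigned directly as a new column (no index mismatch).
df['ValMA3'] = (
    df.groupby('Colour')['Val']
      .transform(lambda s: s.rolling(3, min_periods=1).mean())
)
print(df)
```

Because the result keeps the original row labels, further calculated columns can be added the same way, one transform per column.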

(Just don't forget to sort the dataframe before applying the running calculation.)
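As an alternative to transform, the groupby(...).rolling(...) result from the question can be assigned once its extra index level is dropped. The TypeError most likely arises because that result is indexed by (group key, original row label) while df has a single-level index. The sketch below uses a shortened, hypothetical subset of the sample data and a window of 2 purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Colour': ['Red', 'Blue', 'Red', 'Blue', 'Red', 'Blue'],
    'Year': [2014, 2014, 2013, 2013, 2015, 2015],
    'Val': [87, 863, 69, 804, 78, 673],
}).sort_values(by=['Colour', 'Year'])

# The rolling result carries a two-level index: (Colour, original row label).
rolled = df.groupby('Colour')['Val'].rolling(2).mean()
print(rolled.index.nlevels)

# Dropping the group level leaves the original row labels, so the values
# align with df on assignment.
df['ValMA2'] = rolled.reset_index(level=0, drop=True)
print(df)
```

The first row of each group is NaN here because the default min_periods equals the window size.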
