Forecasting with WindowSummarizer and exogenous features



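WindowSummarizer lets you capture time series features over specified rolling windows. I tried to modify an example I found in the documentation. The feature does not seem to work with models that actually use exogenous features.

Here is a minimal working example based on the documentation: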

import pandas as pd

from sktime.forecasting.base import ForecastingHorizon
from sktime.transformations.series.impute import Imputer
from sktime.datasets import load_airline, load_longley
from sktime.forecasting.arima import AutoARIMA
from sktime.forecasting.naive import NaiveForecaster
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.forecasting.compose import ForecastingPipeline
from sktime.transformations.series.window_summarizer import WindowSummarizer

y, X = load_longley()
y_train, y_test, X_train, X_test = temporal_train_test_split(y, X)
# fh was not defined in the original snippet; an absolute horizon over
# the test index is assumed here, as in the test further below
fh = ForecastingHorizon(y_test.index, is_relative=False)

kwargs = {
    "lag_config": {
        "mean": ["mean", [[3, 0], [4, 0]]],
    }
}

# combine the exogenous columns and the target into a single frame
Z_train = pd.concat([X_train, y_train], axis=1)
Z_test = pd.concat([X_test, y_test], axis=1)

pipe = ForecastingPipeline(
    steps=[
        ("ws", WindowSummarizer(**kwargs, n_jobs=1, target_cols=["GNP"])),
        ("imputer", Imputer("mean")),
        ("forecaster", NaiveForecaster(strategy="drift")),
    ]
)
pipe_return = pipe.fit(y_train, Z_train)
y_pred = pipe_return.predict(fh=fh, X=Z_test)  # this works

If we switch the forecaster to one that actually uses the engineered features, things no longer go so smoothly:

pipe = ForecastingPipeline(
    steps=[
        ("ws", WindowSummarizer(**kwargs, n_jobs=1, target_cols=["GNP"])),
        ("imputer", Imputer("mean")),
        ("forecaster", AutoARIMA()),
    ]
)
pipe.fit(y_train, X=Z_train)
pipe.predict(fh=fh, X=Z_test)  # this throws an error

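I suspect this is related to the lack of continuity between Z_train and Z_test. The second issue is the Imputer. I think it does not work the way it should: after fitting, it should store the values used to fill the empty fields.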

ws = pipe.steps_[0][1]
imp = pipe.steps_[1][1]
imp._transform(ws._transform(Z_test)) 

gives

       GNP_mean_3_0  GNP_mean_4_0  GNPDEFL   UNEMP   ARMED       POP   TOTEMP
1959  501159.333333           NaN    112.6  3813.0  2552.0  123366.0  68655.0
1960  501159.333333           NaN    114.2  3931.0  2514.0  125368.0  69564.0
1961  501159.333333           NaN    115.7  4806.0  2572.0  127852.0  69331.0
1962  501159.333333           NaN    116.9  4007.0  2827.0  130081.0  70551.0
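
The continuity problem suspected above can be reproduced with pandas alone. The following is a minimal sketch on made-up numbers, not the sktime implementation: a rolling mean computed on the test rows only has no history to draw on, while the same mean computed on train and test together does. The last lines show one possible reading of "store the fill value at fit time"; the variable names and the choice of fill value are assumptions for illustration.

```python
import pandas as pd

# Toy data (made up): four training rows followed by three test rows
train = pd.Series([10.0, 12.0, 14.0, 16.0], index=range(4))
test = pd.Series([18.0, 20.0, 22.0], index=range(4, 7))

# Rolling mean over 3 rows using the test rows only:
# the first two windows are incomplete and come out as NaN
test_only = test.rolling(3).mean()

# Same rolling mean computed on train + test, then cut back to the test index:
# every test row now has a complete window
full = pd.concat([train, test])
with_history = full.rolling(3).mean().loc[test.index]

# One way to "remember" a fill value: compute it on the training features
# and reuse it on the test features instead of re-estimating it there
train_fill = train.rolling(3).mean().mean()
filled = test_only.fillna(train_fill)

print(test_only.tolist())
print(with_history.tolist())
print(filled.tolist())
```

Filling with a constant hides the gap but does not restore the information in the missing history, which is why carrying the training rows forward is the more faithful fix.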

Library version .10 and later changed the behavior of WindowSummarizer. It should now work without any problems.

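I think I have a workaround. It is not the most elegant solution, but it gets the job done. I modified WindowSummarizer so that it stores either the minimal window of X needed to compute all aggregates or all records of X seen so far (the default option).

Whenever .transform is applied, the summarizer tries to update the window and recompute the (correct!) aggregates. For simplicity, I focus here only on the summarizer and on a simpler dataset.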

def update_X(self, X):
    """Merge the target columns of X into the stored window self.X_window."""
    if self.target_cols is None:
        cols = X.columns
    else:
        cols = self.target_cols
    X_window = pd.concat([self.X_window, X[cols]], axis=0)
    # collapse duplicated index labels, keeping the first value per label
    X_window = X_window.groupby(X_window.index).first()
    # would remember only last #min_window rows
    # self.X_window = X_window.iloc[-self.min_window:]
    # would remember all rows
    self.X_window = X_window


def window_size(windows):
    """Return the number of past rows a window specification needs."""
    # a list of pairs needs the largest pair sum; a single pair needs its sum
    if all(isinstance(i, list) for i in windows):
        return max(map(sum, windows))
    return sum(windows)

class WS(WindowSummarizer):
    def __init__(
        self,
        lag_config,
        n_jobs=-1,
        target_cols=None,
        truncate=None,
    ):
        self.lag_config = lag_config
        self.n_jobs = n_jobs
        self.target_cols = target_cols
        self.truncate = truncate
        self._converter_store_X = dict()

        # calculates the minimal window required to calculate the window summaries in lag_config
        self.min_window = max([window_size(x[1]) for key, x in lag_config.items()])
        # empty data frame for data window
        self.X_window = pd.DataFrame()

        # the parameters are already set above, so skip WindowSummarizer.__init__
        # and initialise the base class directly (the original call lacked self)
        super(WindowSummarizer, self).__init__()

    def _fit(self, X, y=None):
        # remember the training rows before fitting
        update_X(self, X)
        return super()._fit(X, y)

    def _transform(self, X, y=None):
        # prepend the remembered rows so that every window is complete
        X_window = pd.concat([self.X_window, X], axis=0)
        X_window = X_window.groupby(X_window.index).first()
        X_transformed = super()._transform(X_window, y)
        update_X(self, X)
        # return only the rows that were asked for
        return X_transformed.loc[X.index]
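
Stripped of the sktime specifics, the pattern used by WS is: remember the rows seen so far, recompute the summary on history plus new rows, and return only the rows that were requested. A minimal pandas-only sketch of that idea follows; the class RollingMeanWithHistory and its methods are hypothetical names for illustration and not part of sktime.

```python
import pandas as pd


class RollingMeanWithHistory:
    """Hypothetical helper: rolling mean that remembers previously seen rows."""

    def __init__(self, window):
        self.window = window
        self.history = None  # filled on the first call to fit/transform

    def _remember(self, s):
        # append the new rows, keep the first value per index label, sort by index
        combined = s if self.history is None else pd.concat([self.history, s])
        self.history = combined[~combined.index.duplicated(keep="first")].sort_index()

    def fit(self, s):
        self._remember(s)
        return self

    def transform(self, s):
        self._remember(s)
        # compute the summary on everything seen so far ...
        summary = self.history.rolling(self.window).mean()
        # ... but hand back only the rows that were passed in
        return summary.loc[s.index]


# Toy data (made up): five training rows, two test rows
train = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=range(5))
test = pd.Series([6.0, 7.0], index=range(5, 7))

summarizer = RollingMeanWithHistory(window=3).fit(train)
out = summarizer.transform(test)
print(out.tolist())
```

Because duplicated index labels are dropped, calling transform twice on the same rows does not grow the stored history. Keeping all rows is the simplest choice; trimming the history to the largest window would bound memory at the cost of a little more bookkeeping.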

Here is a small test:

y = load_airline()
y_train, y_test = temporal_train_test_split(y.iloc[:10])
fh = ForecastingHorizon(y_test.index, is_relative=False)
kwargs = {
    "lag_config": {
        "mean": ["mean", [[3, 1], [4, 1]]],
    }
}
ws = WS(**kwargs, n_jobs=1)
ws.fit(pd.DataFrame(y_train),y_train)
ws.transform(pd.DataFrame(y_test))

         Number of airline passengers_mean_3_1  Number of airline passengers_mean_4_1
1949-08                             128.333333                                 129.25
1949-09                             134.666667                                 133.25
1949-10                             143.666667                                 138.00
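
These numbers can be cross-checked by hand. The sketch below hard-codes the first ten monthly values of the airline passengers series (1949-01 to 1949-10) so that it runs without any dependencies. The helper lagged_mean is a hypothetical name, and the shift of two periods is inferred from the printed output above, not taken from the sktime documentation.

```python
# First ten monthly values of the airline passengers series, 1949-01 .. 1949-10
passengers = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]
months = [f"1949-{m:02d}" for m in range(1, 11)]


def lagged_mean(values, pos, window, shift=2):
    """Mean of `window` values whose last element lies `shift` steps before `pos`."""
    end = pos - shift + 1  # exclusive end of the slice
    chunk = values[end - window:end]
    return sum(chunk) / len(chunk)


# Recompute both columns for the three test months (positions 7, 8, 9)
check = {
    months[pos]: (lagged_mean(passengers, pos, 3), lagged_mean(passengers, pos, 4))
    for pos in (7, 8, 9)
}

for month, (mean_3, mean_4) in check.items():
    print(month, round(mean_3, 6), round(mean_4, 2))
```

For 1949-08, for example, the three-row mean covers April to June (129, 121, 135), so the values come from the training rows remembered during fit, which is exactly what the modified summarizer is meant to achieve.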
