I have to use scikit-learn's KNeighborsClassifier in Python with a user-defined function to compare time series.
knn = KNeighborsClassifier(n_neighbors=1,weights='distance',metric='pyfunc',func=dtw_dist)
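The problem is that KNeighborsClassifier does not seem to support my training data. They are time series, so they are lists of different lengths. When I try to call the fit method (knn.fit(X,Y)), KNeighborsClassifier gives me this error message: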
ValueError: data type not understood
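It seems that KNeighborsClassifier only supports training sets whose samples all have the same size (only time series of equal length would be accepted, which is not my case), but my teacher told me to use KNeighborsClassifier. So I don't know what to do...
Any ideas?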
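Two (or one...) options, as far as I know:
- Precompute the distances (KNeighborsClassifier does not seem to support this directly; other algorithms do, e.g. spectral clustering).
- Pad the data with NaNs to make it "square", and handle the NaNs accordingly in your custom distance function.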
'Square' your data using NaNs
This is option 2. Suppose we have the following data, where each row represents a time series:
import numpy as np
series = [
[1,2,3,4],
[1,2,3],
[1],
[1,2,3,4,5,6,7,8]
]
We simply make the data square by padding it with NaNs:
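def make_square(jagged):
    # Careful: this mutates the series list of lists in place
    max_cols = max(map(len, jagged))
    for row in jagged:
        # Each None becomes nan when the array is cast to float
        row.extend([None] * (max_cols - len(row)))
    # The np.float alias was removed from recent NumPy; builtin float is equivalent
    return np.array(jagged, dtype=float)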
make_square(series)
array([[ 1., 2., 3., 4., nan, nan, nan, nan],
[ 1., 2., 3., nan, nan, nan, nan, nan],
[ 1., nan, nan, nan, nan, nan, nan, nan],
[ 1., 2., 3., 4., 5., 6., 7., 8.]])
Now the data "fits" into the algorithm. You only have to adapt your distance function to account for the NaNs.
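As a rough illustration of what "accounting for the NaNs" could look like, here is a minimal sketch in Python 3. The names `dtw_distance` and `nan_aware_dtw` are hypothetical helpers written for this example, not part of any library, and the DTW shown is a plain textbook dynamic-programming version with absolute difference as the local cost; swap in whatever DTW routine you actually use. The wrapper simply strips the NaN padding from each row before comparing them.

```python
import numpy as np


def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) dynamic time warping (illustrative helper)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local cost: absolute difference
            cost[i, j] = d + min(cost[i - 1, j],      # step in a only
                                 cost[i, j - 1],      # step in b only
                                 cost[i - 1, j - 1])  # step in both
    return cost[n, m]


def nan_aware_dtw(row1, row2):
    """Drop the NaN padding from both rows, then compare what is left."""
    row1 = np.asarray(row1, dtype=float)
    row2 = np.asarray(row2, dtype=float)
    return dtw_distance(row1[~np.isnan(row1)], row2[~np.isnan(row2)])


# Two padded rows taken from the squared array above
r0 = np.array([1, 2, 3, 4, np.nan, np.nan, np.nan, np.nan])
r1 = np.array([1, 2, 3, np.nan, np.nan, np.nan, np.nan, np.nan])
print(nan_aware_dtw(r0, r1))
```

One caveat: depending on the scikit-learn version, `fit` may reject input that contains NaN before your distance function is ever called, so check this on your installation.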
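Precompute and use a cache function
Oh, we can probably do option 1 as well (assuming you have N time series):
- Precompute the distances into an (N, N) distance matrix D
- Create an (N, 1) matrix that is just the range [0, N) (i.e., the index of each series in the distance matrix)
- Create a distance function wrapper
- Use this wrapper as the distance function
The wrapper function: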
def wrapper(row1, row2):
    # might have to fiddle a bit here, but i think this retrieves the indices.
    i1, i2 = row1[0], row2[0]
    return D[i1, i2]
OK, I hope that's clear.
Full example
#!/usr/bin/env python2.7
# encoding: utf-8
'''
'''
from mlpy import dtw_std # I dont know if you are using this one: it doesnt matter.
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
# Example data
series = [
[1, 2, 3, 4],
[1, 2, 3, 4],
[1, 2, 3, 4],
[1, 2, 3],
[1],
[1, 2, 3, 4, 5, 6, 7, 8],
[1, 2, 5, 6, 7, 8],
[1, 2, 4, 5, 6, 7, 8],
]
# I dont know.. these seemed to make sense to me!
y = np.array([
0,
0,
0,
0,
1,
2,
2,
2
])
# Compute the distance matrix
N = len(series)
D = np.zeros((N, N))
for i in range(N):
    for j in range(i+1, N):
        # DTW is symmetric, so fill both halves of the matrix at once
        D[i, j] = dtw_std(series[i], series[j])
        D[j, i] = D[i, j]
print D
# Create the fake data matrix: just the indices of the timeseries
X = np.arange(N).reshape((N, 1))
# Create the wrapper function that returns the correct distance
def wrapper(row1, row2):
    # cast to int to prevent warnings: sklearn converts our integer indices to floats.
    i1, i2 = int(row1[0]), int(row2[0])
    return D[i1, i2]
# Only the ball_tree algorithm seems to accept a custom function
knn = KNeighborsClassifier(weights='distance', algorithm='ball_tree', metric='pyfunc', func=wrapper)
knn.fit(X, y)
print knn.kneighbors(X[0])
# (array([[ 0., 0., 0., 1., 6.]]), array([[1, 2, 0, 3, 4]]))
print knn.predict(X)
# [0 0 0 0 1 2 2 2]
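The example above targets Python 2.7 and an old scikit-learn API. As far as I know, newer scikit-learn releases no longer accept `metric='pyfunc'` with `func=`, but `KNeighborsClassifier` does accept `metric='precomputed'`, which makes the index-wrapper trick unnecessary. Below is a minimal Python 3 sketch under those assumptions. `dtw_distance` is a hypothetical pure-NumPy stand-in for `dtw_std` (so its values may differ from mlpy's), and `n_neighbors=1` is chosen only to keep the example small.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def dtw_distance(a, b):
    """Textbook dynamic time warping; illustrative stand-in for mlpy's dtw_std."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]


# Same example data and labels as in the full example
series = [
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [1, 2, 3],
    [1],
    [1, 2, 3, 4, 5, 6, 7, 8],
    [1, 2, 5, 6, 7, 8],
    [1, 2, 4, 5, 6, 7, 8],
]
y = np.array([0, 0, 0, 0, 1, 2, 2, 2])

# Pairwise (N, N) distance matrix between the training series
N = len(series)
D = np.zeros((N, N))
for i in range(N):
    for j in range(i + 1, N):
        D[i, j] = D[j, i] = dtw_distance(series[i], series[j])

# With metric='precomputed', fit() takes the distance matrix itself
knn = KNeighborsClassifier(n_neighbors=1, metric="precomputed")
knn.fit(D, y)
train_pred = knn.predict(D)
print(train_pred)

# A new series is described by its distances to every training series:
# one row of shape (1, N)
query = [1, 2, 3, 4, 5]
d_query = np.array([[dtw_distance(query, s) for s in series]])
query_pred = knn.predict(d_query)
print(query_pred)
```

The series never have to be padded here, because scikit-learn only ever sees the distances. The cost is that every prediction needs the distances from the new series to all N training series.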