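I'm working with the California housing dataset in scikit-learn. I want to engineer two binary features: "within 10 km of San Francisco" and "within 10 km of Los Angeles". I wrote a custom transformer that works fine on its own, but it throws a TypeError when I put it inside a ColumnTransformer. Here is the code: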
from math import radians
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.metrics.pairwise import haversine_distances
from sklearn.datasets import fetch_california_housing
import numpy as np
import pandas as pd

# Import data into DataFrame
data = fetch_california_housing()
X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = data['target']

# Custom transformer for 'Latitude' and 'Longitude' cols
class NearCity(BaseEstimator, TransformerMixin):
    def __init__(self, distance=10):
        self.la = (34.05, -118.24)
        self.sf = (37.77, -122.41)
        self.dis = distance

    def calc_dist(self, coords_1, coords_2):
        coords_1 = [radians(_) for _ in coords_1]
        coords_2 = [radians(_) for _ in coords_2]
        result = haversine_distances([coords_1, coords_2])[0,-1]
        return result * 6_371

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        dist_to_sf = np.apply_along_axis(self.calc_dist, 1, X, coords_2=self.sf)
        dist_to_sf = (dist_to_sf < self.dis).astype(int)

        dist_to_la = np.apply_along_axis(self.calc_dist, 1, X, coords_2=self.la)
        dist_to_la = (dist_to_la < self.dis).astype(int)

        X_trans = np.column_stack((X, dist_to_sf, dist_to_la))
        return X_trans

ct = ColumnTransformer([('near_city', NearCity(), ['Latitude', 'Longitude'])],
                       remainder='passthrough')
ct.fit_transform(X)
#> /Users/.../anaconda3/envs/data3/lib/python3.7/site-packages/sklearn/base.py:197: FutureWarning: From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.
#> FutureWarning)
#> Traceback (most recent call last):
#> <ipython-input-13-603f6cd4afd3> in transform(self, X)
#> 17 def transform(self, X):
#> 18 dist_to_sf = np.apply_along_axis(self.calc_dist, 1, X, coords_2=self.sf)
#> ---> 19 dist_to_sf = (dist_to_sf < self.dis).astype(int)
#> 20
#> 21 dist_to_la = np.apply_along_axis(self.calc_dist, 1, X, coords_2=self.la)
#> TypeError: '<' not supported between instances of 'float' and 'NoneType'
Created on 2020-04-23 by the reprexpy package
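The problem is that the self.dis attribute doesn't persist. If I instantiate the transformer on its own, there is no problem: self.dis = distance = 10. But inside the ColumnTransformer it ends up as NoneType. Strangely, if I just hard-code self.dis = 10, it works.
Any idea what's going on?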
Session info --------------------------------------------------------------------
Platform: Darwin-18.7.0-x86_64-i386-64bit (64-bit)
Python: 3.7
Date: 2020-04-23
Packages ------------------------------------------------------------------------
numpy==1.18.1
pandas==1.0.1
reprexpy==0.3.0
scikit-learn==0.22.1
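It turns out the problem lies in sklearn.base:
deep_items = value.get_params().items()
The get_params() function inspects the __init__ parameters to work out what the class parameters are, and then assumes each one is stored in an instance attribute of the same name. ColumnTransformer clones its transformers before fitting, and a clone is rebuilt from whatever get_params() returns. Since my class had no self.distance attribute, get_params() reported distance as None (hence the FutureWarning above), and the clone was constructed with distance=None.
So I can fix this by changing my __init__ method to: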
def __init__(self, distance=10):
    self.la = (34.05, -118.24)
    self.sf = (37.77, -122.41)
    self.distance = distance  # <-- give same name
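To see why the attribute name matters, here is a simplified, standard-library-only sketch of the mechanism as I understand it. MiniEstimator, mini_clone, Broken and Fixed are hypothetical names made up for illustration; this is not scikit-learn's actual code, only an imitation of the 0.22-era behavior where a missing attribute is reported as None.

```python
import inspect
import warnings


class MiniEstimator:
    """Hypothetical stand-in for BaseEstimator (not scikit-learn's real code)."""

    def get_params(self):
        # Parameter names come from the __init__ signature, not from the instance.
        names = [
            name
            for name in inspect.signature(type(self).__init__).parameters
            if name != "self"
        ]
        params = {}
        for name in names:
            if not hasattr(self, name):
                # Assumed imitation of the 0.22-era behavior: warn, then use None.
                warnings.warn(f"parameter {name!r} is not an attribute", FutureWarning)
            params[name] = getattr(self, name, None)
        return params


def mini_clone(estimator):
    """Rebuild an estimator from the parameters it reports about itself."""
    return type(estimator)(**estimator.get_params())


class Broken(MiniEstimator):
    def __init__(self, distance=10):
        self.dis = distance  # attribute name differs from the parameter name


class Fixed(MiniEstimator):
    def __init__(self, distance=10):
        self.distance = distance  # attribute name matches the parameter name


with warnings.catch_warnings():
    warnings.simplefilter("ignore", FutureWarning)
    broken_copy = mini_clone(Broken())
fixed_copy = mini_clone(Fixed())

print(broken_copy.dis)      # None
print(fixed_copy.distance)  # 10
```

The Broken copy loses its value because the clone is built with distance=None, which matches the TypeError in the question. It would also explain why hard-coding self.dis = 10 appeared to work: the constructor then ignores whatever distance it is given.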
Many thanks to one of my colleagues, who spotted this!