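I'm using sklearn's GridSearchCV in an IPython Notebook with n_jobs = 4 to select model parameters in parallel.
It worked fine until I added a custom transformer to the pipeline. As soon as I add one, the search "hangs": the process never finishes, even though CPU usage drops to zero.
When I set n_jobs = 1, it works fine even with the custom transformer.
Here is code that reproduces the problem (copy and paste it into an IPython Notebook cell):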
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline

iris = load_iris()
X = iris["data"]
y = iris["target"]


class DummyTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


cv = GridSearchCV(estimator=Pipeline(steps=[('dummy', DummyTransformer()),
                                            ('rf', RandomForestClassifier())]),
                  param_grid={"rf__n_estimators": [10, 100]},
                  scoring="f1_weighted",
                  cv=10,
                  n_jobs=2)  # n_jobs = 1 works fine, but setting n_jobs = 2 makes the script run forever... :-(

cv.fit(X, y)
cv.grid_scores_
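With n_jobs=1 it works; with n_jobs > 1 it never finishes.
I'm using the IPython Notebook that ships with the Anaconda distribution: IPython Notebook v3.2, Python v3.4 on Windows 8 x64.
P.S.: Here is a gist of the whole notebook: https://gist.github.com/anonymous/95b65991e96f5361404c
P.P.S.: I just noticed that when the code hangs, the "ipython notebook" process prints the following error in the console window: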
Process SpawnPoolWorker-12:
Traceback (most recent call last):
  File "C:\Anaconda3\lib\multiprocessing\process.py", line 254, in _bootstrap
    self.run()
  File "C:\Anaconda3\lib\multiprocessing\process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Anaconda3\lib\multiprocessing\pool.py", line 108, in worker
    task = get()
  File "C:\Anaconda3\lib\site-packages\sklearn\externals\joblib\pool.py", line 363, in get
    return recv()
  File "C:\Anaconda3\lib\multiprocessing\connection.py", line 251, in recv
    return ForkingPickler.loads(buf.getbuffer())
AttributeError: Can't get attribute 'DummyTransformer' on <module '__main__' (built-in)>
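After some googling, I found the following sklearn issue: https://github.com/scikit-learn/scikit-learn/issues/2889
There, amueller says:
"Try not defining the metric in the notebook, but in a separate file and import it. I'd think that would fix it."
Putting DummyTransformer into utils.py and using "from utils import *" in the notebook did indeed "fix" it. Still, I would rather call that a workaround.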
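My understanding of why the separate file helps, sketched with the standard library only: pickle stores an instance's class by reference (module name plus class name), so a worker process has to be able to import that module by name. A class that lives in the notebook's `__main__` cannot be looked up by a freshly spawned worker, which matches the AttributeError above. In the sketch below, the module name `my_transformers` and the temporary directory are assumptions made for illustration (the actual file was utils.py), and the class is a plain stand-in without the sklearn base classes so that it runs without sklearn installed.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Hypothetical module name for this sketch; the post itself used utils.py.
MODULE_NAME = "my_transformers"

# Stand-in for the transformer from the post. The sklearn base classes are
# left out on purpose so the sketch needs only the standard library.
MODULE_SOURCE = '''
class DummyTransformer:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X
'''

# Write the module to a temporary directory and make that directory importable.
module_dir = tempfile.mkdtemp()
Path(module_dir, MODULE_NAME + ".py").write_text(MODULE_SOURCE)
sys.path.insert(0, module_dir)
importlib.invalidate_caches()

# Import the class from the file instead of defining it interactively.
module = importlib.import_module(MODULE_NAME)
DummyTransformer = module.DummyTransformer

# pickle records the module name and class name, not the class body.
payload = pickle.dumps(DummyTransformer())
print(DummyTransformer.__module__)        # module the unpickler will import
print(MODULE_NAME.encode() in payload)    # whether the module name is in the payload

# The round trip succeeds because the module can be imported by name.
restored = pickle.loads(payload)
print(type(restored) is DummyTransformer)
```

In the notebook itself, the same idea is what "from utils import *" achieves: the class then reports `utils` rather than `__main__` as its module, and the workers can import it, provided utils.py is somewhere they can find it.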
If anyone has a better or real solution, please add an answer!