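When I run the example code, I get the following error: "RuntimeError: A pipeline has not yet been optimized. Please call fit() first."
I have a problem with TPOT automated machine learning in Python. I am trying to follow an example: Dataset 2: Mushroom Classification (https://towardsdatascience.com/tpot-automated-machine-learning-in-python-4c063b3e5de9)
Source code: https://www.kaggle.com/discdiver/tpot-mushroom-classification-task/
I tried moving the tpot.fit(X_train, y_train) call, but that did not solve the problem.
Libraries: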
import time
import gc
import pandas as pd
import numpy as np
import seaborn as sns
import timeit
import plotly.offline as py
import plotly.graph_objs as go
py.init_notebook_mode(connected=True)
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(font_scale=1.5, palette="colorblind")
import category_encoders
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier
# Read data
df_cogumelo = pd.read_csv('agaricus-lepiota.csv')
# Visualization
pd.options.display.max_columns = 200
pd.options.display.width = 200
# separate out X
X = df_cogumelo.reindex(columns=[x for x in df_cogumelo.columns.values if x != 'class'])
X = X.apply(LabelEncoder().fit_transform)
# separate out y
y = df_cogumelo.reindex(columns=['class'])
print(y['class'].value_counts())
y = np.ravel(y) # flatten the y array
y = LabelEncoder().fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25, random_state=10)
print(X_train.describe())
print("\n\n\n")
print(X_train.info())
# generation and population_size determine how many populations are made.
tpot = TPOTClassifier(verbosity=3,
                      scoring="accuracy",
                      random_state=10,
                      periodic_checkpoint_folder="tpot_mushroom_results",
                      n_jobs=-1,
                      generations=2,
                      population_size=10,
                      use_dask=True)
times = []
scores = []
winning_pipes = []
# run several fits
for x in range(10):
    start_time = timeit.default_timer()
    tpot.fit(X_train, y_train)
    elapsed = timeit.default_timer() - start_time
    times.append(elapsed)
    winning_pipes.append(tpot.fitted_pipeline_)
    scores.append(tpot.score(X_test, y_test))
    tpot.export('tpot_mushroom.py')
# output results
times = [time/60 for time in times]
print('Times:', times)
print('Scores:', scores)
print('Winning pipelines:', winning_pipes)
# The expected result is as follows:
# https://www.kaggle.com/discdiver/tpot-mushroom-classification-task/
Removing "use_dask=True" solved my error.
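To make that switch easy to flip while debugging, the constructor arguments from the question can be collected in a dict. This is only a minimal sketch: `tpot_kwargs` is a hypothetical helper, not part of TPOT, and the values are simply the ones used in the question. It does not explain why `use_dask=True` failed here.

```python
def tpot_kwargs(use_dask=False):
    """Return the question's TPOTClassifier settings as a keyword dict.

    Hypothetical helper: use_dask is left out entirely unless requested,
    so the classifier falls back to its default behaviour.
    """
    kwargs = dict(
        verbosity=3,
        scoring="accuracy",
        random_state=10,
        periodic_checkpoint_folder="tpot_mushroom_results",
        n_jobs=-1,
        generations=2,
        population_size=10,
    )
    if use_dask:
        kwargs["use_dask"] = True
    return kwargs


# Usage (requires tpot to be installed):
# from tpot import TPOTClassifier
# tpot = TPOTClassifier(**tpot_kwargs())   # same settings, without use_dask
```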
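Your problem is not the code, it is your data. The mushroom dataset has no header row. Open the file and insert a new first line that labels the columns (the names don't matter), making sure the last column is named "class" (lowercase c). That should fix it. If you look at your output, printing the y['class'] value counts gives you nothing. If you have already added the labels correctly, please post the full stack trace.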
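If you would rather not edit the file by hand, the header row can be added with a few lines of standard-library Python. This is a rough sketch under stated assumptions: `add_header_row` and the `col_0`, `col_1`, ... names are made up for illustration, and the label is placed in the last column as described above.

```python
import csv


def add_header_row(src, dst, label_name="class"):
    """Copy a header-less CSV from src to dst, writing a header row first.

    The generated names (col_0, col_1, ...) are placeholders; only the
    last column name, label_name, is relied on by the question's code.
    """
    with open(src, newline="") as f:
        rows = list(csv.reader(f))
    if not rows:
        raise ValueError(f"{src} is empty")

    n_cols = len(rows[0])
    header = [f"col_{i}" for i in range(n_cols - 1)] + [label_name]

    with open(dst, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)   # new first line with the column labels
        writer.writerows(rows)    # original data, unchanged
    return header
```

After writing the new file, pointing `pd.read_csv` at it should give a DataFrame with a real `class` column. Without that column, `df.reindex(columns=['class'])` returns a column filled with NaN instead of raising an error, which is why the value counts printed nothing.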