How to perform a simple grid search with Apache Spark



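I tried using Scikit Learn's GridSearch class to tune the hyperparameters of my logistic regression algorithm.

However, even when running multiple jobs in parallel, GridSearch takes days to finish unless you are tuning only one parameter. I thought of using Apache Spark to speed this up, but I have two questions.

  • Do you actually need multiple machines to distribute the workload in order to use Apache Spark? For example, if you only have one laptop, is it pointless to use Apache Spark?

  • Is there a simple way to use Scikit Learn's GridSearch in Apache Spark?

I have read the documentation, but it talks about running parallel workers on an entire machine learning pipeline, whereas I only want to use it for parameter tuning.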

Imports

import datetime
%matplotlib inline
import pylab
import pandas as pd
import math
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.pylab as pylab
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn import datasets, tree, metrics, model_selection
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from sklearn.linear_model import LogisticRegression, LinearRegression, Perceptron
from sklearn.feature_selection import SelectKBest, chi2, VarianceThreshold, RFE
from sklearn.svm import SVC
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split is already imported from sklearn.model_selection above
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext()
from datetime import datetime as dt
import scipy
import itertools
ucb_w_reindex = pd.read_csv('clean_airbnb.csv')
ucb = pd.read_csv('clean_airbnb.csv')
pylab.rcParams[ 'figure.figsize' ] = 15 , 10
plt.style.use("fivethirtyeight")
new_style = {'grid': False}
plt.rc('axes', **new_style)

Algorithm hyperparameter tuning

X = ucb.drop('country_destination', axis=1)
y = ucb['country_destination'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state=42, stratify=y)
knn = KNeighborsClassifier()
parameters = {'leaf_size': range(1, 100), 'n_neighbors': range(1, 10), 'weights': ['uniform', 'distance'], 
'algorithm': ['kd_tree', 'ball_tree', 'brute', 'auto']}

# ======== What I want to do in Apache Spark ========= #
# %%time  # notebook cell magic: only valid as the first line of its own cell
parameters = {'n_neighbors': range(1, 100)}
clf1 = GridSearchCV(estimator=knn, param_grid=parameters, n_jobs=5).fit(X_train, y_train)
best = clf1.best_estimator_
# ==================================================== #

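You can use a library called spark-sklearn to run distributed parameter sweeps. You are right that you need either a cluster of machines or a single multi-CPU machine to get a parallel speedup.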
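For illustration, here is a minimal sketch of what that could look like for the KNN search in the question. It assumes spark-sklearn is installed alongside a scikit-learn version it supports (as far as I know the project is no longer actively developed, so a recent scikit-learn may not work), and that `sc` is the SparkContext created in the imports. The helper names `knn_param_grid` and `spark_grid_search` are made up for this example.

```python
def knn_param_grid(max_neighbors=99):
    """Same single-parameter grid as in the question: n_neighbors = 1..99."""
    return {"n_neighbors": list(range(1, max_neighbors + 1))}


def spark_grid_search(sc, X_train, y_train, param_grid, cv=3):
    """Run a KNN grid search whose candidate fits are distributed by Spark.

    `sc` is an existing pyspark.SparkContext. The imports are done lazily so
    this module can still be loaded on a machine without spark-sklearn.
    """
    from sklearn.neighbors import KNeighborsClassifier
    # Assumption: spark-sklearn is installed (pip install spark-sklearn)
    # next to a scikit-learn version that it supports.
    from spark_sklearn import GridSearchCV

    # Same arguments as scikit-learn's GridSearchCV, with the SparkContext first.
    search = GridSearchCV(sc, KNeighborsClassifier(), param_grid, cv=cv)
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_params_
```

Calling `spark_grid_search(sc, X_train, y_train, knn_param_grid())` would then stand in for the `GridSearchCV(...).fit(...)` line in the question. On a single laptop, creating the context with `pyspark.SparkContext(master="local[*]")` lets Spark use all local cores, so the speedup is bounded by the number of cores, much like `n_jobs`.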

Hope this helps,

Roope - Microsoft MMLSpark Team
