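I am trying to use multiprocessing to train several models with different algorithms at the same time, for example training a Naive Bayes model and a Random Forest model in parallel on the same dataset. I am using concurrent.futures.ProcessPoolExecutor() to do this, but I am running into an error.
Here is my code: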
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score , f1_score , recall_score, precision_score
import time
import concurrent.futures
df = pd.read_csv(".DATAheart.csv")
X = df.iloc[:,:-1] # We get all but the output column
y = pd.DataFrame(df["output"]) # We get output column
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X,y, test_size = 0.15 , random_state = 53)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
if __name__ == '__main__':
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.linear_model import SGDClassifier
    start = time.perf_counter()
    # list of algorithms names
    algo_name = [BernoulliNB, RandomForestClassifier, SVC, SGDClassifier]
    # function to train the model and print the accuracy
    def train(algo_name) :
        model = algo_name().fit(x_train, y_train)
        y_pred = model.predict(x_test)
        acc = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)
        prec = precision_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)
        print("--------------------------------")
        print("Accuracy: ", round(accuracy_score(y_test, y_pred)*100, 2), "%")
        print("F1 accuracy: ", round(f1_score(y_test, y_pred)*100, 2), "%")
        print("Precision: ", round(precision_score(y_test, y_pred)*100, 2), "%")
        print("Recall : ", round(recall_score(y_test, y_pred)*100, 2), "%")
        print("--------------------------------")
    # run the train function with every item in the algo_name list parallel to each other
    with concurrent.futures.ProcessPoolExecutor() as executor:
        executor.map(train, algo_name)
    end = time.perf_counter()
    print(f'Program runtime is {round((end - start) * 1000 , 2)} ms')
The error I get is:
Process SpawnProcess-2:
Process SpawnProcess-4:
Process SpawnProcess-1:
Process SpawnProcess-3:
It is worth mentioning that when I change concurrent.futures.ProcessPoolExecutor() to concurrent.futures.ThreadPoolExecutor(), the program runs fine, but it is no faster than running the models sequentially.
I rewrote it slightly as a minimal reproducible example:
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import time
import concurrent.futures
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=10_000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
if __name__ == '__main__':
    start = time.perf_counter()
    algo_name = [BernoulliNB, RandomForestClassifier, SVC, SGDClassifier]
    def train(algo_name) :
        model = algo_name().fit(X_train, y_train)
        y_pred = model.predict(X_test)
        print(model, accuracy_score(y_test, y_pred))
    with concurrent.futures.ProcessPoolExecutor() as executor:
        executor.map(train, algo_name)
    end = time.perf_counter()
    print(f'Program runtime is {round((end - start) * 1000 , 2)} ms')
This produces the output:
BernoulliNB() 0.9196
SGDClassifier() 0.94
SVC() 0.9652
RandomForestClassifier() 0.9736
Program runtime is 1978.72 ms
This was run with Python 3.10.6 on an Ubuntu machine, with scikit-learn==1.2.0.
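One possible reason the example runs cleanly here but not in the question is the process start method. The SpawnProcess-N names in the error output are the names multiprocessing gives to workers created with the "spawn" start method (the default on Windows and macOS), whereas Python 3.10 on Linux defaults to "fork". Under "spawn", each worker re-imports the main module with the `if __name__ == '__main__':` block skipped, so a function defined inside that block cannot be found when the worker tries to load it. The following is a minimal sketch, assuming that this is the cause (the truncated traceback does not confirm it): `train` is moved to module level, and the results of `executor.map` are consumed so that any exception raised in a worker is re-raised in the parent. The helper name `train_all` and the smaller `n_samples` are choices made for this illustration only.

```python
import concurrent.futures
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC

# Module-level data: under "spawn", every worker re-imports this module and
# rebuilds the data, so random_state is fixed to keep it identical everywhere.
X, y = make_classification(n_samples=2_000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)


def train(algo):
    """Fit one estimator class and return (class name, test accuracy)."""
    model = algo().fit(X_train, y_train)
    return algo.__name__, accuracy_score(y_test, model.predict(X_test))


def train_all(algos):
    """Train each estimator class in its own worker process (hypothetical helper)."""
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # list() consumes the lazy result iterator, so an exception raised
        # inside a worker is re-raised here rather than going unnoticed.
        return list(executor.map(train, algos))


if __name__ == "__main__":
    algos = [BernoulliNB, RandomForestClassifier, SVC, SGDClassifier]
    start = time.perf_counter()
    for name, acc in train_all(algos):
        print(f"{name}: {acc:.4f}")
    print(f"Program runtime is {(time.perf_counter() - start) * 1000:.2f} ms")
```

Returning the scores instead of printing them inside the worker also keeps the output in the same order as the input list, since `executor.map` yields results in submission order.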
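scikit-learn has its own mechanisms for handling parallelism (see, for example, its documentation on parallelism, resource management, and configuration), and these may be interfering.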
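One way the two layers of parallelism could interfere is oversubscription: several worker processes each starting a full set of native OpenMP or BLAS threads. As a hedged sketch rather than a confirmed fix, the worker below caps those native thread pools at one thread while it fits, using `threadpool_limits` from threadpoolctl (a scikit-learn dependency). The function name `train_single_threaded`, the two estimators, and the dataset size are assumptions chosen for illustration; whether this changes the runtime depends on which estimators actually use native threads.

```python
from concurrent.futures import ProcessPoolExecutor

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from threadpoolctl import threadpool_limits

# Small synthetic dataset, fixed seed so every worker sees the same data.
X, y = make_classification(n_samples=2_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


def train_single_threaded(algo):
    """Fit one estimator with native (OpenMP/BLAS) thread pools capped at 1."""
    with threadpool_limits(limits=1):
        model = algo().fit(X_train, y_train)
        # score() on a classifier returns the mean accuracy on the test set.
        return algo.__name__, model.score(X_test, y_test)


if __name__ == "__main__":
    algos = [RandomForestClassifier, SGDClassifier]
    # One worker process per estimator; each worker limits its own threads.
    with ProcessPoolExecutor(max_workers=len(algos)) as executor:
        for name, acc in executor.map(train_single_threaded, algos):
            print(name, round(acc, 4))
```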
System details from sklearn.show_versions(), in case they are useful:
System:
    python: 3.10.6 (main, Oct 24 2022, 16:07:47) [GCC 11.2.0]
    executable: /home/hayesall/miniconda3/envs/srlearn/bin/python
    machine: Linux-5.15.0-56-generic-x86_64-with-glibc2.35

Python dependencies:
    sklearn: 1.2.0
    pip: 22.2.2
    setuptools: 65.5.0
    numpy: 1.23.4
    scipy: 1.9.3
    Cython: None
    pandas: 1.5.2
    matplotlib: 3.6.2
    joblib: 1.2.0
    threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
    user_api: openmp
    internal_api: openmp
    prefix: libgomp
    filepath: /home/hayesall/miniconda3/envs/srlearn/lib/python3.10/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
    version: None
    num_threads: 8

    user_api: blas
    internal_api: openblas
    prefix: libopenblas
    filepath: /home/hayesall/miniconda3/envs/srlearn/lib/python3.10/site-packages/numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so
    version: 0.3.20
    threading_layer: pthreads
    architecture: Haswell
    num_threads: 8

    user_api: blas
    internal_api: openblas
    prefix: libopenblas
    filepath: /home/hayesall/miniconda3/envs/srlearn/lib/python3.10/site-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so
    version: 0.3.18
    threading_layer: pthreads
    architecture: Haswell
    num_threads: 8