sklearn.datasets.load_files and NumPy files - do they play well together?



I have a large collection of data laid out in the structure that sklearn.datasets.load_files expects. I want to load the dataset and fit a basic classification model. I thought something like this would do the job:

import numpy as np
import sklearn.datasets
from sklearn.ensemble import RandomForestClassifier
dataset = sklearn.datasets.load_files("data", load_content = 'False') # my dataset does not fit in memory
model = RandomForestClassifier(n_estimators=100)
model.fit(dataset.data, dataset.target)

But I got this error instead:

ValueError: could not convert string to float: b"\x93NUMPY\x01\x00v\x00{'descr': '<f8', 'fortran_order': False, 'shape': (115000,), }                             \n\x00\x00\x00 \xf2zY?\x00\x00\x00\x00\xd8pp?\x00\x00\x00@6\xbc\x88?\x00\x00\x00@\xad9e? ..."

So load_files clearly doesn't know what to do with NumPy files. What are my options?

I'm currently converting all the NumPy files to text files, but that inflates the data three- or four-fold. Is there a different way to load the data and train a simple model on vectors saved as NumPy files?

I would create a data-loader generator that loads the data in small chunks. With it, you should use a model that supports partial_fit, so you can train batch by batch.
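For example, a minimal sketch of such a generator for the load_files-style folder layout in the question (one sub-folder per class, each holding one .npy feature vector per sample - an assumed layout) could look like this; the function name, the "data" folder and the batch size are only illustrative:

import os
import numpy as np
from sklearn.linear_model import SGDClassifier

def npy_batches(data_dir, batch_size=1000):
    # Yield (X, y) batches from a folder with one sub-folder per class,
    # each containing one .npy feature vector per sample (assumed layout).
    X_buf, y_buf = [], []
    for label, class_name in enumerate(sorted(os.listdir(data_dir))):
        class_dir = os.path.join(data_dir, class_name)
        for file_name in os.listdir(class_dir):
            X_buf.append(np.load(os.path.join(class_dir, file_name)))
            y_buf.append(label)
            if len(X_buf) == batch_size:
                yield np.vstack(X_buf), np.array(y_buf)
                X_buf, y_buf = [], []
    if X_buf:  # flush the last, possibly smaller, batch
        yield np.vstack(X_buf), np.array(y_buf)

clf = SGDClassifier(loss='log')
classes = np.arange(len(os.listdir("data")))  # one label per sub-folder
for X_batch, y_batch in npy_batches("data"):
    clf.partial_fit(X_batch, y_batch, classes=classes)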

My flow looks like this:


import numpy as np
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier


# very large file
FILE_PATH = ...
FEATURES_COLUMNS = ...
TARGET_COLUMN = ...
ALL_CLASSES = ...  # complete set of class labels, known up front
CHUNK_SIZE = 100_000

reader = pd.read_csv(FILE_PATH, chunksize=CHUNK_SIZE, low_memory=False)
clf = SGDClassifier(loss='log')  # a model that supports partial_fit

for batch_number, dataf_chunk in enumerate(reader, start=1):

    # logic to get X (features) and y (target) from the data chunk
    X, y = dataf_chunk[FEATURES_COLUMNS], dataf_chunk[TARGET_COLUMN]

    # split to track model performance
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.1, random_state=42, stratify=y)

    # [custom] preprocessing that supports incremental updates, e.g. HashingVectorizer
    ...

    # train the model batch by batch; classes must be the full label set
    # and identical on every call to partial_fit
    clf.partial_fit(X_train, y_train, classes=ALL_CLASSES)

    print(f"Batch number {batch_number} | Model scores: "
          f"Train score = {clf.score(X_train, y_train):.2%} | "
          f"Test score = {clf.score(X_test, y_test):.2%}")

There is an equivalent of chunked reading in NumPy as well. See Working with big data in python and numpy, not enough ram, how to save partial results on disc?
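A minimal sketch of that idea using np.load with mmap_mode='r' (the file names and chunk size are assumptions) would be:

import numpy as np

CHUNK_SIZE = 100_000

# mmap_mode='r' memory-maps the arrays, so slicing reads only that chunk from disk
X = np.load('features.npy', mmap_mode='r')
y = np.load('labels.npy', mmap_mode='r')

for start in range(0, X.shape[0], CHUNK_SIZE):
    X_chunk = np.asarray(X[start:start + CHUNK_SIZE])  # materialise only this slice
    y_chunk = np.asarray(y[start:start + CHUNK_SIZE])
    # feed the chunk into clf.partial_fit exactly as in the loop above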


Update: I found a similar example in the scikit-learn documentation that uses yield to feed partial training examples.
