Splitting data into training and test sets in federated learning



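I am new to federated learning and am currently experimenting with a model by following the official TFF documentation. I have run into a problem and hope to find an explanation here.

I am using my own dataset. The data is spread across multiple files, and each file represents one client (which is how I plan to structure the model). The dependent and independent variables are already defined.

Now, my question is how to split the data into a training set and a test set within each client (file) in federated learning, as we normally do for a centralized ML model. So far I have implemented the code below. Note that my code is inspired by the official documentation and by this post, which is almost the same as my application, except that it splits the clients themselves into training clients and testing clients, whereas I want to split the data inside each client.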

from collections import OrderedDict

import tensorflow as tf
import tensorflow_federated as tff

dataset_paths = {
    'client_0': '/content/drive/MyDrive/Colab Notebooks/1.csv',
    'client_1': '/content/drive/MyDrive/Colab Notebooks/2.csv',
    'client_2': '/content/drive/MyDrive/Colab Notebooks/3.csv'
}
record_defaults = [int(), int(), int(), int(), float(), float(), float(),
                   float(), float(), float(), int(), int(), float(), float(), int()]

@tf.function
def create_tf_dataset_for_client_fn(dataset_path):
    return tf.data.experimental.CsvDataset(dataset_path,
                                           record_defaults=record_defaults,
                                           header=True)

@tf.function
def add_parsing(dataset):
    def parse_dataset(*x):
        ## 'x' holds the last column; 'y' holds the columns between the first and the last
        return OrderedDict([('x', x[-1]), ('y', x[1:-1])])
    return dataset.map(parse_dataset, num_parallel_calls=tf.data.AUTOTUNE)

source = tff.simulation.datasets.FilePerUserClientData(
    dataset_paths, create_tf_dataset_for_client_fn)
source = source.preprocess(add_parsing)
## Create the dataset from client data
## (note: the index 0-2 evaluates to -2, i.e. the second-to-last client id)
dataset_creation = source.create_tf_dataset_for_client(source.client_ids[0-2])
print(dataset_creation)
>>> <_VariantDataset element_spec=OrderedDict([('x', TensorSpec(shape=(), dtype=tf.int32, name=None)), ('y', (TensorSpec(shape=(), dtype=tf.int32, name=None), TensorSpec(shape=(), dtype=tf.int32, name=None), TensorSpec(shape=(), dtype=tf.int32, name=None), TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(), dtype=tf.int32, name=None)))])>
## Convert the first element to NumPy values (I think this is necessary for splitting into training and test sets)
test = tf.nest.map_structure(lambda x: x.numpy(), next(iter(dataset_creation)))
print(test)
>>> OrderedDict([('x', 1), ('y', (0, 1, 9, 85.0, 7.75, 85.0, 95.0, 75.0, 50.0, 6))])

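My understanding of supervised ML is that the data is split into a training set and a test set, as in the code below. I am not sure how to do this in federated learning, or whether it would even work this way.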

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

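So I would appreciate an explanation of this issue so that I can move on to the training phase.

See this tutorial. You should be able to create two datasets (train and test) based on the clients and their data: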

import tensorflow as tf
import tensorflow_federated as tff
from collections import OrderedDict

record_defaults = [int(), int(), int(), int(), float(), float(), float(), float(), float(), float(), int(), int()]

@tf.function
def create_tf_dataset_for_client_fn(dataset_path):
    return tf.data.experimental.CsvDataset(dataset_path, record_defaults=record_defaults, header=True)

@tf.function
def add_parsing(dataset):
    # Optional parsing step; it is not applied to `source` below,
    # so the printed elements are the raw column tuples.
    def parse_dataset(*x):
        # Last column is the label; the columns between the first and the last are the features.
        return OrderedDict([('label', x[-1]), ('features', x[1:-1])])
    return dataset.map(parse_dataset, num_parallel_calls=tf.data.AUTOTUNE)

def split_train_test(client_ids):
    train, test = [], []
    for x in client_ids:
        d = source.create_tf_dataset_for_client(x)
        # Count the examples of this client.
        d_length = d.reduce(0, lambda x, _: x + 1).numpy()
        # Keep one fixed shuffled order so that take() and skip() see the same sequence.
        d = d.shuffle(d_length, seed=42, reshuffle_each_iteration=False)
        n_train = int(d_length * .8)
        train.append(list(d.take(n_train)))  # first 80% of the shuffled examples
        test.append(list(d.skip(n_train)))   # remaining 20%, disjoint from the training part
    # One list of examples per client, in the order of client_ids.
    return train, test

dataset_paths = {'client1': '/content/client1.csv', 'client2': '/content/client2.csv',
                 'client3': '/content/client2.csv', 'client4': '/content/client2.csv'}
source = tff.simulation.datasets.FilePerUserClientData(
    dataset_paths, create_tf_dataset_for_client_fn)
client_ids = sorted(source.client_ids)
federated_train_data, federated_test_data = split_train_test(client_ids)
# Print the training examples of the first client, one per line.
print(*federated_train_data[0], sep='\n')
(<tf.Tensor: shape=(), dtype=int32, numpy=24>, <tf.Tensor: shape=(), dtype=int32, numpy=17>, <tf.Tensor: shape=(), dtype=int32, numpy=27>, <tf.Tensor: shape=(), dtype=int32, numpy=4>, <tf.Tensor: shape=(), dtype=float32, numpy=0.17308392>, <tf.Tensor: shape=(), dtype=float32, numpy=1.889401>, <tf.Tensor: shape=(), dtype=float32, numpy=1.6235029>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.56010467>, <tf.Tensor: shape=(), dtype=float32, numpy=-1.0171211>, <tf.Tensor: shape=(), dtype=float32, numpy=0.43558818>, <tf.Tensor: shape=(), dtype=int32, numpy=40>, <tf.Tensor: shape=(), dtype=int32, numpy=14>)
(<tf.Tensor: shape=(), dtype=int32, numpy=8>, <tf.Tensor: shape=(), dtype=int32, numpy=32>, <tf.Tensor: shape=(), dtype=int32, numpy=14>, <tf.Tensor: shape=(), dtype=int32, numpy=11>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.91828436>, <tf.Tensor: shape=(), dtype=float32, numpy=0.29887632>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.4598584>, <tf.Tensor: shape=(), dtype=float32, numpy=-1.1088414>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.4057387>, <tf.Tensor: shape=(), dtype=float32, numpy=-2.1537204>, <tf.Tensor: shape=(), dtype=int32, numpy=15>, <tf.Tensor: shape=(), dtype=int32, numpy=45>)
(<tf.Tensor: shape=(), dtype=int32, numpy=11>, <tf.Tensor: shape=(), dtype=int32, numpy=17>, <tf.Tensor: shape=(), dtype=int32, numpy=17>, <tf.Tensor: shape=(), dtype=int32, numpy=2>, <tf.Tensor: shape=(), dtype=float32, numpy=0.93560874>, <tf.Tensor: shape=(), dtype=float32, numpy=-2.4382026>, <tf.Tensor: shape=(), dtype=float32, numpy=-1.7638668>, <tf.Tensor: shape=(), dtype=float32, numpy=0.65431964>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.7130539>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.96356>, <tf.Tensor: shape=(), dtype=int32, numpy=15>, <tf.Tensor: shape=(), dtype=int32, numpy=18>)
(<tf.Tensor: shape=(), dtype=int32, numpy=42>, <tf.Tensor: shape=(), dtype=int32, numpy=27>, <tf.Tensor: shape=(), dtype=int32, numpy=34>, <tf.Tensor: shape=(), dtype=int32, numpy=8>, <tf.Tensor: shape=(), dtype=float32, numpy=0.3965425>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.2588629>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.84179455>, <tf.Tensor: shape=(), dtype=float32, numpy=0.114052325>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.9591451>, <tf.Tensor: shape=(), dtype=float32, numpy=0.94621265>, <tf.Tensor: shape=(), dtype=int32, numpy=28>, <tf.Tensor: shape=(), dtype=int32, numpy=7>)
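
To inspect the splitting logic on its own, without TFF or the CSV files, here is a minimal sketch in plain Python. It is not TFF code: `split_per_client`, `toy_clients` and the record layout are hypothetical stand-ins for the per-client datasets. The point is that each client's records are shuffled once and then cut at a single index, so the two parts cannot overlap.

```python
import random


def split_per_client(client_records, test_fraction=0.2, seed=42):
    """Split every client's records into disjoint train and test lists.

    client_records: dict mapping a client id to a list of records.
    Returns two dicts (train, test) keyed by the same client ids.
    """
    train, test = {}, {}
    for client_id in sorted(client_records):
        records = list(client_records[client_id])
        # Shuffle once with a fixed seed so both parts come from the same ordering.
        random.Random(seed).shuffle(records)
        n_test = int(len(records) * test_fraction)
        cut = len(records) - n_test
        train[client_id] = records[:cut]  # everything before the cut
        test[client_id] = records[cut:]   # everything after the cut
    return train, test


# Toy stand-in data (an assumption): each record is a (features, label) pair.
toy_clients = {
    'client1': [((i, i * 0.5), i % 2) for i in range(10)],
    'client2': [((i, i * 2.0), i % 2) for i in range(5)],
}

train_split, test_split = split_per_client(toy_clients)
for cid in train_split:
    print(cid, len(train_split[cid]), len(test_split[cid]))
```

The same idea carries over to `tf.data`: use one fixed ordering and the same cut index for both `take` and `skip`.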

If you follow the tutorial I linked, you should be able to feed the split data directly into a model wrapped with tff.learning.from_keras_model.
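
Once every client has its own train and test portion, a simulated round in the TFF tutorials generally receives a list with one entry per participating client. The sketch below only shows that bookkeeping in plain Python. All names (`sample_round_clients`, `train_by_client`, `test_by_client`) are hypothetical, and the strings stand in for real per-client datasets.

```python
import random


def sample_round_clients(client_ids, clients_per_round, round_num, seed=0):
    """Pick a reproducible subset of client ids for one simulated round."""
    rng = random.Random(seed + round_num)
    return sorted(rng.sample(sorted(client_ids), clients_per_round))


# Hypothetical per-client splits; strings stand in for real examples or datasets.
train_by_client = {
    'client1': ['c1_train_a', 'c1_train_b'],
    'client2': ['c2_train_a', 'c2_train_b'],
    'client3': ['c3_train_a', 'c3_train_b'],
}
test_by_client = {
    'client1': ['c1_test_a'],
    'client2': ['c2_test_a'],
    'client3': ['c3_test_a'],
}

# Training uses only the train portion of the clients sampled for this round.
selected = sample_round_clients(train_by_client, clients_per_round=2, round_num=1)
round_train_data = [train_by_client[cid] for cid in selected]

# Evaluation here uses the held-out portion of every client.
eval_data = [test_by_client[cid] for cid in sorted(test_by_client)]

print(selected)
print(len(round_train_data), len(eval_data))
```

Keeping the train and test portions in separate per-client containers makes it hard to accidentally evaluate on examples a client has already trained on.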
