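I have a graph defined by 4 matrices: x (node features), y (node labels), edge_index (edge list), and edge_attr (edge features). I want to create a dataset in PyTorch Geometric from this single graph and perform node-level classification. For some reason, simply wrapping these 4 matrices in a data object seems to fail.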
I created a dataset with the following attributes:
Data(edge_attr=[3339730, 1], edge_index=[2, 3339730], x=[6911, 50000], y=[6911, 1])
which represents the graph. If I try to slice this graph, for example:
train_dataset, test_dataset = dataset[:5000], dataset[5000:]
I get the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-11-feb278180c99> in <module>
3 # train_dataset, test_dataset = torch.utils.data.random_split(dataset, [train_size, test_size])
4
----> 5 train_dataset, test_dataset = dataset[:5000], dataset[5000:]
6
7 # Create dataloader for training and test dataset.
~/anaconda3/envs/py38/lib/python3.8/site-packages/torch_geometric/data/data.py in __getitem__(self, key)
92 def __getitem__(self, key):
93 r"""Gets the data of the attribute :obj:`key`."""
---> 94 return getattr(self, key, None)
95
96 def __setitem__(self, key, value):
TypeError: getattr(): attribute name must be string
What am I doing wrong in constructing the data?
For node classification:
Create a custom dataset.
import random
from math import floor

import numpy as np
import pandas as pd
import torch
from torch_geometric.data import Data, InMemoryDataset


class CustomDataset(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super().__init__(root, transform, pre_transform)
        # Load the single processed graph written by process().
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return ['edge_list.csv', 'x.pt', 'y.pt', 'edge_attributes.csv']

    @property
    def processed_file_names(self):
        return ['graph.pt']

    def process(self):
        # Edge list: column 0 holds the target nodes, column 1 the source nodes.
        edge_list = pd.read_csv(self.raw_paths[0], dtype=int)
        target_nodes = edge_list.iloc[:, 0].values
        source_nodes = edge_list.iloc[:, 1].values
        # Stack into one array first; a tensor built from a list of arrays is slow.
        edge_index = torch.tensor(np.stack([source_nodes, target_nodes]), dtype=torch.int64)

        x = torch.load(self.raw_paths[1], map_location=torch.device('cpu'))
        y = torch.load(self.raw_paths[2], map_location=torch.device('cpu'))

        # Make boolean node masks: a random 10% of nodes for training, the rest for testing.
        n = x.shape[0]
        randomassort = list(range(n))
        random.shuffle(randomassort)
        max_train = floor(len(randomassort) * .1)
        train_mask = torch.zeros(n, dtype=torch.bool)
        test_mask = torch.zeros(n, dtype=torch.bool)
        train_mask[torch.tensor(randomassort[:max_train], dtype=torch.long)] = True
        test_mask[torch.tensor(randomassort[max_train:], dtype=torch.long)] = True

        # Edge features. Assumes numeric columns in the same row order as edge_list.csv.
        edge_attributes = pd.read_csv(self.raw_paths[3])
        edge_attr = torch.tensor(edge_attributes.values, dtype=torch.float)

        data = Data(edge_index=edge_index, x=x, y=y, edge_attr=edge_attr,
                    train_mask=train_mask, test_mask=test_mask)
        data, slices = self.collate([data])
        torch.save((data, slices), self.processed_paths[0])
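Then, in the training loop, use the masks when updating the model.
def train():
    ...
    model.train()
    optimizer.zero_grad()
    # Compute the loss on training nodes only. nll_loss expects 1-D class
    # indices, so squeeze y first if it has shape [N, 1].
    F.nll_loss(model()[data.train_mask], data.y[data.train_mask]).backward()
    optimizer.step()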
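The mask-based split and the masked loss can be tried in isolation, without a real graph or model. The following is only a minimal sketch: the toy sizes, the random labels, and the random logits standing in for a GNN's output are all assumptions made for illustration, not part of the original data.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

num_nodes, num_classes = 20, 3  # toy sizes, chosen for illustration
y = torch.randint(0, num_classes, (num_nodes, 1))  # labels shaped [N, 1]

# Random 10% / 90% node split expressed as boolean masks.
perm = torch.randperm(num_nodes)
num_train = int(0.1 * num_nodes)
train_mask = torch.zeros(num_nodes, dtype=torch.bool)
train_mask[perm[:num_train]] = True
test_mask = ~train_mask

# Random logits stand in for the output of a model evaluated on all nodes.
logits = torch.randn(num_nodes, num_classes, requires_grad=True)
log_probs = F.log_softmax(logits, dim=1)

# Loss on training nodes only; squeeze y to the 1-D shape nll_loss expects.
loss = F.nll_loss(log_probs[train_mask], y.squeeze(1)[train_mask])
loss.backward()

# Count nodes per split and check that test nodes received no gradient.
print(int(train_mask.sum()), int(test_mask.sum()))
print(bool((logits.grad[test_mask] == 0).all()))
```

The graph itself is never cut apart here: every node stays in the forward pass (so message passing still sees all neighbours), and the masks only decide which nodes contribute to the loss.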
You cannot slice a torch_geometric.data.Data object, because its __getitem__ is defined as:
def __getitem__(self, key):
    r"""Gets the data of the attribute :obj:`key`."""
    return getattr(self, key, None)
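To see why a slice triggers that TypeError, here is a minimal stand-in that copies only the __getitem__ shown in the traceback. MiniData is a hypothetical class written for this illustration, not the library's Data class, and the toy attribute values are made up.

```python
class MiniData:
    """Hypothetical stand-in that mimics only the __getitem__ shown above."""

    def __init__(self, **attrs):
        for name, value in attrs.items():
            setattr(self, name, value)

    def __getitem__(self, key):
        # The key is treated as an attribute name, so it must be a string.
        return getattr(self, key, None)


data = MiniData(x=[[0.1], [0.2], [0.3]], y=[0, 1, 0])

# A string key works: it looks up the attribute of that name.
by_name = data["y"]

# A slice is passed to getattr() as the attribute name, which raises TypeError.
try:
    data[:2]
    slice_error = None
except TypeError as exc:
    slice_error = exc

print(by_name)
print(type(slice_error).__name__)
```

The default value passed to getattr() only covers a missing attribute; it does not suppress the TypeError raised for a non-string name.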
So it looks like you cannot access edges with __getitem__. However, since what you are trying to do is split the dataset, you can use torch_geometric.utils.train_test_split_edges. Something like:
torch_geometric.utils.train_test_split_edges(dataset, val_ratio=0.1, test_ratio=0)
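It will split the edges of a Data object into positive and negative train/val/test edges, and add the following attributes to the returned Data object: train_pos_edge_index, train_neg_adj_mask, val_pos_edge_index, val_neg_edge_index, test_pos_edge_index, and test_neg_edge_index.
Note that this splits edges rather than nodes, so it targets link prediction; for node-level classification, boolean node masks on the single Data object are the usual approach.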
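As a rough picture of what splitting edges means, the sketch below randomly partitions the columns of an edge_index tensor. It is a simplification under stated assumptions, not the library's implementation: split_edge_columns is a hypothetical helper, the toy ring graph is made up, and negative-edge sampling and undirected-edge handling are left out entirely.

```python
import torch


def split_edge_columns(edge_index, val_ratio=0.1, test_ratio=0.0, seed=0):
    """Hypothetical helper: randomly partition the columns of edge_index."""
    num_edges = edge_index.size(1)
    generator = torch.Generator().manual_seed(seed)
    perm = torch.randperm(num_edges, generator=generator)

    n_val = int(val_ratio * num_edges)
    n_test = int(test_ratio * num_edges)

    # Each column is one edge, so index columns with disjoint chunks of perm.
    val = edge_index[:, perm[:n_val]]
    test = edge_index[:, perm[n_val:n_val + n_test]]
    train = edge_index[:, perm[n_val + n_test:]]
    return train, val, test


# Toy directed ring with 10 edges: 0->1, 1->2, ..., 9->0.
edge_index = torch.tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                           [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]])

train_e, val_e, test_e = split_edge_columns(edge_index, val_ratio=0.1, test_ratio=0.0)
print(train_e.shape, val_e.shape, test_e.shape)
```

Every node keeps its features and label after such a split; only the set of edges differs between the parts, which is why this does not give separate training and test nodes.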