Processing a CSR data file

I have sparse matrix data stored in .npy format. I read the file and store the matrix in CSR format. The following code works perfectly:

import os

import numpy as np
from scipy import sparse

def load_adjacency_matrix_csr(folder: str,
                              filename: str,
                              suffix: str = "npy",
                              row_idx: int = 1,
                              num_rows: int = None,
                              dtype: np.dtype = np.float32) -> sparse.csr_matrix:
    coo_indices = np.load(os.path.join(folder, f"{filename}.{suffix}"))
    rows = coo_indices[row_idx]
    cols = coo_indices[1 - row_idx]
    data = np.ones(len(rows), dtype=dtype)
    num_rows = num_rows or rows.max() + 1
    if num_rows < rows.max() + 1:
        raise ValueError("The number of rows in the file is larger than the specified number of rows.")
    csr_mat = sparse.csr_matrix((data, (rows, cols)), shape=(num_rows, num_rows), dtype=dtype)
    return csr_mat
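For reference, the loader above assumes the .npy file holds a single 2 × nnz integer array, where one line contains the row indices and the other the column indices. A minimal sketch of the same steps with toy data (the array values and the file name `edges.npy` are made up for illustration):

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Toy stand-in for the .npy file: a 2 x nnz index array where, with
# row_idx = 1, line 1 holds row indices and line 0 holds column indices.
coo_indices = np.array([[0, 2, 1],   # coo_indices[1 - row_idx] -> cols
                        [0, 1, 2]])  # coo_indices[row_idx]     -> rows

folder = tempfile.mkdtemp()
np.save(os.path.join(folder, "edges.npy"), coo_indices)

# The same steps the loader performs:
loaded = np.load(os.path.join(folder, "edges.npy"))
rows, cols = loaded[1], loaded[0]
data = np.ones(len(rows), dtype=np.float32)
csr_mat = sparse.csr_matrix((data, (rows, cols)), shape=(3, 3), dtype=np.float32)
```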

Now I read the same file, write it to a new file (_temp), and then read that back. But when I read from this new file (_temp), I get the error `Detected 1 oom-kill event(s)`. The out-of-memory kill happens while creating the CSR matrix. Yet it is exactly the same number of rows/cols and data values as when I read directly from the original file, which succeeds. I can't figure out what changed by writing the new file (the rows/cols and data did not change).

def save_coo_matrix(npx: int,
                    folder: str,
                    filename: str,
                    suffix: str = "npy",
                    row_idx: int = 1,
                    dtype: np.dtype = np.float32) -> None:
    file_path = os.path.join(folder, f"{filename}.{suffix}")
    coo_indices = np.load(file_path)
    rows = coo_indices[row_idx]
    cols = coo_indices[1 - row_idx]
    filename += "_temp"
    file_path = os.path.join(folder, f"{filename}.npz")
    np.savez(file_path, rows=rows, cols=cols)
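One way to rule out the write step itself: `np.savez` stores each keyword argument as a named array in a zip archive and preserves values, shapes, and dtypes exactly, so the arrays can be compared after a round trip. A quick sketch (array values and file name are made up):

```python
import os
import tempfile

import numpy as np

# Toy rows/cols arrays; np.savez stores them under the keyword names.
rows = np.array([0, 1, 2])
cols = np.array([2, 1, 0])
file_path = os.path.join(tempfile.mkdtemp(), "edges_temp.npz")
np.savez(file_path, rows=rows, cols=cols)

with np.load(file_path) as data:
    stored_keys = sorted(data.files)   # names of the arrays in the archive
    rows_back = data["rows"]

# The round trip is lossless: same keys, same values, same dtype.
assert stored_keys == ["cols", "rows"]
np.testing.assert_array_equal(rows_back, rows)
```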
def load_adjacency_matrix_csr(npx: int,
                              folder: str,
                              filename: str,
                              suffix: str = "npy",
                              row_idx: int = 1,
                              num_rows: int = None,
                              dtype: np.dtype = np.float32) -> sparse.csr_matrix:
    file_path = os.path.join(folder, f"{filename}_padded.npz")
    with np.load(file_path) as data:
        rows = data["rows"]
        cols = data["cols"]
    num_rows = max(rows) + 1
    num_cols = max(cols) + 1
    values = np.ones(len(rows))
    # fails here when creating the csr matrix (num_rows is the same value as before)
    csr_mat = sparse.csr_matrix((values, (rows, cols)), shape=(num_rows, num_cols))
    return csr_mat

The file extensions used when writing and reading the data don't seem to match. When you call load_adjacency_matrix_csr, make sure to set the suffix parameter: csr_matrix = load_adjacency_matrix_csr(npx, folder, filename, suffix="npz", ...). Also, npx is not used inside the function, so you can remove that parameter from the function definition.
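Two further differences between the loaders are worth noting (observations about the posted code, not a confirmed diagnosis): the saver writes a `_temp` file while the second loader opens a `_padded` file, and the second loader builds `values = np.ones(len(rows))` without a dtype, which defaults to float64 and therefore needs twice the memory of the float32 data used in the original, working loader. A hedged sketch of a loader that matches the saved file name and keeps float32 (the function name and toy data are assumptions):

```python
import os
import tempfile

import numpy as np
from scipy import sparse

def load_adjacency_matrix_csr_npz(folder: str,
                                  filename: str,
                                  dtype: np.dtype = np.float32) -> sparse.csr_matrix:
    # The name must match what the saver wrote: "<filename>_temp.npz".
    file_path = os.path.join(folder, f"{filename}_temp.npz")
    with np.load(file_path) as data:
        rows = data["rows"]
        cols = data["cols"]
    num_rows = rows.max() + 1           # ndarray.max(), faster than builtin max()
    num_cols = cols.max() + 1
    # Explicit dtype: np.ones defaults to float64, doubling memory use.
    values = np.ones(len(rows), dtype=dtype)
    return sparse.csr_matrix((values, (rows, cols)),
                             shape=(num_rows, num_cols), dtype=dtype)

# Round-trip check with toy data.
folder = tempfile.mkdtemp()
np.savez(os.path.join(folder, "edges_temp.npz"),
         rows=np.array([0, 1, 2]), cols=np.array([2, 1, 0]))
mat = load_adjacency_matrix_csr_npz(folder, "edges")
```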
