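First of all, I know there are many closely related questions, but my first implementation (based on suggestions from those Q&As) is not efficient enough.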

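I'm looking for a way to (significantly) improve my first implementation, which reads a huge (>10000x10000) asymmetric, non-sparse two-dimensional array (matrix) with string indices from an input text file. Also assume that the size of the matrix is not known in advance.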

The structure of the external input file (think of a distance matrix between any two locations) looks like this:

  A   B   C   D   E   F   G
A 0   10  20  30  40  50  60
B 15  0   25  35  45  55  65
C 20  30  0   40  50  60  70
D 25  35  45  0   65  75  85
E 15  20  25  35  0   55  65
F 20  30  40  50  60  0   70
G 35  45  55  65  75  85  0

Currently, I have come up with the following solution:

#include <cstddef>
#include <cstdio>
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

std::map<std::string, std::map<std::string, int>>
ReadDistancesFromFile(const char *name) {
    std::string filename(name);
    std::clog << "Trying to open and read: " << filename << std::endl;
    std::ifstream file(name);
    /// If .is_open() returns false, perror prints the message for the code stored in errno
    if (!file.is_open()) {
        std::perror(("Error while opening file " + filename).c_str());
        return {};  // Nothing to read
    }
    /// Map of maps to save all read distances
    std::map<std::string, std::map<std::string, int>> distances;
    /* 1. Is such a structure (container) efficient for my purpose:
          a) to store data efficiently
          b) to access data using indices quickly?
          c) to update values time after time
          d) insertion/deletion of new elements doesn't happen often */
    /// Vector to store all `String` type indices
    std::vector<std::string> indices;
    /// String to store index (location name)
    std::string index;
    /// Store line from the external file
    std::string line;
    /// Read the first line containing all String indices (location names)
    std::getline(file, line);
    std::istringstream iss(line);
    /// Process the first line: save all location names into `indices` vector
    while (iss >> index) {
        indices.push_back(index);
    }
    /* 2. Probably I could use .reserve() before the while loop?
          The problem is that I don't know the size in advance. */
    /// Read the file via std::getline(). Rules obeyed:
    ///   - first the I/O operation, then error check, then data processing
    ///   - failbit and badbit prevent data processing, eofbit does not
    while (std::getline(file, line)) {
        std::istringstream is(line);
        /* 3. Is it efficient to define a stringstream variable inside a loop? */
        /// For each new line (matrix row), read the first String element (location name)
        is >> index;
        int distance;            // To store distance value
        std::size_t column = 0;  // Column number to access location names from `indices` vector
        /// Process the line further: store Int distances from the input stream
        /// (stop at the last known column so `indices` is never read out of range)
        while (column < indices.size() && is >> distance) {
            distances[index][indices[column++]] = distance;
        }
    }
    /// Only in case of set badbit we are sure that errno has been set
    /// Use perror() to print error details
    if (file.bad())
        std::perror(("Error while reading file " + filename).c_str());
    /// close file
    file.close();
    /// With C++11, std::map has move-semantics, which means the local map will be moved
    /// on return and in some cases even the move can be elided by the compiler (RVO)
    return distances;
}

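First, I have left three questions as comments in the source code. Your answers are very welcome.
Second, so far I have only done some minimal benchmarking with a much smaller input file of ~2000x2000, and it took about 30 seconds on my mid-range MacBook Pro (late 2015). I believe this is far too long (performance really matters in my case), and I would greatly appreciate any ideas on how to improve this code.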

Quick update on performance

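After reading "Is there any advantage of using map over unordered_map in case of trivial keys?", I decided to replace std::map with std::unordered_map and leave the rest unchanged. To my surprise, this reduced the execution time (reading the whole file) by a factor of ~4-5, i.e. from ~30 seconds to ~5-6 seconds. Not bad!
Then I modified my implementation based on G. Sliepen's answer https://stackoverflow.com/a/57562007/3737891, i.e. I replaced std::map<std::string, std::map<std::string, int>> with std::vector<int> and stored all string indices in a separate container of type std::unordered_map<std::string, size_t>. With this approach, the execution time dropped to ~1-2 seconds, i.e. at least 15 times faster than the initial approach!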

Parsing the matrix efficiently

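The most efficient approach is to read the values into a one-dimensional std::vector<int>. After the first line, you know the number of columns in the input file. At the end, you know how many rows there are by dividing the size of the vector by the number of columns. You then reinterpret the vector as a two-dimensional array.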

The first line can be read with std::getline() and parsed using std::istringstream. However, all other lines should just be parsed by doing:

int value;
file >> value;
distances.push_back(value);

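Of course, you need to skip the leftmost column of each line (the row name), for example by reading it into a std::string before reading the numbers.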

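By not reading the file line by line, you avoid converting each line into a std::istringstream, which is slower than parsing the values directly from file.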
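The steps described so far could be put together roughly as follows. This is only a minimal sketch, not the answer's actual code: the `ParsedMatrix` struct and the `ParseMatrix` function are hypothetical names introduced for illustration, and the sketch assumes that every row starts with a whitespace-free label followed by exactly as many integers as there are names in the header.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result type: the names plus the values in row-major order.
struct ParsedMatrix {
    std::vector<std::string> column_names;
    std::vector<std::string> row_names;
    std::vector<int> values;
};

// Reads a header line of names, then rows of the form "<name> v0 v1 ...".
ParsedMatrix ParseMatrix(std::istream &in) {
    ParsedMatrix m;
    std::string line;
    if (!std::getline(in, line)) return m;  // empty input

    // Only the header goes through std::istringstream.
    std::istringstream header(line);
    std::string name;
    while (header >> name) m.column_names.push_back(name);
    const std::size_t columns = m.column_names.size();

    // Every other row is read straight from the stream:
    // first the label, then exactly `columns` integers.
    while (in >> name) {
        m.row_names.push_back(name);
        for (std::size_t x = 0; x < columns; ++x) {
            int value;
            if (!(in >> value)) {
                // Truncated or malformed row: drop the partial row and stop.
                m.row_names.pop_back();
                m.values.resize(m.row_names.size() * columns);
                return m;
            }
            m.values.push_back(value);
        }
    }
    assert(m.values.size() == m.row_names.size() * columns);
    return m;
}
```

Taking a `std::istream &` rather than a file name lets the same function be fed from a `std::ifstream` or, for a quick check, from a `std::istringstream`.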

std::vector<> will automatically resize itself when necessary, such that adding to the end of the vector is an amortized O(1) operation.
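If the input is a distance matrix between the same set of locations, it is presumably square, so the column count from the header also predicts the total number of values. Under that assumption (which the input format itself does not guarantee), a single reserve() call can avoid the intermediate reallocations. `ReserveForSquareMatrix` is a hypothetical helper name:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: the matrix is square, so `columns` names in the header imply
// columns * columns values. reserve() only sets capacity; a file with more
// rows than expected still works, the vector simply grows again.
void ReserveForSquareMatrix(std::vector<int> &distances, std::size_t columns) {
    // Guard against overflow in columns * columns.
    assert(columns == 0 || columns <= distances.max_size() / columns);
    distances.reserve(columns * columns);
}
```

This would be called once, right after the header line has been parsed and before the first push_back().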

At the end, the vector holds columns times rows values, and if you want to access column x of row y, you have to write something like:

int desired_value = distances[x + y * columns];
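The index arithmetic above, together with the row count obtained by dividing the vector size by the column count, can be wrapped in two small helpers. `RowCount` and `At` are hypothetical names, and the sketch assumes the vector holds only complete rows in row-major order:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of complete rows in a row-major buffer with `columns` values per row.
std::size_t RowCount(const std::vector<int> &distances, std::size_t columns) {
    assert(columns > 0);
    return distances.size() / columns;
}

// Value at column x of row y.
int At(const std::vector<int> &distances, std::size_t columns,
       std::size_t x, std::size_t y) {
    assert(x < columns && y < RowCount(distances, columns));
    return distances[x + y * columns];
}
```

For the first two rows of the sample file restricted to columns A, B and C, the buffer would be {0, 10, 20, 15, 0, 25}, so At(distances, 3, 1, 0) yields 10 and At(distances, 3, 0, 1) yields 15.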

Accessing matrix elements by row and column name

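If you need to be able to access the data using the names of the rows and columns, you have to store those names along with the index they represent. The most efficient way is to store them in a std::unordered_map<>, like so: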

std::unordered_map<std::string, size_t> columns;
std::unordered_map<std::string, size_t> rows;
/// Read the first line containing all String indices (location names)
std::getline(file, line);
std::istringstream iss(line);
/// Process the first line: save all location names into `columns` map
std::string name;
size_t i = 0;
while (iss >> name)
    columns[name] = i++;
/// Process other lines
...

You can then get the distance for given row and column names like so:

size_t x = columns[column];
size_t y = rows[row];
int desired_value = distances[x + y * columns.size()];

Why a map of maps is inefficient

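A std::map is implemented as a balanced tree. Whenever you want to insert something, it has to traverse the tree to find out where to insert the new value. In general, this takes O(log(N)) time. However, if you insert new values such that they always come at the end, the tree needs to be rebalanced frequently, which makes it even slower.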

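Furthermore, your map stores a copy of the column name for every value, and a copy of the row name for every row. So for 10000 x 10000 elements you are storing a hundred million strings, many of them identical, and you are not interested in those strings at all, only in the row or column index they represent.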
