Strange memory consumption with C fread / C++ read functions, according to Linux sysinfo data



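Okay, my program shows some strange (in my opinion) behavior, which I have now reduced to just reading 3 arrays from fairly large (approximately 24 GB and 48 GB) binary files. The structure of these files is quite simple: they contain a small header followed by 3 arrays of types int, int and float, all three of size N, where N is very large: 2147483648 for the 24 GB file and 4294967296 for the 48 GB file.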
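
As a quick sanity check of the sizes above, the payload of the three arrays can be computed at compile time. This is only a sketch: it assumes 4-byte int and float and ignores the small header, and the names payload_bytes and GiB are introduced here for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: int and float are 4 bytes each, as on typical Linux x86-64.
static_assert(sizeof(int) == 4 && sizeof(float) == 4, "sketch assumes 4-byte int/float");

constexpr std::uint64_t GiB = 1024ull * 1024ull * 1024ull;

// Bytes occupied by the three N-element arrays (int, int, float), header ignored.
constexpr std::uint64_t payload_bytes(std::uint64_t n)
{
    return n * (sizeof(int) + sizeof(int) + sizeof(float));
}

static_assert(payload_bytes(2147483648ull) == 24 * GiB, "2^31 edges -> 24 GiB");
static_assert(payload_bytes(4294967296ull) == 48 * GiB, "2^32 edges -> 48 GiB");
```

Under these assumptions each array of the smaller file takes 8 GiB and the whole payload 24 GiB, which is the amount the arrays themselves should occupy once they are filled.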

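To track memory consumption, I use a simple function based on Linux sysinfo that reports how much free memory I have at each stage of the program (for example, after I allocate the arrays to store the data and while reading the file). Here is the code of the function: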

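#include <sys/sysinfo.h>
#include <cstddef>

// Returns the amount of free RAM, in MB, as reported by sysinfo
size_t get_free_memory_in_MB()
{
    struct sysinfo info;
    sysinfo(&info);
    // freeram is counted in units of mem_unit bytes
    return info.freeram * info.mem_unit / (1024 * 1024);
}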

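Now straight to the problem: the strange part is that after reading each of the 3 arrays from the file, using either the standard C fread function or the C++ read function (it makes no difference), and checking how much free memory is left after the read, I see that the amount of free memory has decreased considerably (by approximately edges_count * sizeof(int) in the following example).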

fread(src_ids, sizeof(int), edges_count, graph_file);
cout << "1 test: " << get_free_memory_in_MB() << " MB" << endl;

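So basically, after reading the whole file, my memory consumption according to sysinfo is almost 2 times higher than expected. To illustrate the problem better, I provide the code of the whole function together with its output; please read it, it is short and shows the problem much more clearly.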

bool load_from_edges_list_bin_file(string _file_name)
{
    bool directed = true;
    int vertices_count = 1;
    long long int edges_count = 0;
    // open the file in binary mode
    FILE *graph_file = fopen(_file_name.c_str(), "rb");
    if(graph_file == NULL)
        return false;
    // just reading a simple header here
    fread(reinterpret_cast<char*>(&directed), sizeof(bool), 1, graph_file);
    fread(reinterpret_cast<char*>(&vertices_count), sizeof(int), 1, graph_file);
    fread(reinterpret_cast<char*>(&edges_count), sizeof(long long), 1, graph_file);
    cout << "edges count: " << edges_count << endl;
    cout << "Before graph alloc free memory: " << get_free_memory_in_MB() << " MB" << endl;
    // allocate the arrays to store the result
    int *src_ids = new int[edges_count];
    int *dst_ids = new int[edges_count];
    _TEdgeWeight *weights = new _TEdgeWeight[edges_count];
    cout << "After graph alloc free memory: " << get_free_memory_in_MB() << " MB" << endl;
    memset(src_ids, 0, edges_count * sizeof(int));
    memset(dst_ids, 0, edges_count * sizeof(int));
    memset(weights, 0, edges_count * sizeof(_TEdgeWeight));
    cout << "After memset: " << get_free_memory_in_MB() << " MB" << endl;
    // add edges from file
    fread(src_ids, sizeof(int), edges_count, graph_file);
    cout << "1 test: " << get_free_memory_in_MB() << " MB" << endl;
    fread(dst_ids, sizeof(int), edges_count, graph_file);
    cout << "2 test: " << get_free_memory_in_MB() << " MB" << endl;
    fread(weights, sizeof(_TEdgeWeight), edges_count, graph_file);
    cout << "3 test: " << get_free_memory_in_MB() << " MB" << endl;
    cout << "After actual load: " << get_free_memory_in_MB() << " MB" << endl;
    delete []src_ids;
    delete []dst_ids;
    delete []weights;
    cout << "After we removed the graph load: " << get_free_memory_in_MB() << " MB" << endl;
    fclose(graph_file);
    cout << "After we closed the file: " << get_free_memory_in_MB() << " MB" << endl;
    return true;
}

So, nothing complicated. Now the output (with some comments after //). First, for the 24 GB file:

Loading graph...
edges count: 2147483648
Before graph alloc free memory: 91480 MB 
After graph alloc free memory: 91480 MB // allocated memory here, but nothing changed, why?
After memset: 66857 MB // ok, we put some data into the memory (memset) and consumed exactly 24 GB, seems correct
1 test: 57658 MB // first read and we have lost 9 GB...
2 test: 48409 MB // -9 GB again...
3 test: 39161 MB // and once more...
After actual load: 39161 MB // we lost in total 27 GB during the reads. How???
After we removed the graph load: 63783 MB // removed the arrays from memory and freed the memory we have allocated
// 24 GB freed, but 27 are still consumed somewhere
After we closed the file: 63788 MB // closing the file doesn't help
Complete!
After we quit the function: 63788 MB // quitting the function doesn't help either.

Similarly, for the 48 GB file:

edges count: 4294967296
Before graph alloc free memory: 91485 MB
After graph alloc free memory: 91485 MB
After memset: 42236 MB
1 test: 23784 MB
2 test: 5280 MB
3 test: 490 MB
After actual load: 490 MB
After we removed the graph load: 49737 MB
After we closed the file: 49741 MB
Complete!
After we quit the function: 49741 MB

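So, what is happening inside my program?

1) Why is so much memory lost during reading (both with fread from C and with file streams from C++)?

2) Why does closing the file not free the consumed memory?

3) Maybe sysinfo is showing me incorrect information?

4) Could this problem be related to memory fragmentation?

By the way, I launch my program on a supercomputer node to which I have exclusive access (so other people cannot affect it), and there are no side applications that could affect my program.

Thanks for reading this!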

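This is almost certainly the disk (page) cache. When you read a file, the operating system keeps some or all of its contents in memory, which reduces the amount of free memory. This is done to speed up future reads.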

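However, this does not mean that the memory is used by the process or is otherwise unavailable. If and when the memory is needed, the operating system will release it and make it available.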
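
One way to see that this cached memory is reclaimable is to ask the kernel to drop the cached pages of the file once you are done with it. The following is a minimal sketch, not part of the original program: it assumes Linux/POSIX, the helper name drop_file_cache is hypothetical, and POSIX_FADV_DONTNEED is only advice that the kernel may honor partially.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: advises the kernel that the cached pages of `path`
// are no longer needed. Returns 0 on success, an errno-style code otherwise.
int drop_file_cache(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return errno;
    // offset 0 and length 0 mean "the whole file";
    // posix_fadvise returns the error number directly instead of setting errno
    int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    return rc;
}
```

If the explanation above is right, calling such a helper on the graph file after the reads should bring the free memory reported by sysinfo back up, although the kernel would reclaim those pages on its own under memory pressure anyway.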

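You should be able to confirm this by tracking the value of the bufferram field in the sysinfo structure (https://www.systutorials.com/docs/linux/man/2-sysinfo/) or by looking at the output of the free -m command before and after running the program.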

For more details on this, see the following answer: https://superuser.com/questions/980820/what-is-the-difference-between-memfree-and-memavailable-in-proc-meminfo
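
The MemFree, Cached and MemAvailable values discussed in that answer can also be read from inside the program. The sketch below parses /proc/meminfo-style text; it assumes the usual "Key:   value kB" line layout, and the function names meminfo_field_kb and proc_meminfo_kb are made up for this example.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Returns the value (in kB) of a field such as "MemFree" from meminfo-style
// text ("Key:   12345 kB" per line), or -1 if the field is not found.
long long meminfo_field_kb(std::istream &in, const std::string &key)
{
    const std::string prefix = key + ":";
    std::string line;
    while (std::getline(in, line)) {
        // match only at the start of the line, so "Cached" skips "SwapCached"
        if (line.compare(0, prefix.size(), prefix) == 0) {
            std::istringstream rest(line.substr(prefix.size()));
            long long value = -1;
            if (rest >> value)
                return value;
            return -1;
        }
    }
    return -1;
}

// Convenience wrapper that reads the live /proc/meminfo (Linux only).
long long proc_meminfo_kb(const std::string &key)
{
    std::ifstream f("/proc/meminfo");
    return meminfo_field_kb(f, key);
}
```

Printing proc_meminfo_kb("MemFree"), proc_meminfo_kb("Cached") and proc_meminfo_kb("MemAvailable") next to each get_free_memory_in_MB() call should show Cached growing during the reads while MemAvailable drops only by roughly the size of the program's own arrays, if the page cache is indeed the cause.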
