Best way to read large amounts of data

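I'm just wondering what the best way is to read large data files into MATLAB. I'm currently reading large .txt files in as tables and combining them based on their dates. The problem is that MATLAB runs out of memory, and I'm not sure of the best way to solve this.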

The files I'm reading have a structured header and look like this:

Phone timestamp;sensor timestamp [ns];channel 0;channel 1;channel 2;ambient
2021-03-04T19:58:47.117;536601230968690944;-332253;-317025;-322290;-641916;
2021-03-04T19:58:47.124;536601230976138752;-332199;-316980;-322281;-641938;
2021-03-04T19:58:47.131;536601230983586560;-332214;-316982;-322224;-641979;
2021-03-04T19:58:47.139;536601230991034368;-332200;-316973;-322191;-641939;
2021-03-04T19:58:47.146;536601230998482176;-332160;-316958;-322216;-641963;

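Well, to begin with, loading the entire content of a file into memory is usually a bad idea, especially when the file is very large (its whole content may not even fit in the available memory). People do this to limit disk access, or, when processing the file chunk by chunk is complicated to code, simply to fetch everything first and process it afterwards (which is fine as long as the file size is reasonable).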

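Another question is whether, regardless of the file being read in one go or in chunks, all the "values" inside the file must also be kept in memory as separate "values". If they do need to be kept separately, you will run out of memory either way. If you can "forget" about the data, or reload only part of it when needed, the coding becomes more complex, but large files can be worked around.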

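In your case, assuming the file content is a real-time acquisition of sensor values, you may only need to average them to reduce the memory footprint. You can use fopen and fgetl to get the content line by line. Note that although fgetl works line by line, the operating system keeps a buffer in memory for you, so reading each line does not incur a separate disk access.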
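As a minimal sketch of that idea (not the full solution), the snippet below writes two of the sample lines from the question to a temporary file, then reads them back with fopen and fgetl while keeping only a running sum in memory. The temporary file, the variable names, and the use of strsplit instead of a regular expression are assumptions made for illustration.

```matlab
% Write a few sample lines to a temporary file (stand-in for the real log).
sampleFile = [tempname, '.txt'];
fid = fopen(sampleFile, 'wt');
fprintf(fid, 'Phone timestamp;sensor timestamp [ns];channel 0;channel 1;channel 2;ambient\n');
fprintf(fid, '2021-03-04T19:58:47.117;536601230968690944;-332253;-317025;-322290;-641916;\n');
fprintf(fid, '2021-03-04T19:58:47.124;536601230976138752;-332199;-316980;-322281;-641938;\n');
fclose(fid);

% Read the file back one line at a time; only the running sum is kept.
fid = fopen(sampleFile, 'rt');
runningSum = zeros(1, 4);   % channel 0, channel 1, channel 2, ambient
lineCount = 0;
tline = fgetl(fid);
while ischar(tline)                      % fgetl returns -1 at end of file
    fields = strsplit(tline, ';');
    if numel(fields) >= 6
        values = str2double(fields(3:6));
        if ~any(isnan(values))           % skips the header line
            runningSum = runningSum + values;
            lineCount = lineCount + 1;
        end
    end
    tline = fgetl(fid);
end
fclose(fid);
delete(sampleFile);

channelMeans = runningSum / lineCount;
```

However long the file is, memory use stays at one line plus a four-element accumulator.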

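Here is a complete example:

I use a regular expression to describe the type of lines I'm looking for in the file.
In the preallocateData subfunction:
- I move to the beginning of the file
- I read line by line until I find the first interesting line
- Then I use fseek to jump quickly to a position near the end of the file
- I read line by line until I find the last interesting line
- From the first and last timestamps read and the averaging duration I want, I can determine how large the output matrices will end up being and preallocate them for speed
In the doAveraging subfunction:
- I move to the beginning of the file
- I read line by line and accumulate values until the timestamp difference exceeds the averaging duration I chose
- I store the averaged values in the preallocated data set
- If needed, I trim the unused part of the preallocated data
Notes:
- The code should work even if the sensor stops sending values for a while (i.e. large gaps in the timestamps)
- The code can average the data either in fixed time slots counted from the origin, or over a duration that starts when the next data is received (see fixBinAveragingMode)
- The final list of timestamps may not be evenly spaced (especially when there are gaps in the original timestamps or when fixBinAveragingMode is not used). If it is more practical, you can add a call to interp1 at the end of the main function to remap the averaged sensor values onto an evenly spaced time scale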
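The interp1 remapping mentioned in the last note could look roughly like this. The numbers are made up for illustration, and uniformTimeNs, uniformData, and stepNs are hypothetical names that do not appear in the original code.

```matlab
% Hypothetical output of the averaging step: unevenly spaced times (ns)
% and one row of averaged sensor values per time stamp.
timeNs = [0, 1.0e9, 2.1e9, 5.0e9];
data = [10 20 30 40;
        12 22 32 42;
        14 24 34 44;
        20 30 40 50];                 % (ncount x 4)

% Evenly spaced time base covering the same span, one point per second.
stepNs = 1e9;
uniformTimeNs = timeNs(1):stepNs:timeNs(end);

% interp1 treats each column of data as a separate signal and uses
% linear interpolation by default.
uniformData = interp1(timeNs, data, uniformTimeNs);
```

Linear interpolation across a long gap invents values the sensor never produced, so depending on the application it may be safer to leave such gaps alone.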

%
% PURPOSE:
%
%   Averages real-time sensor values.
%
% SYNTAX:
%
%   [timeNs, data] = AveragingSensorData(filename, avgDurationNs);
%
% INPUTS:
%
%   - 'filename': Text file containing real-time sensor data
%   - 'avgDurationNs': Averaging duration (in nanoseconds)
%
% OUTPUT:
%
%   - 'timeNs': (1 x ncount) vector representing time (in nanoseconds)
%   - 'data': (ncount x 4) representing averaged sensor values
%             * First column is channel0
%             * Second column is channel1
%             * Third column is channel2
%             * Fourth column is ambient
%
%% ---
function [timeNs, data] = AveragingSensorData(filename, avgDurationNs)
%[
    % Default arguments
    if (nargin < 2), avgDurationNs = 1e9; end % 1 second (alternative: 0.01*1e9)
    if (nargin < 1), filename = '.data.txt'; end

    % Regular expression pattern describing the type of line we are looking for in the file
    % Means:
    %   - Start of line
    %   - Anything except ';' (i.e. phone timestamp, ignored)
    %   - ';'
    %   - One or more digits (i.e. sensor timestamp)
    %   - ';'
    %   - One or more digits, optionally prefixed with + or - (i.e. channel0)
    %   - ... etc ...
    %   - ';', optional trailing whitespace, end of line
    pattern = '^[^;]*;([0-9]+);([+-]?[0-9]+);([+-]?[0-9]+);([+-]?[0-9]+);([+-]?[0-9]+);\s*$';

    % Minimal check
    if (~(isnumeric(avgDurationNs) && isscalar(avgDurationNs) && isreal(avgDurationNs) && (avgDurationNs > 0)))
        error('avgDurationNs must be a real, positive, numeric scalar.');
    end

    % First, open the file so it can be read line by line later
    [fid, err] = fopen(filename, 'rt');
    if (fid < 0), error('Failed to open file: %s', err); end
    cuo = onCleanup(@()fclose(fid)); % Closes the file in all cases (normal termination, exception, or even ctrl+c)

    % Based on the first and last timestamps in the file and on the averaging
    % duration, estimate the final size of the outputs and preallocate them.
    % NB: Quick exit in the easy cases where there is no data line or a single one
    [timeNs, data, canQuickExit] = preallocateData(fid, pattern, avgDurationNs);
    if (canQuickExit), return; end

    % Do the actual averaging
    fixBinAveragingMode = false; % Averaging at fixed (true) or floating (false) time positions?
    [timeNs, data] = doAveraging(fid, pattern, avgDurationNs, fixBinAveragingMode, timeNs, data);
%]
end
%% ---
function [timeNs, data, canQuickExit] = preallocateData(fid, pattern, avgDurationNs)
%[
    % Go back to the beginning of the file
    frewind(fid);

    % Look for the first and last interesting lines in the file
    % NB: This assumes timestamps are sorted in increasing order
    nothingYet = true;
    fastAndFurious = true;
    firstReadTokens = []; lastReadTokens = [];
    while (true)
        % Read line by line until finding something interesting or eof
        tline = fgetl(fid);
        if (~ischar(tline)), break; end
        tokens = regexp(tline, pattern, 'tokens');
        if (~isscalar(tokens)), continue; end

        if (nothingYet) % This is the first interesting line we have found
            nothingYet = false;
            firstReadTokens = tokens;
            lastReadTokens = tokens;
            if (fastAndFurious)
                % Don't bother reading every line, jump almost to the end
                % of the file directly. NB: This can be risky if there are
                % many empty lines at the end of the file, or if lines do
                % not all have the same length
                fseek(fid, -3 * numel(tline), 'eof');
            end
        else % Not the first time: remember the latest interesting line
            lastReadTokens = tokens;
        end
    end

    % Convert the timestamps of the matched tokens
    firstReadTimestamp = []; lastReadTimestamp = [];
    if (~isempty(firstReadTokens)), firstReadTimestamp = str2double(firstReadTokens{1}{1}); end
    if (~isempty(lastReadTokens)), lastReadTimestamp = str2double(lastReadTokens{1}{1}); end

    % Compute preallocation
    if (isempty(firstReadTimestamp))
        % Easy, not a single line of data in the whole file
        timeNs = zeros(1, 0);
        data = zeros(0, 4);
        canQuickExit = true;
    elseif (isempty(lastReadTimestamp) || (abs(lastReadTimestamp - firstReadTimestamp) < 0.1))
        % Easy again, just one line of data in the whole file
        timeNs = zeros(1, 1);
        data = str2double(firstReadTokens{1}(2:5)); % [channel0, channel1, channel2, ambient]
        canQuickExit = true;
    else
        % Ok, let's allocate
        estimateBlockCount = ceil((lastReadTimestamp - firstReadTimestamp) / avgDurationNs);
        timeNs = zeros(1, estimateBlockCount);
        data = zeros(estimateBlockCount, 4);
        canQuickExit = false;
    end
%]
end
%% ---
function [timeNs, data] = doAveraging(fid, pattern, avgDurationNs, fixBinAveragingMode, timeNs, data)
%[
    % Go back to the beginning of the file
    frewind(fid);

    % Look for interesting lines until the end of the file
    % NB: We assume timestamps are sorted in increasing order in the file
    idx = 0;
    nothingYet = true;
    while (true)
        % Read line by line, skipping anything that is not a data line
        tline = fgetl(fid);
        if (~ischar(tline)), break; end
        tokens = regexp(tline, pattern, 'tokens');
        if (~isscalar(tokens)), continue; end

        lastReadTimestamp = str2double(tokens{1}{1});
        values = str2double(tokens{1}(2:5)); % [channel0, channel1, channel2, ambient]

        if (nothingYet)
            % First data line: open the first averaging block
            nothingYet = false;
            idx = idx + 1;
            avgCount = 1;
            timeNs(idx) = lastReadTimestamp;
            nextStopTimestamp = lastReadTimestamp + avgDurationNs;
            avg = values; % Running sum for the current block
        elseif (lastReadTimestamp > nextStopTimestamp)
            % Current block is over: store its average and open a new block
            data(idx, :) = avg / avgCount;
            idx = idx + 1;
            avgCount = 1;
            if (fixBinAveragingMode)
                % Fixed time slots from origin
                offset = mod(lastReadTimestamp - timeNs(1), avgDurationNs);
                timeNs(idx) = (lastReadTimestamp - offset);
                nextStopTimestamp = timeNs(idx) + avgDurationNs;
            else
                % Run timer for averaging immediately after receiving data
                timeNs(idx) = lastReadTimestamp;
                nextStopTimestamp = lastReadTimestamp + avgDurationNs;
            end
            avg = values;
        else
            % Still inside the current block: accumulate
            avgCount = avgCount + 1;
            avg = avg + values;
        end
    end

    if (~nothingYet)
        timeNs = timeNs - timeNs(1); % Make time relative to the first sample
        data(idx, :) = avg / avgCount; % Store the last, still open block
    end

    % Trim unused preallocated data if required
    timeNs((idx+1):end) = [];
    data((idx+1):end, :) = [];
%]
end

The code is also stored on GitHub: Averaging sensor values collected in a very large file
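To see how the two time-stamping modes in doAveraging differ, here is a small worked example with made-up numbers. The variable names are illustrative only; the arithmetic mirrors the mod-based alignment used in the function.

```matlab
% Illustrative numbers only (not taken from the sample file).
originNs      = 1000;  % timestamp of the very first sample, i.e. timeNs(1)
avgDurationNs = 100;   % averaging duration
sampleNs      = 1370;  % first sample seen after the previous block closed

% Fixed mode: snap the block start back onto the grid
% originNs + k * avgDurationNs.
offset       = mod(sampleNs - originNs, avgDurationNs);
fixedStartNs = sampleNs - offset;
fixedStopNs  = fixedStartNs + avgDurationNs;

% Floating mode: the block starts exactly at the sample that opened it.
floatingStartNs = sampleNs;
floatingStopNs  = sampleNs + avgDurationNs;
```

In fixed mode the block covers 1300 to 1400, while in floating mode it covers 1370 to 1470, so after a gap the floating blocks no longer line up with multiples of the averaging duration.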
