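What is the best way to read large data files into MATLAB? I am currently reading large .txt files in as tables and combining them based on their date. The problem is that MATLAB runs out of memory, and I am not sure of the best way around this.
The files I am reading have a structured header, in the following format: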
Phone timestamp;sensor timestamp [ns];channel 0;channel 1;channel 2;ambient
2021-03-04T19:58:47.117;536601230968690944;-332253;-317025;-322290;-641916;
2021-03-04T19:58:47.124;536601230976138752;-332199;-316980;-322281;-641938;
2021-03-04T19:58:47.131;536601230983586560;-332214;-316982;-322224;-641979;
2021-03-04T19:58:47.139;536601230991034368;-332200;-316973;-322191;-641939;
2021-03-04T19:58:47.146;536601230998482176;-332160;-316958;-322216;-641963;
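Well, first of all, loading the whole file content into memory is generally a bad idea, especially when the file is very large (its whole content may not even fit in the available memory). It is usually done either to limit disk access, or because processing the file chunk by chunk is complex to code and it is simpler to read everything first and process it afterwards (as long as the file size is reasonable).
The other question is whether, regardless of the file being read in one go or in chunks, you need to keep every "value" inside the file as a separate "value" in memory as well. If they all need to be kept separately, you will run out of memory either way. If you can "forget" about data, or reload only part of it when needed, the coding becomes more complex but very large files can be handled.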
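To make the "forget about data" idea concrete, here is a minimal sketch that keeps only a running sum and a count instead of every value. The `chunks` cell array is a stand-in for blocks that would be read from the file one after another; it and the other variable names are made up for this illustration, and the numbers are taken from the sample lines in the question.

```matlab
% Stand-in for blocks read from a large file one at a time
% (each row is [channel0, channel1, channel2, ambient])
chunkA = [-332253 -317025 -322290 -641916; ...
          -332199 -316980 -322281 -641938];
chunkB = [-332214 -316982 -322224 -641979];
chunks = {chunkA, chunkB};

% Only these two small variables survive from one block to the next
runningSum   = zeros(1, 4);
runningCount = 0;

for k = 1:numel(chunks)
    block = chunks{k};                              % current block only
    runningSum   = runningSum + sum(block, 1);      % column-wise sum
    runningCount = runningCount + size(block, 1);   % number of rows seen
    % 'block' is no longer needed after this point
end

overallMean = runningSum / runningCount;
```

Memory use here depends on the block size, not on the total number of lines in the file.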
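In your case, assuming the file content is a real-time acquisition of sensor values that you only need to average to reduce the memory footprint, you can use `fopen` and `fgetl` to get its content line by line. Note that although `fgetl` works line by line, the operating system keeps a buffer in memory for you, so there is no disk access for each line.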
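As a rough sketch of the `fopen`/`fgetl` loop on its own, the snippet below writes a tiny sample file and reads it back one line at a time. The temporary file and the variable names are placeholders for this illustration; a real script would open the existing data file instead.

```matlab
% Write a tiny sample file so the sketch is runnable on its own
sampleFile = [tempname, '.txt'];
fid = fopen(sampleFile, 'wt');
fprintf(fid, 'Phone timestamp;sensor timestamp [ns];channel 0;channel 1;channel 2;ambient\n');
fprintf(fid, '2021-03-04T19:58:47.117;536601230968690944;-332253;-317025;-322290;-641916;\n');
fprintf(fid, '2021-03-04T19:58:47.124;536601230976138752;-332199;-316980;-322281;-641938;\n');
fclose(fid);

% Read it back line by line; only the current line is held in 'tline'
[fid, err] = fopen(sampleFile, 'rt');
if (fid < 0), error('Failed to open file: %s', err); end
lineCount = 0;
tline = fgetl(fid);
while ischar(tline)          % fgetl returns -1 (not char) at end of file
    lineCount = lineCount + 1;
    % ... process 'tline' here, then let it go ...
    tline = fgetl(fid);
end
fclose(fid);
delete(sampleFile);          % remove the temporary sample file
```

Inside a function, wrapping `fclose` in an `onCleanup` object is a safer way to make sure the file gets closed if an error occurs partway through.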
Here is a complete example:
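- I use a regular expression to describe the type of line I am looking for in the file
- In the `preallocateData` subfunction:
  - I move to the beginning of the file
  - I read line by line until I find the first interesting line
  - Then I use `fseek` to jump quickly to just before the end of the file
  - I read line by line until I find the last interesting line
  - From the first and last timestamps read and the averaging duration I want, I can determine how large the matrices will end up and preallocate them for speed
- In the `doAveraging` subfunction:
  - I move to the beginning of the file
  - I read line by line and accumulate values until the timestamp difference exceeds the averaging duration I chose
  - I store the averaged values in the preallocated data set
  - If required, I trim the unused part of the preallocated data
- Note that:
  - The code should work even if the sensor stops sending values for a while (i.e. large gaps in the timestamps)
  - The code can average either over fixed time slots counted from the origin, or over a period that starts at the first sample received after the previous period ended (see `fixBinAveragingMode`)
  - The final list of timestamps may not be linearly spaced (especially when there are gaps in the original timestamps or when `fixBinAveragingMode` is not used). If that is more practical, you can add a call to `interp1` at the end of the main function to remap the averaged sensor values onto a linearly spaced time scale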
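Before the full listing, it may help to see what the regular expression extracts from a single line. The sketch below applies the pattern to one data line and to the header line from the question. It assumes the trailing part of the pattern is meant to be `\s*` (optional whitespace); the variable names are only for this illustration.

```matlab
% Pattern: one ignored field, then an unsigned timestamp and four signed integers
pattern = '^[^;]*;([0-9]+);([+-]?[0-9]+);([+-]?[0-9]+);([+-]?[0-9]+);([+-]?[0-9]+);\s*$';

sampleLine = '2021-03-04T19:58:47.117;536601230968690944;-332253;-317025;-322290;-641916;';
headerLine = 'Phone timestamp;sensor timestamp [ns];channel 0;channel 1;channel 2;ambient';

% With 'tokens', a matching line gives a 1x1 cell holding a 1x5 cell of text
tokens = regexp(sampleLine, pattern, 'tokens');
values = str2double(tokens{1});   % converts all five captured fields at once
timestampNs = values(1);          % sensor timestamp, as a double
channels    = values(2:5);        % [channel0, channel1, channel2, ambient]

% A line that does not fit the pattern (here the header) gives an empty result
headerTokens = regexp(headerLine, pattern, 'tokens');
```

This is why the listing tests `isscalar(tokens)`: only lines with exactly one match are treated as data, and the header or any malformed line is skipped.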
%
% PURPOSE:
%
%   Averages real-time sensor values.
%
% SYNTAX:
%
%   [timeNs, data] = AveragingSensorData(filename, avgDurationNs);
%
% INPUTS:
%
%   - 'filename': Text file containing real-time sensor data
%   - 'avgDurationNs': Averaging duration (in nanoseconds)
%
% OUTPUT:
%
%   - 'timeNs': (1 x ncount) vector representing time (in nanoseconds,
%               relative to the first data line)
%   - 'data': (ncount x 4) representing averaged sensor values
%       * First column is channel0
%       * Second column is channel1
%       * Third column is channel2
%       * Fourth column is ambient
%
%% ---
function [timeNs, data] = AveragingSensorData(filename, avgDurationNs)
%[
    if (nargin < 2), avgDurationNs = 1e9; end % Default: 1 second
    if (nargin < 1), filename = '.data.txt'; end

    % Regular expression pattern describing the type of line we are looking for in the file
    % Means:
    %   - Start of line
    %   - Anything except ';' (i.e. phone timestamp, ignored)
    %   - ';'
    %   - One or more digits (i.e. sensor timestamp)
    %   - ';'
    %   - One or more digits, optionally prefixed with [+/-] (i.e. channel0)
    %   - ... etc ...
    %   - Final ';', optional trailing whitespace, end of line
    pattern = '^[^;]*;([0-9]+);([+-]?[0-9]+);([+-]?[0-9]+);([+-]?[0-9]+);([+-]?[0-9]+);\s*$';

    % Minimal check
    if (~(isnumeric(avgDurationNs) && isscalar(avgDurationNs) && isreal(avgDurationNs) && (avgDurationNs > 0)))
        error('avgDurationNs must be a real, positive, numeric scalar.');
    end

    % First, try opening the file for later line-by-line reading
    [fid, err] = fopen(filename, 'rt');
    if (fid <= 0), error('Failed to open file: %s', err); end
    cuo = onCleanup(@()fclose(fid)); % Closes the file in all cases (normal termination, exception, or even ctrl+c)

    % Based on the first and last timestamps in the file and the averaging
    % duration, estimate the final size of the data and preallocate it.
    % NB: Quick exit for the easy cases where there are zero or one data lines
    [timeNs, data, canQuickExit] = preallocateData(fid, pattern, avgDurationNs);
    if (canQuickExit), return; end

    % Perform the actual averaging
    fixBinAveragingMode = false; % Is averaging at fixed or floating time positions?
    [timeNs, data] = doAveraging(fid, pattern, avgDurationNs, fixBinAveragingMode, timeNs, data);
%]
end
%% ---
function [timeNs, data, canQuickExit] = preallocateData(fid, pattern, avgDurationNs)
%[
    % Go back to the beginning of the file
    frewind(fid);

    % Look for first and last interesting lines in the file
    % NB: This assumes timestamps are sorted in increasing order
    nothingYet = true;
    fastAndFurious = true;
    firstReadTokens = []; lastReadTokens = [];
    while (true)
        % Read line-by-line until finding something interesting or eof
        tline = fgetl(fid);
        if (~ischar(tline)), break; end
        tokens = regexp(tline, pattern, 'tokens');
        if (~isscalar(tokens)), continue; end

        if (nothingYet) % It is the first time we found an interesting line
            nothingYet = false;
            firstReadTokens = tokens;
            lastReadTokens = tokens;
            if (fastAndFurious)
                % Don't bother reading each line, jump almost to the
                % end of the file directly. NB: This can be risky if there
                % are many empty lines at the end of the file, or if the
                % lines are not all of similar length
                fseek(fid, -3 * numel(tline), 'eof');
            end
        else % This is not the first time
            lastReadTokens = tokens;
        end
    end

    % Conversion of matched tokens timestamps
    firstReadTimestamp = []; lastReadTimestamp = [];
    if (~isempty(firstReadTokens)), firstReadTimestamp = str2double(firstReadTokens{1}{1}); end
    if (~isempty(lastReadTokens)), lastReadTimestamp = str2double(lastReadTokens{1}{1}); end

    % Compute preallocation
    if (isempty(firstReadTimestamp))
        % Easy, not a single line of data in the whole file
        timeNs = zeros(1, 0);
        data = zeros(0, 4);
        canQuickExit = true;
    elseif (isempty(lastReadTimestamp) || (abs(lastReadTimestamp - firstReadTimestamp) < 0.1))
        % Easy again, just one line of data in the whole file
        timeNs = zeros(1, 1);
        data = str2double(firstReadTokens{1}(2:5)); % [channel0, channel1, channel2, ambient]
        canQuickExit = true;
    else
        % Ok, let's allocate
        estimateBlockCount = ceil((lastReadTimestamp - firstReadTimestamp) / avgDurationNs);
        timeNs = zeros(1, estimateBlockCount);
        data = zeros(estimateBlockCount, 4);
        canQuickExit = false;
    end
%]
end
%% ---
function [timeNs, data] = doAveraging(fid, pattern, avgDurationNs, fixBinAveragingMode, timeNs, data)
%[
    % Go back to the beginning of the file
    frewind(fid);

    % Look for interesting lines till the end
    % NB: We assume timestamps are sorted in increasing order in the file
    idx = 0;
    nothingYet = true;
    while (true)
        % Read line-by-line until finding something interesting
        tline = fgetl(fid);
        if (~ischar(tline)), break; end
        tokens = regexp(tline, pattern, 'tokens');
        if (~isscalar(tokens)), continue; end

        % NB: Timestamps are held as doubles; at the magnitude shown in the
        % sample data (~5.4e17) a double resolves steps of 64 ns
        lastReadTimestamp = str2double(tokens{1}{1});
        sample = str2double(tokens{1}(2:5)); % [channel0, channel1, channel2, ambient]

        if (nothingYet)
            % Very first data line: open the first averaging block
            nothingYet = false;
            idx = idx+1;
            avgCount = 1;
            timeNs(idx) = lastReadTimestamp;
            nextStopTimestamp = lastReadTimestamp + avgDurationNs;
            avg = sample;
        elseif (lastReadTimestamp > nextStopTimestamp)
            % Current block is over: store its average and open a new one
            data(idx, :) = avg / avgCount;
            idx = idx+1;
            avgCount = 1;
            if (fixBinAveragingMode)
                % Fixed time slots from origin
                offset = mod(lastReadTimestamp - timeNs(1), avgDurationNs);
                timeNs(idx) = (lastReadTimestamp - offset);
                nextStopTimestamp = timeNs(idx) + avgDurationNs;
            else
                % Run timer for averaging immediately after receiving data
                timeNs(idx) = lastReadTimestamp;
                nextStopTimestamp = lastReadTimestamp + avgDurationNs;
            end
            avg = sample;
        else
            % Still inside the current block: accumulate
            avgCount = avgCount + 1;
            avg = avg + sample;
        end
    end

    if (~nothingYet)
        % Make time relative to the first sample and store the last block
        timeNs = timeNs - timeNs(1);
        data(idx, :) = avg / avgCount;
    end

    % Trim unused preallocated data if required
    timeNs((idx+1):end) = [];
    data((idx+1):end, :) = [];
%]
end
The code is also stored on GitHub: Averaging sensor values collected in a very large file
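The `interp1` remapping mentioned in the notes could look roughly like the sketch below. The `timeNs` and `data` values are invented stand-ins for the function outputs (with a deliberate gap between 2 s and 5 s), and `stepNs`, `uniformTimeNs` and `uniformData` are names chosen only for this illustration.

```matlab
% Invented stand-in for the outputs of the averaging function:
% an uneven time axis with a gap between 2 s and 5 s
timeNs = [0, 1e9, 2e9, 5e9, 6e9];
data   = [10 20 30 40; ...
          12 22 32 42; ...
          14 24 34 44; ...
          20 30 40 50; ...
          22 32 42 52];

% Linearly spaced time scale covering the same range
stepNs = 1e9;
uniformTimeNs = timeNs(1):stepNs:timeNs(end);

% interp1 interpolates each column of 'data' independently
uniformData = interp1(timeNs, data, uniformTimeNs, 'linear');
```

Linear interpolation fills the gap with a straight line between the neighbouring averages, which may or may not be appropriate for the sensor in question; leaving the gap as it is can be the more honest choice.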