How to enhance the performance of for-loops and cell-arrays (related to statistical calculations)?

Question

Glazio on 5 Jun 2017

https://kr.mathworks.com/matlabcentral/answers/343417-how-to-enhance-the-performance-of-for-loops-and-cell-arrays-related-to-statistical-calculations

Commented: dpb on 8 Jun 2017

The following code calculates some performance measures over different periods. No error messages occurred, but the processing time is very long.

F = 'runoff.txt'; % name of the file
D = 'C:\Users\heute\model\results\model_standalone\'; % absolute or relative path of base directory
S = dir(fullfile(D,'results*'));
X = [S.isdir] & ~ismember({S.name},{'.','..'});
N = {S(X).name};
L = cell(size(N));
C = cell(size(N));
for k = 1:numel(N)
    T = fullfile(D,N{k},F);
    fid = fopen(T,'rt');
    fmt = ['%s',repmat('%f',1,6)];
    opt = {'HeaderLines',1,'CollectOutput',true};
    Z = textscan(fid,fmt,opt{:});
    fclose(fid);
    L{k} = Z{1}; % timestamp
    C{k} = Z{2}; % data
    %        
    Qs = C{k}(:,6); % define the simulated runoff, as column 6 in each cell array
    %
    % define the periods for computing performance measures
    sdatelim_neu = [datenum(2013,10,01,00,00,00) datenum(2016,10,01,00,00,00)];
    dt = 1/24;
    date = sdatelim_neu(1):dt:sdatelim_neu(2);
    date_runoff = transpose(date);
    %
    sdatelim1 = [datenum(2014,05,01,00,00,00) datenum(2014,10,01,00,00,00)];
    dt = 1/24;
    sdate_sdatelim1 = sdatelim1(1):dt:sdatelim1(2);
    %
    sdatelim2 = [datenum(2015,05,01,00,00,00) datenum(2015,10,01,00,00,00)];
    sdate_sdatelim2 = sdatelim2(1):dt:sdatelim2(2);
    %
    sdatelim3 = [datenum(2016,05,01,00,00,00) datenum(2016,10,01,00,00,00)];
    sdate_sdatelim3 = sdatelim3(1):dt:sdatelim3(2);
    %
    % loop over the different periods
    for s = 1:length(sdate_sdatelim1);
        for a = 1:length(sdate_sdatelim2);
            for b = 1:length(sdate_sdatelim3);
                j = find(date_runoff >= sdate_sdatelim1(s) & date_runoff < sdate_sdatelim1(k)+dt) & find(date_runoff >= sdate_sdatelim2(a) & date_runoff < sdate_sdatelim2(a)+dt) & find(date_runoff >= sdate_sdatelim3(b) & date_runoff < sdate_sdatelim3(b)+dt);
                f_1k = 1-cov(Qs(j) - Qo)/var(Qo); %NSE
                f_2k = sqrt(mean((Qs(j) - Qo).^2)); %RMSE
                f_3k = abs(mean(Qs(j)- Qo)); %BIAS
                %Qo is the observed runoff -> imported from file
                %
                % write into matrix YA -> for use in further analysis
                YA = [f_1k, f_2k, f_3k];
            end
        end
    end
end

As a test case, I ran this code for two input files (each of them has 26280 rows in column 6). In the end, however, several thousand input files should be processed.

How can I reduce the computing time?

Or is there an error within the for-loop over the different periods? Or is this:

Qs = C{k}(:,6); % define the simulated runoff, as column 6 in each cell array

an inefficient command?

(I use MATLAB R2012a)

Comments: 7

Glazio on 6 Jun 2017

Edited: Glazio on 6 Jun 2017

@dpb:

I have about 1000 txt files that contain simulated runoff values. These files all have the same name, and each one is saved in its own subfolder named results followed by a consecutive number (results1, results2, results3, ...).

The simulated runoff values are read from each txt file and written to a cell array (C{k} = Z{2}).

Now, the task is to compute some performance measures (NSE, RMSE, BIAS) from the comparison of observed and simulated values. For each of the runoff.txt files one NSE, RMSE and BIAS should be calculated at the end.
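For reference, the three measures as the code in this thread computes them can be written out as below, with simulated values $Q_{s,i}$, observed values $Q_{o,i}$ and $n$ selected time steps (the symbols are introduced here for illustration). Because `cov` of a vector returns its variance, the thread's NSE expression subtracts the mean error before squaring. The commonly cited Nash-Sutcliffe form, given last for comparison, does not, so the two agree only when the mean error is zero.

```latex
\begin{aligned}
e_i &= Q_{s,i} - Q_{o,i}, \qquad \bar{e} = \frac{1}{n}\sum_{i=1}^{n} e_i \\
\mathrm{NSE}_{\text{thread}} &= 1 - \frac{\sum_{i=1}^{n} (e_i - \bar{e})^2}{\sum_{i=1}^{n} (Q_{o,i} - \bar{Q}_o)^2} \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n} e_i^2} \\
\mathrm{BIAS} &= \left|\bar{e}\right| \\
\mathrm{NSE}_{\text{standard}} &= 1 - \frac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n} (Q_{o,i} - \bar{Q}_o)^2}
\end{aligned}
```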

Within a single file (runoff.txt), only certain periods should be considered to calculate the performance measures.

These periods are:

1 May 2014 to 30 September 2014
1 May 2015 to 30 September 2015
1 May 2016 to 30 September 2016

In the end, the matrix YA should contain 3 columns (NSE, RMSE, BIAS), with one row per processed runoff.txt file (e.g. 1000 files -> 1000 rows).
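The period selection described above can be sketched with a single logical mask instead of nested loops over every hour. This is a minimal sketch under stated assumptions, not the thread's final code: the series Qo and Qs are synthetic stand-ins, nFiles and inPeriod are hypothetical names, and only serial date numbers are used so that it should also run on older releases such as R2012a.

```matlab
% Hourly time axis as serial date numbers. Dividing whole hour counts by 24
% keeps the midnight stamps exact, so the boundary comparisons are reliable.
t0     = datenum(2013,10,1);
nHours = (datenum(2016,10,1) - t0)*24;
t      = t0 + (0:nHours).'/24;

% Logical mask: true for time steps from 1 May up to (not including) 1 October.
inPeriod = false(size(t));
for yr = 2014:2016
    inPeriod = inPeriod | (t >= datenum(yr,5,1) & t < datenum(yr,10,1));
end

% Synthetic stand-ins for the observed and simulated runoff (assumption).
Qo = 5 + sin(2*pi*t/365);          % hypothetical observed series
Qs = Qo + 0.5;                     % hypothetical simulation, constant offset

nFiles = 1;                        % would be the number of runoff.txt files
YA = zeros(nFiles,3);              % one row per file: [NSE RMSE BIAS]
e  = Qs(inPeriod) - Qo(inPeriod);  % residuals inside the periods only
YA(1,:) = [1 - cov(e)/var(Qo(inPeriod)), sqrt(mean(e.^2)), abs(mean(e))];
```

With the constant offset of 0.5 used here, RMSE and BIAS both come out at about 0.5, while the NSE expression from the thread stays at about 1 because `cov` removes the mean error.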

How can this be solved efficiently?

Two example files are attached:

1.) runoff.txt -> simulation results for one parameter combination
2.) runoff_observed.txt -> measurements of runoff

Glazio on 6 Jun 2017

@Walter Roberson: The code is trying to calculate the root mean square error (RMSE), BIAS and NSE for each input file (runoff.txt) and should consider only certain periods for the calculation.

The goal is a matrix YA which contains the performance measure combination for each runoff-file.

Glazio on 6 Jun 2017

@Stephen Cobeldick, thanks for your help.

What exactly do you mean by:

"It does not seem to be necessary store the data from all files, as you only seem to process the data from the current file." ?

In the end, all results should be in YA.


Answer 1 (Accepted Answer)

dpb on 6 Jun 2017

https://kr.mathworks.com/matlabcentral/answers/343417-how-to-enhance-the-performance-of-for-loops-and-cell-arrays-related-to-statistical-calculations#answer_269727

Edited: dpb on 8 Jun 2017

S = 'runoff.txt';
O = 'runoff_observed.txt';
D = 'C:\Users\heute\model\results\model_standalone\';
d = dir(fullfile(D,'results*'));                          % list of directories
% NOTE: the %D format, datetime and isbetween require R2014b or later
fmtS = ['%{dd.MM.yyyy-HH:mm}D' repmat('%*f',1,5) '%f'];   % simulated format string
fmtO = ['%{dd.MM.yyyy HH:mm}D' '%f'];                     % observed format string
L=length(d);                                              % number of subdirs
YA=zeros(L,3);                                            % preallocate output
for k = 1:L                                               % iterate over subdirs
  fid = fopen(fullfile(D,d(k).name,S),'rt');              % open simulated
  Z=textscan(fid,fmtS,'headerlines',2,'collectoutput',1); % read simulated
  fclose(fid);
  dtS=Z{1};                                               % timestamp simulated (datetime)
  Qs=Z{2};                                                % simulated data
  fid = fopen(fullfile(D,d(k).name,O),'rt');              % open observed
  Z=textscan(fid,fmtO,'headerlines',1,'collectoutput',1); % read observed
  fclose(fid);
  dtO=Z{1};                                               % timestamp observed (datetime)
  Qo=Z{2};                                                % observed data
  % define the periods for computing performance measures
  yr1=2014; yr2=2016;                                     % years to compute over
  % isbetween on datetime values needs datetime bounds, not datenum
  ix=isbetween(dtO,datetime(yr1,5,1),datetime(yr1,10,1)); % first year
  for yr=yr1+1:yr2                                        % subsequent years
    ix=ix | isbetween(dtO,datetime(yr,5,1),datetime(yr,10,1));
  end
  % performance measures over all selected periods of this file
  f_1k = 1-cov(Qs(ix)-Qo(ix))/var(Qo(ix));                % NSE
  f_2k = sqrt(mean((Qs(ix)-Qo(ix)).^2));                  % RMSE
  f_3k = abs(mean(Qs(ix)-Qo(ix)));                        % BIAS
  YA(k,:) = [f_1k, f_2k, f_3k];
end

ADDENDUM

Cleaned up to incorporate the changes from the conversation below, except for opening the reference file; handle that as needed. The code above should then return L records in the output array.

ERRATUM

NB: Remove the (1) index from the year reference yr in the loop to get the subsequent years after the first. I inadvertently left it in when I copied the line.

Comments: 15

Glazio on 8 Jun 2017

Edited: Glazio on 8 Jun 2017

I believe I have found a solution to the issue from my last comment.

I took the following code section:

yr1=2014; yr2=2016;                                     % years to compute over
  YA=zeros(yr2-yr1+1,3);                                  % preallocate for output
  idx=0;                                                  % output index
  for yr=yr1:yr2
    date1 = datenum(yr,05,01);                            % dates to consider
    date2 = datenum(yr,10,01);
    ix=isbetween(dtO,date1,date2);                        % inclusive dates logical vector
    f_1k = 1-cov(Qs(ix)-Qo(ix))/var(Qo(ix));              %NSE
    f_2k = rms(Qs(ix)-Qo(ix));                            %RMSE
    f_3k = abs(mean(Qs(ix)-Qo(ix)));                      %BIAS
    % write into matrix YA -> for use in further analysis
    idx=idx+1;
    YA(idx,:) = [f_1k, f_2k, f_3k];
  end

and slightly modified it to:

    Y1 = zeros(k,3);
    idx=0;                                                  % output index
    for yr=yr1:yr2
      date1 = datenum(yr,05,01,00,00,00);                   % dates to consider
      date2 = datenum(yr,10,01,00,00,00);
      ix = iswithin(dto,date1,date2);                       % inclusive dates logical vector
      f_1k = 1-cov(Qs(ix)-Qo(ix))/var(Qo(ix));              %NSE
      f_2k = rms(Qs(ix)-Qo(ix));                            %RMSE
      f_3k = abs(mean(Qs(ix)-Qo(ix)));                      %BIAS
      % write into matrix YA -> for use in further analysis
      idx=idx+1;
      Y1(idx,:) = [f_1k, f_2k, f_3k];
      Y2 = sum(Y1);
    end
    YA(k,:) = Y2;
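The iswithin helper called above does not appear to be a built-in MATLAB function, and its definition is not shown in this part of the thread. A minimal sketch of what such a helper could look like, assuming inclusive bounds on serial date numbers (the anonymous function below is a hypothetical stand-in, not the original utility):

```matlab
% Hypothetical stand-in for the iswithin helper (inclusive bounds assumed).
iswithin = @(x,lo,hi) (x >= lo) & (x <= hi);

% Example on serial date numbers: daily stamps from 29 Apr to 3 May 2014.
dto   = (datenum(2014,4,29):datenum(2014,5,3)).';
date1 = datenum(2014,5,1);
date2 = datenum(2014,10,1);
ix = iswithin(dto,date1,date2);    % logical column vector, same size as dto
```

With an inclusive upper bound, a stamp at exactly 1 October 00:00 is also selected. If the period should end with 30 September, an exclusive upper comparison (x < hi) may be the better fit.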

The program now works much faster :-)

A general question:

Is it better to preallocate the output YA to a fixed size based on the known number of files, or to let it grow with k?
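On the preallocation question, a common pattern is to size YA once from the number of files before the loop and only assign rows inside it, since growing an array inside a loop generally forces repeated reallocation. A minimal sketch with placeholder values (nFiles and the row contents are hypothetical; in the real loop each row would hold the NSE, RMSE and BIAS of one file):

```matlab
nFiles = 1000;                 % hypothetical number of runoff.txt files
YA = zeros(nFiles,3);          % allocate once, before the loop
for k = 1:nFiles
    % placeholder values standing in for [NSE RMSE BIAS] of file k
    YA(k,:) = [k, 2*k, 3*k];
end
```

In the thread's code the row count would come from the directory listing, for example length(d) or numel(N), so it is known before any file is read.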

Glazio on 8 Jun 2017

@dpb: Thanks for your help and patience. It now seems to deliver the right results :-)

dpb on 8 Jun 2017

Well, not if you didn't make the fix I noted above; as it was, it will run but process the first year three times.

MORAL: ALWAYS debug thoroughly; running w/o error doesn't guarantee correctness!
