Faster alternative to containers.Map

Question

Paolo Binetti 2017년 8월 31일

0
링크

이 질문에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/354799-faster-alternative-to-containers-map

댓글: JS2018 2022년 10월 16일

Profiling a script (attached, along with a sample input data file), I have found that looking up a Map generated with containers.Map is the bottleneck. Namely the table is:

s = containers.Map(nodes, num2cell([1:numel(nodes)]'));

and the script looks it up within a while-loop a few thousands times:

idx = s(temp1); % same as above if s is a Map object

I have tried replacing the Map object with a data structure, but it did not seem to work, due to field name limitations. Are there other faster methods?

댓글 수: 3
이전 댓글 1개 표시이전 댓글 1개 숨기기

Paolo Binetti 2017년 9월 1일

MATLAB Online에서 열기

Thank you. I have just tried it. Referring to my original code (attached to the question), you can create the hash table like this:

t = java.util.Hashtable;
for k = 1:numel(nodes)
    t.put(nodes{k}, k);
end

This takes much longer than creating a Map object, but maybe it can be sped up by vectorizing, if possible.

Then you can simply look the table up as follows:

idx = t.get(temp1);

Unfortunately this takes much longer than looking up the equivalent Matlab object.

Perhaps it would be faster using Python dictionaries in Matlab, but still have not figured out how.

Nikolaus Koopmann 2022년 3월 16일

Hi @Paolo Binetti,

have you found a solution??

i'm faced with a similar problem. Was going with containers.Map first but it didnt scale. java.uitl.HashTable is just as slow :(

cheers,

niko

댓글을 달려면 로그인하십시오.

이 질문에 답변하려면 로그인하십시오.

Answer 1

Walter Roberson 2017년 9월 1일

1
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/354799-faster-alternative-to-containers-map#answer_279915

MATLAB Online에서 열기

fid = fopen('dataset_203_2.txt', 'rt');
approx_num_nodes = 3000;
used_nodes = 0;
known_nodes = nan(1, approx_num_nodes);
node_connections = cell(1, approx_num_nodes);
while true
  thisline = fgetl(fid);
  if ~ischar(thisline); break; end    %end of file
  toks = regexp(thisline, '^(?<src>\d+)\s*->\s*(?<dst>(\d+,\s*)*\d+)', 'names');
  src = str2double(toks.src);
  dst = str2double( regexp(toks.dst, ',\s*', 'split') );
    mentioned = [src, dst];
    [known, idx] = ismember(mentioned, known_nodes);
    unknown = ~known;
    num_new_nodes = nnz(unknown);
    newnodes = used_nodes+(1:num_new_nodes);
    known_nodes(newnodes) = mentioned(unknown);
    used_nodes = used_nodes + num_new_nodes;
    idx(unknown) = newnodes;
    node_connections{idx(1)} = idx(2:end);
  end
fclose(fid);
known_nodes = known_nodes(1:used_nodes);

At the end of this code, known_nodes will be a numeric list of node numbers from the file, in the order encountered, and node_connections will be a cell array of numeric vectors listing all of the connections. The connections listed will be in terms of the indices into the known_nodes list, not in terms of the original node numbers.

Another way of phrasing this is that the known_nodes is something would something you would use for the node labels, but the information in the node_connections list uses internal node numbers. It would be each to reconfigure this for text labels instead of numeric labels.

댓글 수: 3
이전 댓글 1개 표시이전 댓글 1개 숨기기

Walter Roberson 2017년 9월 1일

I did not do any timing tests on this. On my system it executed quickly on the test file.

Taking out the str2double() and fixing up the indexing to suit should speed it a little.

This code does have the disadvantage of calling ismember over and over again. I have it ismember against the complete nan-padded list, so it should take pretty much constant time per pass. In theory comparing against only the part of the array that has been used would make it faster, but in practice the extracting of the used subset would probably negate the gains. You would not want to build upwards with the list getting longer and longer, as then you end up doing reallocations every pass.

Paolo Binetti 2017년 9월 1일

편집: Paolo Binetti 2017년 9월 1일

Thank you. I also had tried a method based on repeated calls of ismember, but it was very slow. I was wondering if it could be improved by using undocumented ismembc2, but I am not sure it's a good approach.

댓글을 달려면 로그인하십시오.