Preserving node names in a digraph

Question

I am constructing a large digraph with between 10k and 100k nodes, in which I want to add, delete, and merge nodes. The nodes represent objects whose other data is stored externally and indexed numerically, so the node IDs must be preserved in order to refer correctly to the related data.

Is there a way of preserving node ids in a graph, other than giving the nodes string names?

In the following code

from_node=[1 1 2 3 4 4 5 6 7 3];
to_node=  [3 2 5 7 6 5 7 7 2 4];
weights=rand(size(from_node));
g=digraph(from_node, to_node, weights);
h=rmnode(g,2);

removing node 2 renumbers the remaining nodes, so a different node becomes node 2. The only way around this seems to be specifying node names, which must be strings, like this:

from_node=[1 1 2 3 4 4 5 6 7 3];
to_node=  [3 2 5 7 6 5 7 7 2 4];
weights=rand(size(from_node));
names = cellstr(string(1:7));
g=digraph(from_node, to_node, weights,names);
h=rmnode(g,findnode(g,num2str(2)));
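
To make the renumbering concrete, here is a minimal sketch on a four-node chain (the small graph is made up for illustration and is not the one from the question). It shows the numeric index shifting after rmnode while the name stays attached to the same node.

```matlab
% Chain 1->2->3->4 without node names
g = digraph([1 2 3], [2 3 4]);
h = rmnode(g, 2);                 % removes node 2 and its incident edges
nNodes = numnodes(h);             % 3 nodes remain, numbered 1..3
remaining = h.Edges.EndNodes;     % the old edge 3->4 is now listed as 2->3

% Same chain with node names '1'..'4'
gn = digraph([1 2 3], [2 3 4], [], cellstr(string(1:4)));
hn = rmnode(gn, '2');
idxOf4  = findnode(hn, '4');      % numeric index of the node named '4'
nameAt3 = hn.Nodes.Name{3};       % name stored at numeric index 3
```

Here idxOf4 is 3 and nameAt3 is '4': the index moved, the name did not.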

This is fine for small graphs, but for very large graphs that must be modified it is extremely memory-inefficient, since you are forced to store a giant table of strings that merely duplicates the numeric node IDs.

Moreover, every lookup then needs a findnode search that involves converting the number to a string, which could also be costly if done many times.

Therefore, I am wondering if there is a more efficient way of preserving node ids upon insertion/deletion than using the names?

Thanks!!

Answer 1

Walter Roberson, 1 Mar 2018

Convert the node numbers to base 2^16, apply char() to the result, and use those as the node names. For node numbers no larger than 100k, this takes two characters (4 bytes) each, plus any overhead from the cell array.

Michael, 1 Mar 2018

How would you suggest doing this efficiently? dec2base only supports bases up to 36, and I don't want to add a heavy computational load if I have to convert thousands of indices at once.

Thanks!

Walter Roberson, 1 Mar 2018

Labels = char(reshape(typecast(uint32(Indices),'uint16').',2,[]).');
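
A hedged round-trip sketch of the typecast one-liner above. The values in Indices are made-up examples, and the decode line is an assumption about how one might invert the encoding; it was not part of the answer. The sketch only checks that the encoding is reversible and does not verify how digraph treats such names.

```matlab
Indices = [1; 65535; 65536; 100000];      % example external IDs (assumed)

% Encode: each uint32 is split into two uint16 code units, one char each
Labels = char(reshape(typecast(uint32(Indices), 'uint16').', 2, []).');

% Decode: interleave the code units again and reinterpret them as uint32
decoded = typecast(reshape(uint16(Labels).', 1, []), 'uint32');
roundTripOK = isequal(decoded(:), uint32(Indices));

% One 1x2 char per node; num2cell keeps every row at exactly two characters
names = num2cell(Labels, 2);
```

typecast uses the machine's native byte order, so the order of the two characters depends on the platform. num2cell is used here instead of cellstr because cellstr may strip trailing whitespace from a row, which would change a label.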

Answer 2

Christine Tobler, 1 Mar 2018

Edited: Christine Tobler, 1 Mar 2018

Unfortunately, there is no direct way of doing this. The graph and digraph classes are designed to be fast when working on an existing graph, but this came at the cost of being relatively slow when adding and removing nodes one at a time.

To avoid having to convert the numbers to strings, you could construct and maintain two vectors that convert between the external indices and the graph indices, one for each direction. For example like this:

maxExtInd = 1e6;
s = [1234 6543 765];
t = [6543 765 1234];
% graph2ext(indexIntoGraph) returns externalIndex
graph2ext = unique([s(:); t(:)]);
% ext2graph(externalIndex) returns indexIntoGraph 
%   (or zero if externalIndex is not in the graph)
ext2graph = sparse(maxExtInd, 1);
ext2graph(graph2ext) = 1:numel(graph2ext);
% Construct the graph:
g = graph(full(ext2graph(s)), full(ext2graph(t)));
graph2ext(g.Edges.EndNodes)
plot(g, 'NodeLabel', graph2ext);
conversionTable = [find(ext2graph(:)), nonzeros(ext2graph)]
% Add a node:
newNode = 456;
assert(ext2graph(newNode) == 0); % Check the node ID is not already in the graph
g = addnode(g, 1);
graph2ext(end+1) = newNode;
ext2graph(newNode) = numnodes(g);
figure;
plot(g, 'NodeLabel', graph2ext);
conversionTable = [find(ext2graph(:)), nonzeros(ext2graph)]
% Remove a node:
nodeToRemove = 1234;
graphNodeToRemove = ext2graph(nodeToRemove);
g = rmnode(g, graphNodeToRemove);
graph2ext(graphNodeToRemove) = [];
ext2graph(nodeToRemove) = 0;
ext2graph(ext2graph > graphNodeToRemove) = ext2graph(ext2graph > graphNodeToRemove) - 1;
figure;
plot(g, 'NodeLabel', graph2ext);
conversionTable = [find(ext2graph(:)), nonzeros(ext2graph)]
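
As a hedged follow-up to the example above, the sketch below rebuilds the same three-node graph, answers a query in terms of external IDs, and then checks that the two maps still agree after a removal. The variable names nbrExt, k, shift, and mapsAgree are introduced here for illustration.

```matlab
% Rebuild the small example
maxExtInd = 1e6;
s = [1234 6543 765];
t = [6543 765 1234];
graph2ext = unique([s(:); t(:)]);
ext2graph = sparse(maxExtInd, 1);
ext2graph(graph2ext) = 1:numel(graph2ext);
g = graph(full(ext2graph(s)), full(ext2graph(t)));

% Query by external ID and report the answer as external IDs
nbrExt = graph2ext(neighbors(g, full(ext2graph(1234))));

% Remove external node 1234 and update both maps
k = full(ext2graph(1234));
g = rmnode(g, k);
graph2ext(k) = [];
ext2graph(1234) = 0;
shift = ext2graph > k;
ext2graph(shift) = ext2graph(shift) - 1;

% Invariant: mapping graph -> external -> graph returns 1..numnodes
mapsAgree = isequal(full(ext2graph(graph2ext)), (1:numnodes(g))');
```

Running a check like mapsAgree after each batch of edits is a cheap way to catch a map that was not updated along with the graph.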

Michael, 2 Mar 2018

Thank you! I think this is a very nice solution. I'm wondering about your thoughts on the trade-off between speed and memory in this particular situation.

In this case, we have to maintain the graph plus a 1-by-num_nodes array of 8-byte doubles and a sparse vector holding num_nodes 8-byte doubles, versus a table column of num_nodes 1x2 char arrays at 4 bytes each. For 1000 entries, factoring in the overhead of the table object, I think keeping the table char array is a ~6.5x memory saving. However, with the two vectors we don't have to mess with conversions.

Do you think the lookup in the sparse/double arrays will be faster than calling findnode on the proper node name?

I was able to improve on the speed of the previous suggestion for conversion, assuming fewer than 2^32 entries, using

char([floor(num./65536) rem(num,65536)])

to convert and

sum(double(cell2mat(nodenames)).*[65536 1],2)

to reverse, but there is certainly overhead using the findnode() functions within the digraph object and conversion to cell arrays of chars needed to use the digraph object.
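
A minimal round-trip sketch of the two expressions above, assuming num is a column vector of IDs and that nodenames is an N-by-1 cell of 1x2 char vectors (both the example values and the num2cell step are assumptions). The .* with a 1x2 row relies on implicit expansion, available from R2016b.

```matlab
num = [1; 65535; 65536; 100000];                     % example IDs (assumed)

% Encode arithmetically: high 16 bits first, low 16 bits second
codes = char([floor(num./65536) rem(num, 65536)]);   % N-by-2 char
nodenames = num2cell(codes, 2);                      % N-by-1 cell of 1x2 char

% Decode: weight the two code units by 65536 and 1, then sum across each row
back = sum(double(cell2mat(nodenames)) .* [65536 1], 2);
roundTripOK = isequal(back, num);
```

Note that this version puts the high word first, whereas the typecast version follows the machine's byte order (low word first on a little-endian machine), so labels from the two encodings should not be mixed.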

You mentioned that it is optimized to be fast for operations but slow for manipulation and I see this to be true. When testing out my code, the biggest overhead is in adding an edge which calls expandTable(), which is very costly.

What is it about the table object that makes it optimal to design the graph object using it rather than just defining the nodes as a sparse and the edges as either a binary sparse or double sparse in the case of a weighted digraph? I'm very interested in what data structures are best for what jobs.

Thanks so much!

Christine Tobler, 5 Mar 2018

Hi Michael,

With the table char array, you should factor in not only the cost for each 4-byte char array, but also the additional mxArray header (which for each element of a cell array, specifies its datatype and additional information). Also, the sparse array indexing will do a binary search, which the graph object's findnode is not (currently) doing on its node names.

You're right about expandTable being the main overhead. If there are no node or edge properties (that is, if you store node names and edge weights separately during the loop), this overhead should decrease drastically.
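
One way to read this suggestion, sketched under assumptions: keep the weights in an ordinary array while edges are added, and attach them to the graph once at the end. The arrays newEdges and newW are hypothetical stand-ins for data arriving inside a loop, and the sketch makes no timing claims.

```matlab
nNodes   = 5;
newEdges = [1 2; 3 1; 2 3];        % hypothetical edges, in arrival order
newW     = [0.5; 2.5; 1.5];        % their weights, kept outside the graph

g = addnode(digraph, nNodes);      % nodes only: no Name or Weight variables
for k = 1:size(newEdges, 1)
    g = addedge(g, newEdges(k, 1), newEdges(k, 2));
end

% Attach the weights once, after the loop. findedge maps each (s,t) pair
% to its row in g.Edges, which is sorted and not in insertion order.
idx = findedge(g, newEdges(:, 1), newEdges(:, 2));
wt = zeros(numedges(g), 1);
wt(idx) = newW;
g.Edges.Weight = wt;
```

The findedge step matters because the rows of g.Edges are ordered by source and target node, so the weights cannot simply be assigned in arrival order.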

The table object is not used to represent the structure of the graph object internally; we are using it only for the node and edge properties. There, it has the advantage of allowing the storage of properties of arbitrary datatypes in a simple manner. For cases where the graphs are modified many times, there is unfortunately a large overhead associated with the nodes and edges tables.
