Find all cycles (or the initial node of each cycle) in a very large directed graph (12m nodes, 6.7m edges)

조회 수: 7 (최근 30일)
Hi if anyone could help me out with this problem I have been stuck for a while and would be really grateful for your help,
Context
I have about 7 million transactions between accounts, where a transaction is a single row that describes who the buyer account and seller account are, and the object that was sold. So an example of one row would be: [B:0x3, S:0x1, O:Object1] or to better understand 0x1 sold Object1 to 0x3
I have concatenated the object to the buyer and seller aka 0x1-Object1 -> 0x3-Object1 (because an account can sell multiple different objects) and created a directed graph from this data to create a history of how all objects have transferred between accounts.
Problem:
I want a list of all the accounts where an object has returned back to the account it originally sold from, so for example:
If I had the transactions such that: Account1-Object1 -> Account2-Object1 -> Account1-Object1 then the output should be: Account1-Object1
because the asset eventually cycled back to Account1.
Current Non-Feasible Solution:
this is interpreted as: 0xc sold Object1 to 0x5, 0x5 sold Object2 to 0xc, 0xc sold Object3 to 0x0, 0x5 sold Object5 to 0x0, etc
My code so far:
%concatenate Object to buyer and seller
sales.BuyerAndObject = strcat(string(sales.Buyer),"-",string(sales.Object));
sales.SellerAndObject= strcat(string(sales.Seller),"-",string(sales.Object));
buyer = transpose(sales.BuyerAndObject);
seller = transpose(sales.SellerAndObject);
%Directed Graph: Seller -> Buyer
G = digraph(seller,buyer);
%I have then been trying to use allcycles(G) to find all the cycles,
cycles = allcycles(G)
then return the first node of each cycle, but the data is too massive as I have about 12 Million nodes and 6.7 MIllion edges and the time to completion is beyond reasonable. I am not great at graph theory and at a loss for how to go about what I need to do, and any help is greatly apreciated.

채택된 답변

Christine Tobler
Christine Tobler 2021년 8월 9일
편집: Christine Tobler 2021년 8월 9일
Have you tried using 'MaxNumCycles' to only compute the first 100 / 1000 / 1e4 cycles? That could give some indication if it's computing each cycle that's slower than acceptable for your problem, or if there are just very many cycles.
It might also be worthwhile to take a look at some of these cycles, to make sure they are all legitimate results (that is, perhaps you're seeing several cycles that can be recombined in different ways, which could cause there to be many more cycles returned than you'd expect).
  댓글 수: 7
Christine Tobler
Christine Tobler 2021년 8월 10일
Thank you for posting on this! It's always good to see comments on newer functions; I'll see if we can't make allcycles be faster for this case in a future release - we may not have thought of the case when a graph has millions of nodes but very few cycles when checking performance of this function.
Christine Tobler
Christine Tobler 2022년 5월 9일
Here is the bug report for this issue:
The performance problem in allcycles for very large graphs with few cycles was resolved for R2021b Update 1 and for R2022a from the initial release.

댓글을 달려면 로그인하십시오.

추가 답변 (0개)

카테고리

Help CenterFile Exchange에서 Graph and Network Algorithms에 대해 자세히 알아보기

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by