Hi Matlab Central,
I am an inexperienced programmer looking to speed up the code I have. I know enough to go into profiler and look at what is taking a long time, and I think it is this bit here:
UniqueTFArray=unique(CombinedArray);
TFtable=zeros(size(AAA,1), length(UniqueTFArray));
for i=1:size(AAA, 1)
    for j=1:length(UniqueTFArray)
        TFtable(i,j)=~isempty(strfind(AAA.Regulator{i,1}, UniqueTFArray{1,j}));
    end
end
TFSum=sum(TFtable);
figure; bar(TFSum);
AAA is a few thousand rows long and UniqueTFArray has a few hundred entries, so, the way I have it written, I think the profiler is telling me the inner line gets called about 520,000 times, which makes it slow.
Now, I have a few ideas that I think could be of use.
Most of AAA.Regulator is empty, so length is 0. Should I put the strfind line in an if statement and only call it if the length is greater than 0? That would save time I think...
Or is there a fundamentally better approach?
Thank you very much!
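The "skip the empty cells" idea from the question can be sketched as below. This is only an illustration on made-up toy data: the variable Regulator stands in for AAA.Regulator, and the four names in UniqueTFArray are assumptions chosen to keep the example small.

```matlab
% Toy stand-ins (assumed data, not the real AAA.Regulator / UniqueTFArray)
Regulator     = {''; 'FOS; TAF1; POLR2A'; ''; 'POLR2A'; 'MAX; CTBP2'};
UniqueTFArray = {'FOS', 'MAX', 'POLR2A', 'TAF1'};

TFtable = false(numel(Regulator), numel(UniqueTFArray));
for i = 1:numel(Regulator)
    % An empty entry cannot contain any pattern, so skip the inner loop
    if isempty(Regulator{i})
        continue
    end
    for j = 1:numel(UniqueTFArray)
        TFtable(i, j) = ~isempty(strfind(Regulator{i}, UniqueTFArray{j}));
    end
end
TFSum = sum(TFtable, 1);   % number of rows in which each name was found
```

If most entries are empty, this removes most of the strfind calls while leaving the result unchanged, because the skipped rows stay all false.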

Accepted Answer

Bjorn Gustavsson on 17 Jan 2014

Votes: 0

Maybe you'd get some speedup by using strfind on the entire AAA.Regulator cell array at once and then handling the result of those matches separately:
% One strfind call per pattern, applied to the whole Regulator cell array
TempStrFindRes = cell(1, length(UniqueTFArray));
for i1 = 1:length(UniqueTFArray)
    TempStrFindRes{i1} = strfind(AAA.Regulator, UniqueTFArray{1,i1});
end
% Turn each cell array of match positions into a 0/1 column of TFtable
for i2 = 1:length(TempStrFindRes)
    TFtable(:,i2) = 1 - cellfun(@isempty,TempStrFindRes{i2});
end
If that's approximately what you want, then at least it cuts the overhead that the repeated calls to strfind would cause.
HTH
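As a rough sketch of the same idea, the two loops can be folded into one, so that each pattern is searched for in the whole cell array and its column is filled immediately. The data here are small made-up stand-ins (Regulator plays the role of AAA.Regulator), and preallocating TFtable as logical is a choice made for this example, not something taken from the answer.

```matlab
% Toy stand-ins (assumed data, not the real AAA.Regulator / UniqueTFArray)
Regulator     = {''; 'FOS; TAF1; POLR2A'; 'POLR2A'; 'MAX; CTBP2'};
UniqueTFArray = {'FOS', 'MAX', 'POLR2A', 'TAF1'};

% One logical column per pattern, one row per Regulator entry
TFtable = false(numel(Regulator), numel(UniqueTFArray));
for j = 1:numel(UniqueTFArray)
    % strfind accepts a cell array of strings as its first argument and
    % returns a cell array of match positions, one cell per entry
    hits = strfind(Regulator, UniqueTFArray{j});
    TFtable(:, j) = ~cellfun(@isempty, hits);
end
TFSum = sum(TFtable, 1);   % number of rows in which each name was found
```

This still loops over every pattern, but each iteration handles all rows in a single strfind call instead of one call per row.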

Comments: 2

Sarutahiko on 21 Jan 2014
Edited: Sarutahiko on 21 Jan 2014
Bjorn, Walter, thank you for your previous responses. Something urgent came up here so I am just now finally getting back to this.
Here are the two vectors:
Here is a small snippet of AAA.Regulator:
''
''
'FOS; TAF1; POLR2A'
'Pdx1; FOXI1; FOXD1; Foxq1; POLR2A; NR2F1'
'POLR2A'
'ZBTB7A; SPIB; ETS1; TRIM28; POLR2A'
'ZBTB7A; TRIM28; POLR2A'
'ZBTB7A; TRIM28; Hltf'
'MAX; CTBP2; TRIM28; FOXA1'
'MAX; EP300; CTBP2; FOXA1; TRIM28'
'MAX; EP300; CTBP2; FOXA1; TRIM28'
'MAX; EP300; CTBP2; FOXA1; TRIM28'
'MAX; EP300; CTBP2; FOXA1; TRIM28'
'MAX; CTBP2; EP300; GATA3; FOXA1; TRIM28'
'MAX; EP300; CTBP2; GATA3; TRIM28; FOXA1; NR3C1'
'MAX; EP300; CTBP2; GATA3; TRIM28; FOXA1; NR3C1'
'MAX; EP300; CTBP2; GATA3; TRIM28; FOXA1; NR3C1'
'MAX; EP300; CTBP2; GATA3; TRIM28; FOXA1; NR3C1'
'MAX; EP300; CTBP2; GATA3; TRIM28; FOXA1; NR3C1'
'MAX; EP300; CTBP2; GATA3; TRIM28; FOXA1; NR3C1'
'MAX; CTBP2; EP300; GATA3; FOXA1; TRIM28; NR3C1; MYC; BRCA1'
'MAX; CTBP2; EP300; GATA3; FOXA1; TRIM28; NR3C1; MYC; BRCA1'
'MAX; CTBP2; EP300; GATA3; FOXA1; TRIM28; NR3C1; MYC; BRCA1'
'E2F1; MAX; CTBP2; EP300; GATA3; FOXA1; TRIM28; NR3C1; MYC; BRCA1'
'E2F1; MAX; CTBP2; EP300; GATA3; FOXA1; TRIM28; NR3C1; MYC; BRCA1'
'E2F1; MAX; CTBP2; EP300; GATA3; FOXA1; TRIM28; NR3C1; MYC; BRCA1'
'E2F1; MAX; CTBP2; EP300; GATA3; FOXA1; TRIM28; NR3C1; MYC; BRCA1'
'E2F1; MAX; SOX10; CTBP2; EP300; SP1; GATA3; TRIM28; FOXA1; NR3C1; MYC; BRCA1'
'E2F1; MAX; SOX10; CTBP2; EP300; SP1; GATA3; TRIM28; FOXA1; NR3C1; MYC; BRCA1'
'E2F1; MAX; SOX10; CTBP2; EP300; SP1; GATA3; TRIM28; FOXA1; NR3C1; MYC; BRCA1'
'E2F1; MAX; CTBP2; EP300; SP1; GATA3; TRIM28; FOXA1; NR3C1; MYC; BRCA1'
'E2F1; MAX; CTBP2; EP300; SP1; GATA3; TRIM28; FOXA1; NR3C1; MYC; BRCA1'
'E2F1; MAX; CTBP2; EP300; SP1; GATA3; TRIM28; FOXA1; NR3C1; MYC; BRCA1'
'E2F1; MAX; CTBP2; EP300; SP1; GATA3; TRIM28; FOXA1; NR3C1; MYC; BRCA1'
'E2F1; SOX10; CTBP2; Hltf; FOXA1; TRIM28; NR3C1; USF1; BRCA1; POLR2A; MAX; EP300; SP1; GATA3; MYC'
'CCNT2; E2F1; TAF1; CTBP2; E2F6; TAF7; CTCF; TBP; USF1; BRCA1; POLR2A; MAX; RAD21; EP300; SP1; HEY1; JUND; TFAP2A; TFAP2C; NELFE'
'CCNT2; E2F1; TAF1; CTBP2; E2F6; TAF7; TBP; CTCF; USF1; POLR2A; MAX; RAD21; SP1; HEY1; JUND; TFAP2A; TFAP2C; NELFE'
'CCNT2; E2F1; TAF1; CTBP2; E2F6; TAF7; TBP; CTCF; POLR2A; RAD21; HEY1; JUND; TFAP2A; TFAP2C; NELFE'
'E2F1; CCNT2; TAF1; HEY1; TAF7; JUND; TFAP2A; CTCF; TFAP2C; NELFE; POLR2A'
'E2F1; CCNT2; TAF1; HEY1; TAF7; JUND; TFAP2A; CTCF; TFAP2C; NELFE; POLR2A'
'E2F1; CCNT2; TAF1; HEY1; TAF7; TFAP2A; TFAP2C; POLR2A'
'E2F1; TFAP2A; TFAP2C; POLR2A; TAF7'
'POLR2A'
'POLR2A'
'POLR2A'
'POLR2A'
'POLR2A'
'POLR2A'
'POLR2A'
'POLR2A'
'POLR2A'
'ZNF263; ZBTB7A; CEBPB; RFX5; CREB1; YY1; FOXA1; CTCF; ZNF143; SMC3; SUZ12; SIN3A; RAD21; JUND; Ddit3::Cebpa; TCF12'
'ZNF263; NFE2L1::MafG; ZBTB7A; CEBPB; RFX5; YY1; FOXA1; CTCF; ZNF143; SMC3; SUZ12; GATA2; SIN3A; RAD21; JUND; TCF12'
'SUZ12; ZNF263; CTCF; Hand1::Tcfe2a'
'E2F1; ZNF263; CTCF'
'ZNF263; NR4A2; Mafb; ZEB1'
'E2F1; CTCF; E2F6; POLR2A'
'E2F1; CTCF; E2F6; POLR2A'
'E2F1; CTCF; E2F6; POLR2A'
'E2F1; CTCF; POLR2A'
'POLR2A'
'POLR2A'
'POLR2A'
'POLR2A'
''
''
''
''
''
''
''
''
''
''
''
'Mafb; Arnt::Ahr; CTCF; FOXC1'
'TFAP2A; CTCF'
'CTCF; ZEB1'
'CTCF; ZEB1'
'TBP'
'ARID3A'
''
''
'E2F1; EGR1; TAF1; ELF1; CEBPB; YY1; FOXA1; TBP; CTCF; '
'E2F1; EGR1; TAF1; ELF1; CEBPB; YY1; FOXA1; PAX5; TBP; '
'E2F1; EGR1; TAF1; ELF1; CEBPB; YY1; FOXA1;'
''
''
''
''
''
''
''
''
''
''
''
UniqueTFArray is a 1x235 vector that looks like this:
E2F1; EGR1; TAF1; ELF1; CEBPB; YY1; FOXA1; and so on, 235 entries in total.
I may not understand Bjorn's answer fully, but it seems to me like his code would only work for 1 cell of UniqueTFArray. So it would only produce a single vector. So I would have to call it 235 times, which is in effect what I already do. Am I missing something there?
Thanks very much for your help!
Have you tried my suggestion? Did it work? My simplistic test with guessed parameters indicated that the results were identical, but I did not build arrays as large as yours to test with, so I have no timings for the two versions.
The difference is that your double loop calls strfind length(AAA.Regulator)*length(UniqueTFArray) times, once per pair. However, strfind accepts a cell array of strings as its first input argument, and that is what I use: for each element of UniqueTFArray I look for matches in all of AAA.Regulator at once, so my version calls strfind only length(UniqueTFArray) times. Each call returns a cell array of match positions, and the second loop combines these into TFtable. The tradeoff is that my version makes far fewer calls to strfind but has to call cellfun in the second loop instead. The question is which version causes more overhead...
HTH
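One way to settle the overhead question is to run both versions on the same input, check that the tables agree, and compare tic/toc timings. The sketch below does this on synthetic data: the repeated four-entry block, the seven names, and all variable names are assumptions made for the test, so the timings say nothing definite about the real AAA.Regulator.

```matlab
% Synthetic test data (assumed): 2000 rows built from a repeated block
block         = {''; 'FOS; TAF1; POLR2A'; 'POLR2A'; 'MAX; CTBP2; TRIM28; FOXA1'};
Regulator     = repmat(block, 500, 1);
UniqueTFArray = {'CTBP2', 'FOS', 'FOXA1', 'MAX', 'POLR2A', 'TAF1', 'TRIM28'};
nRows = numel(Regulator);
nTF   = numel(UniqueTFArray);

% Version 1: one strfind call per (row, pattern) pair
tic
tableLoop = false(nRows, nTF);
for i = 1:nRows
    for j = 1:nTF
        tableLoop(i, j) = ~isempty(strfind(Regulator{i}, UniqueTFArray{j}));
    end
end
tLoop = toc;

% Version 2: one strfind call per pattern over the whole cell array
tic
tableCell = false(nRows, nTF);
for j = 1:nTF
    tableCell(:, j) = ~cellfun(@isempty, strfind(Regulator, UniqueTFArray{j}));
end
tCell = toc;

sameResult = isequal(tableLoop, tableCell);
fprintf('double loop: %.4f s, per-pattern strfind: %.4f s, identical: %d\n', ...
        tLoop, tCell, sameResult);
```

Single tic/toc measurements are noisy, so it is worth repeating the run a few times, ideally on data of the real size, before drawing a conclusion.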


More Answers (2)

Walter Roberson on 17 Jan 2014

Votes: 0

There is no point in searching for a string that will never be found.
Question: is AAA.Regulator only unique words, or are you ending up searching multiple times for some words?
Do I understand correctly that the point of the code is to count the number of times that each word of a corpus of words appears in each subset? And to check, are you looking for exact matches, a whole word matching a whole word, or are you looking for the case where the words in AAA.Regulator{i,1} appear anywhere within any of the words? For one thing, if you are looking for whole-word matches then you can "break" out of the "for j" loop as soon as a match occurs.
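The distinction between whole-word and anywhere-in-the-string matching matters for names like those in this thread, where one name can sit inside another. A small sketch, assuming the entries are semicolon-separated lists as in the sample posted in the thread (the variable names are made up for the example):

```matlab
% One Regulator-style entry; it contains CTBP2 but not TBP
entry = 'MAX; CTBP2; TRIM28; FOXA1';

% Substring search: 'TBP' is found inside 'CTBP2'
substringHit = ~isempty(strfind(entry, 'TBP'));

% Whole-name search: split on ';', trim blanks, compare complete names
tokens   = strtrim(regexp(entry, ';', 'split'));
tokenHit = any(strcmp(tokens, 'TBP'));

fprintf('substring match: %d, whole-name match: %d\n', substringHit, tokenHit);
```

With strfind this row would be counted towards TBP even though only CTBP2 is listed, so the counts for short names can be inflated unless whole names are compared.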

Comments: 3

Sarutahiko on 17 Jan 2014
Edited: Sarutahiko on 17 Jan 2014
Hello,
Thank you very much for hearing me out. I will try to go point by point to answer your questions.
Yes, the point of the code is to search a short string for another string. UniqueTFArray contains the strings I want to match. There is only one word per cell, of course. AAA.Regulator contains a list of usually no more than 5 or so strings, if it contains any at all. However, I need the code to continue checking all subsequent cells for each string in UniqueTFArray (in other words, it should not break out of the loop if a match occurs; it should catalog all such events).
I am looking for exact matches of a string; it's actually a gene ID, not a whole word. Yes, the gene ID can appear anywhere in the string in AAA.Regulator, but they do appear in alphabetical order.
My best thought is to tell the code to skip all the empty cells in the array without evaluating the strfind and other commands in that line.
I would be happy to post a sample of AAA.Regulator and UniqueTFArray if it would help.
In summary, the overall goal is to record every time one of my unique strings occurs in a cell within AAA.Regulator. Later in the code, these are summed so I can keep track of how many times each gene appeared.
Once again, thank you very much for your time and attention!
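The goal described in this comment (exact gene ID matches, empty cells skipped, occurrences summed per gene) could be sketched as follows. It assumes the entries are semicolon-separated lists and swaps strfind for splitting plus ismember, which is a different technique from the one in the question. The data and the variable names Regulator, tokens and counts are made up for the example.

```matlab
% Toy stand-ins (assumed data, not the real AAA.Regulator / UniqueTFArray)
Regulator = {''; 'FOS; TAF1; POLR2A'; 'POLR2A'; ...
             'E2F1; CTCF; E2F6; POLR2A'; 'E2F1; EGR1; TAF1; '};
UniqueTFArray = {'CTCF', 'E2F1', 'E2F6', 'EGR1', 'FOS', 'POLR2A', 'TAF1'};

counts = zeros(1, numel(UniqueTFArray));
for i = 1:numel(Regulator)
    if isempty(Regulator{i})
        continue                      % nothing to match in an empty cell
    end
    % Split on ';', trim blanks, drop empty pieces left by a trailing ';'
    tokens = strtrim(regexp(Regulator{i}, ';', 'split'));
    tokens = tokens(~cellfun(@isempty, tokens));
    if isempty(tokens)
        continue
    end
    tokens = unique(tokens);          % count each gene once per row
    % Exact, whole-name lookup of every token in UniqueTFArray
    [found, idx] = ismember(tokens, UniqueTFArray);
    counts(idx(found)) = counts(idx(found)) + 1;
end
% counts(k) is the number of rows that list UniqueTFArray{k}
```

Here counts plays the role of TFSum, and the inner loop over all patterns is replaced by one ismember call per non-empty row. If the full 0/1 table is also needed, the same idx values can be used to set entries of a logical matrix instead.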
A sample AAA.Regulator entry and a sample entry from UniqueTF would help.
Sarutahiko on 21 Jan 2014
Walter - thanks again. I included the two vectors in the comment to Bjorn's answer, below.



Asked: 17 Jan 2014

Commented: 21 Jan 2014
