Reduce the Size of Matrix

Isay on 22 Nov 2014
Commented: Stephen23 on 1 Dec 2014
I need to reduce the size of my matrices (Xbool_last and Xfreq_last), because at the 1000th iteration of the loop (docx = 1000) MATLAB reports "Out of memory". The loop runs from 1 to 5549.
Please look at this part of the code, with the explanatory comments in it:
%% The loop for exploring ALL the documents to create the tf-idf weight matrix
for docx = 1 : length(DBlast)
    docx   % no semicolon: prints the current document number as progress
    for word = 1 : length(DBlast{docx})
        % In docx, we go through all of its words
        word_xi = DBlast{docx}{word,1} ;
        for docy = 1 : length(DBlast)
            % While the source words are from docx, search for them in
            % the rest of the documents
            % if word_xi is found in document docy, vote 1
            if sum(strcmpi(DBlast{docy},word_xi)) ~= 0
                ind = find(strcmpi(DBlast{docy},word_xi) ~= 0) ;
                Xbool(word,docy) = 1 ;
                Xfreq(word,docy) = Freqlast{docy}(ind) ;
            else
                % else vote 0
                Xbool(word,docy) = 0 ;
                Xfreq(word,docy) = 0 ;
            end
        end
    end
    % Append this document's block of rows, then reset the temporaries
    Xbool_last = [Xbool_last;uint8(Xbool)];
    Xfreq_last = [Xfreq_last;uint8(Xfreq)];
    Xbool = [] ;
    Xfreq = [] ;
end
===============================================================================
So, my questions:
1. How can I reduce the size of Xbool_last and Xfreq_last? If I need to export the matrices to a .txt file (or some other format) in order to use them, how can I save and load them? Can you recommend some code?
2. How can I use the output of the code above in the tf-idf algorithm (if you know)? The tf-idf code is attached.

Accepted Answer

Guillaume on 22 Nov 2014
Edited: Guillaume on 22 Nov 2014
You're already using uint8 to store your values. There isn't a smaller type unless you start packing booleans into bits, which I assume is not possible for Xfreq_last anyway. Using bits to store booleans is bound to be slow and awkward in MATLAB. There's no built-in function for that.
However, your storage looks incredibly inefficient to me. Say you're processing the first word of the first document, and you find it in documents 2, 10, 150, 2048 and 4125, for example. For a start, instead of storing those five document numbers (which wouldn't take much memory: 20 bytes as uint32), you store a boolean array of size 1x5549 (5549 bytes as uint8) with only a few ones. But more importantly, when you process document 2, you're going to look for the exact same word, find it in the exact same documents, and store all of that again. Why?
Why not do the storage per word instead of per document and, for each word, just store which documents it's found in?
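To make the byte counts in that comparison concrete, here is a small check using the example document numbers from the paragraph above. The variable names hits and denseRow are only for illustration and do not appear in the original code.

```matlab
nDocs = 5549;                            % number of documents, from the question
hits  = uint32([2 10 150 2048 4125]);    % documents that contain the word (example above)

% Dense alternative: one uint8 flag per document, as one row of Xbool
denseRow = zeros(1, nDocs, 'uint8');
denseRow(hits) = 1;

% whos reports the memory used by each variable in bytes
infoHits  = whos('hits');
infoDense = whos('denseRow');
fprintf('index list: %d bytes, dense row: %d bytes\n', ...
        infoHits.bytes, infoDense.bytes);
```

Five uint32 values take 5 x 4 = 20 bytes, while the dense row takes one byte per document, so the gap grows with the number of documents, and the dense row is stored again for every repeated word.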
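One possible way to do that, as a rough sketch rather than a drop-in replacement: build the list of distinct words once and keep the counts in a sparse word-by-document matrix, so that only the non-zero entries use memory. The toy data below is made up, and the sketch assumes that DBlast{d} holds the distinct words of document d in its first column and that Freqlast{d}(k) is the count for DBlast{d}{k}, which is how the question's code appears to use them.

```matlab
% Hypothetical toy data in the layout assumed above
DBlast   = {{'cat'; 'dog'}, {'Dog'; 'fish'; 'cat'}, {'bird'}};
Freqlast = {[3; 1], [2; 5; 1], 4};

nDocs = numel(DBlast);

% Flatten every (word, document, count) triple into three aligned columns
allWords = cell(0, 1);
docIdx   = zeros(0, 1);
counts   = zeros(0, 1);
for d = 1:nDocs
    w = lower(DBlast{d}(:, 1));          % lower-case to mimic strcmpi
    allWords = [allWords; w];                     %#ok<AGROW>
    docIdx   = [docIdx; repmat(d, numel(w), 1)];  %#ok<AGROW>
    counts   = [counts; double(Freqlast{d}(:))];  %#ok<AGROW>
end

% One row per distinct word; wordIdx maps each triple to its row
[vocab, ~, wordIdx] = unique(allWords);

% Sparse word-by-document count matrix: only non-zero entries are stored
Xfreq = sparse(wordIdx(:), docIdx, counts, numel(vocab), nDocs);
Xbool = Xfreq > 0;                       % sparse logical presence matrix

% Which documents contain a given word?
docsWithCat = find(Xbool(strcmp(vocab, 'cat'), :));
```

Each word is looked up once instead of once per document it occurs in, and the triple loop with strcmpi disappears. Sparse matrices like these can be written to a MAT-file with save and read back with load, which would also cover the export part of the question.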
Comments: 1
Stephen23 on 1 Dec 2014
+1 for excellent advice on data management.
