Convert files into a matrix

hi,
I have 177000 files, and I have to create one matrix containing all the values in these files.
Each file is split using textscan to get
c{1},c{2},........
which are then converted into a matrix.
Then these matrices are combined into one matrix.
The problem is that these files contain some identical values, so I have to identify the identical values and gather all the other values attached to them (the rest of the row) together with these values.
I tried running with 100 files to measure the running time, and found that it is very long even for just 100 files.
I think that if I could find a function that compares c{1} across all files, c{2} across all files, etc., that would save time. I'm facing this problem with the following code:
targetdir = 'd:\social net\dataset\netflix\training_set';
targetfiles = '*.txt';
fileinfo = dir(fullfile(targetdir, targetfiles));
k=0;arr(:,:)=0; inc=0;k=0;y=1;
for i = 1: length(fileinfo)
    thisfilename = fullfile(targetdir, fileinfo(i).name);
    f=fopen(thisfilename,'r'); f1=fscanf(f,'%c'); f1(1:2)=[];
    f2=fopen(thisfilename,'w'); fprintf(f2,'%c',f1);
    f3=fopen(thisfilename,'r');
    c = textscan(f,'%f %f %s','Delimiter',',','headerLines',1);
    c1=c{1};c2=c{2}; c3=c{3};z=1;z1=1;z2=1;z3=0;
    for k=1+k:length(c1)+inc
        no=c1(z); arr1=arr(:,1); p=find(arr1==no);
        if isempty(p)
            j=1;
            arr(y,j)=c1(z); arr(y,j+1)=i; arr(y,j+2)=c2(z);j=j+3;y=y+1;
        else
            ind(i,z1)=p;
            L=arr(p,:);len=0;
            for h=1:length(L)
                if L(h)~=0
                    len=len+1;
                end
            end
            len;
            arr(p,len+1)=i;
            arr(p,len+2)=c2(z);
            z1=z1+1;
        end
        z=z+1;
    end
    inc=inc+length(c1);
    [u,u1] =size(arr);
end
f4=fopen('netfile.txt','w');
for i=1:u
    for j=1:u1
        fprintf(f4,'%d ',arr(i,j));
    end
    fprintf(f4,'\n');
end
fclose all;
thanks

Comments: 1

huda nawaf on 8 Nov 2011
Please, I need advice about the above code.
Maybe someone can suggest improvements to make it run more easily.
Why does it run so very slowly?
How can I improve it?
thanks in advance


Accepted Answer

Daniel Shub on 9 Nov 2011

0 votes

What version of MATLAB are you using? It looks like arr is growing inside your loop. Prior to R2011a (???), preallocating a variable can speed things up. If you do not know the final size, reallocating in large chunks can speed things up.
Where are the files saved (locally, network drive, flash drive, external hard drive)? A fast internal hard drive will give you the fastest read times.
Have you tried using the profiler to find bottlenecks in the code?
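A minimal sketch of the two ideas above: preallocate, then grow in large chunks when the final size is unknown. The chunk size of 1000 rows, the three-column layout, and the names chunk and nUsed are assumptions for illustration and are not taken from the original code.

```matlab
chunk = 1000;                 % assumed chunk size; tune for your data
arr   = zeros(chunk, 3);      % preallocate the first chunk
nUsed = 0;                    % number of rows actually filled

for k = 1:2500                % stand-in for the real data loop
    if nUsed == size(arr, 1)
        % Out of room: append one whole chunk instead of a single row
        arr = [arr; zeros(chunk, size(arr, 2))];
    end
    nUsed = nUsed + 1;
    arr(nUsed, :) = [k, 2 * k, 3 * k];
end

arr = arr(1:nUsed, :);        % trim the unused rows at the end
```

With this pattern the array is reallocated only once per chunk rather than once per row, so a larger chunk trades memory for fewer reallocations.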

Comments: 3

Jan on 9 Nov 2011
Pre-allocation is extremely useful in 2011b also.
Daniel Shub on 9 Nov 2011
Is this a regression from 2011a to 2011b, or are the improvements in 2011a not as great as I thought: http://blogs.mathworks.com/steve/2011/05/16/automatic-array-growth-gets-a-lot-faster-in-r2011a/
huda nawaf on 9 Nov 2011
thanks Daniel,
"What version of MATLAB are you using?"
MATLAB 7.
"It looks like arr is growing in your loop."
Yes.
"Prior to r2011a (???) preallocating a variable can speed things up."
How do I preallocate and reallocate?
"Where are the files saved (locally, network drive, flash drive, external harddrive)? A fast internal harddrive will give you the fastest read times."
My files are stored on the D:\ partition of my computer.
"Have you tried using the profiler to find bottlenecks in the code."
Please tell me how to use the profiler.
This code is very important to me.
thanks
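
On the profiler question, a minimal sketch of one way to use it from the command line. The loop is only a made-up stand-in workload; in practice it would be replaced by a call to your own script or function.

```matlab
profile on                    % start collecting timing data

% Stand-in workload: an array that grows on every iteration
arr = 0;
for k = 1:20000
    arr(k, 1) = k;
end

profile off                   % stop collecting
profile viewer                % open the report in the Profiler window
```

The report lists the time spent per function and per line, which should show whether the file reading or the growth of arr dominates.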


More Answers (1)

Jan on 9 Nov 2011

0 votes

Some general advice for improving the speed:
  • One command per line only - otherwise the JIT acceleration loses its power.
  • Avoid no-op statements such as "len;" - they waste time.
  • Deleting the first two bytes from the file and rewriting it takes a lot of time. It is better to open the file, read two bytes, and call TEXTSCAN afterwards.
  • Close every file properly with fclose(fid) as soon as possible. Do not leave all files open until the final fclose('all'). Open files consume resources.
  • Use the vectorization of fprintf. Instead of for j=1:u1, fprintf(f4,'%d ',arr(i,j)); end prefer fprintf(f4, '%d ', arr(i, :)).
  • Counting the number of non-zero elements in L does not need a loop. Faster: len = sum(L ~= 0);.
  • arr(:, :) = 0 is not useful, because it is equivalent to arr = 0. k is defined twice.
I cannot insert a pre-allocation, because I do not know the maximal possible size of "arr". But this should be faster already:
function wwq
targetdir = 'd:\social net\dataset\netflix\training_set';
targetfiles = '*.txt';
fileinfo = dir(fullfile(targetdir, targetfiles));
arr = 0; % Better pre-allocate
inc = 0;
kk = 0;
y = 1;
for i = 1:length(fileinfo)
    thisfilename = fullfile(targetdir, fileinfo(i).name);
    f = fopen(thisfilename,'r');
    fread(f, 2, 'uint8'); % Skip two bytes
    c = textscan(f, '%f %f %s', 'Delimiter', ',', 'headerLines', 1);
    fclose(f);
    c1 = c{1};
    c2 = c{2};
    % c3=c{3}; % Not used
    z = 1;
    % z1 = 1; % Not used
    % z2 = 1; % Not used
    % z3 = 0; % Not used
    kknew = length(c1) + inc;
    for k = (1 + kk):kknew % Avoid k as counter *and* in loop index
        no = c1(z);
        p = find(arr(:, 1) == no);
        if isempty(p)
            arr(y, 1) = c1(z);
            arr(y, 2) = i;
            arr(y, 3) = c2(z);
            % j = j+3; % Not used
            y = y + 1;
        else
            % ind(i,z1) = p; % Not used
            L = arr(p, :);
            len = sum(L ~= 0);
            arr(p, len + 1) = i;
            arr(p, len + 2) = c2(z);
            % z1 = z1 + 1; % Not used
        end
        z = z + 1;
    end
    kk = kknew;
    inc = inc + length(c1);
    u = size(arr, 1);
end
f = fopen('netfile.txt','w');
for i = 1:u
    fprintf(f, '%d ', arr(i, :));
    fprintf(f,'\n');
end
fclose(f);
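
Two of the tips above, the vectorized fprintf and len = sum(L ~= 0), can be checked on a small example. This is only a sketch: the matrix is made up, and sprintf is used in place of fprintf (same format handling) so that the two results can be compared as strings.

```matlab
arr = [5 1 3 0 0; 7 2 4 3 1];     % small made-up matrix

% Counting non-zeros: explicit loop versus sum(L ~= 0)
L = arr(1, :);
lenLoop = 0;
for h = 1:length(L)
    if L(h) ~= 0
        lenLoop = lenLoop + 1;
    end
end
lenVec = sum(L ~= 0);

% Writing one row: element-by-element loop versus a single call
sLoop = '';
for j = 1:size(arr, 2)
    sLoop = [sLoop, sprintf('%d ', arr(2, j))];
end
sVec = sprintf('%d ', arr(2, :));

disp(lenLoop == lenVec)           % both counts agree
disp(strcmp(sLoop, sVec))         % both strings agree
```

Both counting variants treat a stored 0 as an empty slot, so a genuine value of 0 in c{2} would be miscounted either way.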

Comments: 2

huda nawaf on 10 Nov 2011
thanks Jan,
I tried to run the code you wrote,
but in this part
fread(f, 2, 'uint8'); % Skip two bytes
c = textscan(f, '%f %f %s', 'Delimiter', ',', 'headerLines', 1);
c returns just the second line; I need to read from the second line to the last line.
thanks
huda nawaf on 16 Nov 2011
hi Jan,
I tried your code, but I have the same problem.
I tried just 1000 files, and the running time is very long, maybe 45 minutes for just 1000 files. What if I run 177000 files?
Maybe a sparse matrix can solve this problem.
thanks
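
The sparse-matrix idea mentioned above could look roughly like this. It is only a sketch under assumptions: c{1} is taken to hold IDs and c{2} a numeric value, the file loop is replaced by made-up in-memory data, and the names ids, fileIdx, vals, uid and S are hypothetical. Instead of searching arr with find for every row, all (ID, file index, value) triplets are collected first and the matrix is built in one call.

```matlab
% Made-up stand-in for what textscan returns for three files
c1 = {[11; 42; 7], [42; 99], [7; 11; 99]};   % c{1} per file (IDs)
c2 = {[ 3;  5; 1], [ 4;  2], [5;  4;  1]};   % c{2} per file (values)

% First pass: count rows so the triplet vectors can be preallocated
n = 0;
for i = 1:numel(c1)
    n = n + numel(c1{i});
end
ids     = zeros(n, 1);
fileIdx = zeros(n, 1);
vals    = zeros(n, 1);

% Second pass: fill the triplets without growing any array
pos = 0;
for i = 1:numel(c1)
    m = numel(c1{i});
    ids(pos+1:pos+m)     = c1{i};
    fileIdx(pos+1:pos+m) = i;
    vals(pos+1:pos+m)    = c2{i};
    pos = pos + m;
end

% Map each distinct ID to a row number, then build the matrix at once
[uid, dummy, row] = unique(ids);   % uid(row) reproduces ids
S = sparse(row, fileIdx, vals, numel(uid), numel(c1));
```

Here S(r, i) is the value that ID uid(r) has in file i, and nnz(S(r, :)) plays the role of the len count. sparse adds the values of duplicate (row, column) pairs, so this assumes each ID appears at most once per file.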

