How do I keep updating and accumulating my arrays as I read multiple files one after the other

Question

Blaise, 19 April 2013

Direct link to this question:

https://kr.mathworks.com/matlabcentral/answers/72819-how-do-i-keep-updating-and-accumulating-my-arrays-as-i-read-multiple-files-one-after-the-other

I have multiple .m files, and I have written code that reads one file and does exactly what I need.

However, I need to run this code over multiple files, all with different words in them, and at the end find all the words that exist across ALL the files.

How can I do this?

This is my code so far:

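    fid = fopen('testing1.m');
    out = textscan(fid, '%s', 'Delimiter', '\n');
    out = regexp(lower(out{1}), ' ', 'split');   % one cell of words per line
    fclose(fid);
    comb = unique([out{:}]);                     % unique words in the file
    comb = comb(~cellfun('isempty', comb));
    m = size(out, 1);
    idx = false(m, size(comb, 2));
    AL = false(m, size(comb, 2));                % preallocated to match idx
    for j = 1:m
        idx(j,:) = ismember(comb, out{j});
        if ismember('hello', out{j})
            AL(j,:) = idx(j,:);
        end
    end
    AL(all(AL == 0, 2), :) = [];                 % drop rows of lines without 'hello'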

To open multiple files I use this:

    for i = 1:2
        fid = fopen(sprintf('testing%d.m', i));
        % ... same processing as above ...
    end

When I open two files this way, my code fails because the matrix dimensions do not match.

Any ideas on how to build the array AL for two .m files, testing1 and testing2? I want to create arrays that accumulate each time a file is read.

Answer 1 (accepted answer)

Cedric, 19 April 2013

Direct link to this answer:

https://kr.mathworks.com/matlabcentral/answers/72819-how-do-i-keep-updating-and-accumulating-my-arrays-as-i-read-multiple-files-one-after-the-other#answer_82862

Edited: Cedric, 19 April 2013

You could go for something like:

 words = {} ;
 for k = 1 : 2
   buffer = fileread(sprintf('testing%d.m', k)) ;
   words  = [words, regexp(buffer, '\w*', 'match')] ;   % Alphanumerical words.
 end
 uniqueWords = unique(words) ;
 % .. etc.
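The dimension mismatch in the question most likely comes from building the indicator rows against each file's own word list, which has a different length per file. One way around it is to accumulate the lines first, build the vocabulary once over all files, and only then create the rows. The following is a minimal sketch under assumptions: the in-memory strings in `texts` stand in for `fileread(sprintf('testing%d.m', k))`, and the names `texts`, `allLines`, `vocab` and `hits` are hypothetical.

```matlab
% Assumed sample text standing in for the contents of testing1.m and testing2.m.
texts = {sprintf('hello world\ngood morning'), sprintf('hello there world')};

allLines = {};                                   % one cell of words per line, all files
for k = 1:numel(texts)
    fileLines = regexp(lower(texts{k}), '\n', 'split');   % split the file into lines
    words     = regexp(fileLines, '\w+', 'match');        % words of each line
    allLines  = [allLines, words];                        % accumulate across files
end

vocab = unique([allLines{:}]);                   % vocabulary built once, over all files

% One row per line containing 'hello', one column per vocabulary word.
keep = cellfun(@(w) any(strcmp('hello', w)), allLines);
hits = allLines(keep);
AL = false(numel(hits), numel(vocab));
for j = 1:numel(hits)
    AL(j, :) = ismember(vocab, hits{j});
end
```

Because every row is measured against the same `vocab`, all rows have the same width regardless of which file the line came from.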

Comments: 13

Blaise, 20 April 2013

Edited: Blaise, 20 April 2013

It's a rather long process, so I'll try to explain it in a few words.

So my aim is to identify similarities in dialogue act tags. I have conversations between people with a dialogue act tag and a sentence associated with it in each line.

Example

sd A.11 I think this looks good on you

qy B.12 Why do you think that ?

sd A.13 I know this is true

where sd and qy are dialogue act tags, and A and B are the speakers.

idx is a logical array with one row per sentence containing 'sd' and one column per unique word: an entry is 1 if that word occurs in the sentence and 0 otherwise. So at the end I have the matrix idx for the dialogue act sd, and I take its mean over the sentences.

I then run the same code with a different condition, looking only for sentences with the DA qy. I do this for 42 different DAs, so I'm basically running the same code 42 times, changing 'match' each time.

Then I have the mean for each DA tag, and I work out the cosine similarity between all of them.
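The per-tag means and their pairwise cosine similarity, cos(a, b) = a·b / (‖a‖‖b‖), could be computed in one pass over a list of tags instead of rerunning the script for each one. This is only a sketch on assumed toy data: the variable names `lines`, `tags`, `M` and `C` are hypothetical, and it assumes every tag occurs in at least one line (otherwise the division by `numel(sel)` yields NaN).

```matlab
% Assumed toy data: each line is a cell of lower-cased tokens, tag first.
lines = { {'sd','i','think','this','looks','good'}, ...
          {'qy','why','do','you','think','that'}, ...
          {'sd','i','know','this','is','true'} };
tags  = {'sd', 'qy'};                        % the thread mentions 42 tags
vocab = unique([lines{:}]);

% M(t,:) is the mean presence vector of the lines carrying tag t.
M = zeros(numel(tags), numel(vocab));
for t = 1:numel(tags)
    sel = lines(cellfun(@(w) any(strcmp(tags{t}, w)), lines));
    for j = 1:numel(sel)
        M(t, :) = M(t, :) + ismember(vocab, sel{j});
    end
    M(t, :) = M(t, :) / numel(sel);
end

% Cosine similarity between every pair of tags.
nrm = sqrt(sum(M.^2, 2));                    % Euclidean norm of each row
C   = (M * M') ./ (nrm * nrm');
```

`C(a,b)` is then the similarity between tags `a` and `b`, with ones on the diagonal.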

I'm going to add my code below and send you a file or two. The files themselves are not too big; it's the matrices I create that are rather large.

Cedric, 21 April 2013

Edited: Cedric, 21 April 2013

I see :) Your computation with idxsd doesn't scale well, actually.

The first thing you can do is remove Asd, which is strictly equivalent to idxsd, and preallocate the latter, since you know its size (n_lines x n_words). You can also avoid storing the full idxsd and store counts instead, and get rid of ISMEMBER. I ran a few tests and compared each of them with the output of your version of sdmean:

 % - Prealloc.
 tic ;
 idxsd2 = false(length(linessd), length(wordssd)) ;
 for j = 1 : length(linessd)
    idxsd2(j,:) = ismember(wordssd, linessd{j});
 end
 sdmean2 = mean(idxsd2) ;
 toc
 % - Store in vector; avoid array.
 tic ;
 idxsd3 = zeros(1, length(wordssd)) ;
 for j = 1 : length(linessd)
    idxsd3 = idxsd3 + ismember(wordssd, linessd{j});
 end
 sdmean3 = idxsd3 / length(linessd) ;
 toc
 % - Avoid ISMEMBER.
 tic ;
 idxsd4 = zeros(1, length(wordssd)) ;
 for j = 1 : length(linessd)
    pos = arrayfun(@(w)find(strcmp(w, wordssd), 1), linessd{j}) ;
    idxsd4(pos) = idxsd4(pos) + 1 ;
 end
 sdmean4 = idxsd4 / length(linessd) ;
 toc
 % - Avoid ISMEMBER and ARRAYFUN.
 tic ;
 idxsd5 = zeros(1, length(wordssd)) ;
 for j = 1 : length(linessd)
    for k = 1 : length(linessd{j})
        idxsd5 = idxsd5 + strcmp(linessd{j}{k}, wordssd) ;
    end
 end
 sdmean5 = idxsd5 / length(linessd) ;
 toc
 % Check.
 [all(sdmean2==sdmean), all(sdmean3==sdmean), ...
    all(sdmean4==sdmean), all(sdmean5==sdmean)]

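Running this comparison outputs (the first timing presumably belongs to your original version, which is not repeated in the listing above):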

 Elapsed time is 0.475360 seconds.
 Elapsed time is 0.324377 seconds.
 Elapsed time is 0.316564 seconds.
 Elapsed time is 0.301044 seconds.
 Elapsed time is 0.121800 seconds.
 ans =
     1     1     1     0

This indicates that the double FOR loop is the fastest, BUT its output differs from the others. The reason is that it counts every occurrence of a word in a line, whereas ISMEMBER and the ARRAYFUN-based solution produce a unit increment even when a word occurs several times. We could correct the behavior of the 5th method so it matches the four previous ones, but before doing that I wanted to raise the question: "which behavior is correct from the point of view of the statistics that you want to compute?" Once you answer this, we can also think about using STRNCMP.
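The difference between the two behaviors can be seen on a single line with a repeated word. This is a small illustration on assumed toy data (`vocab` and `line` are hypothetical names), not part of the timing test above.

```matlab
vocab = {'do', 'i', 'think', 'you'};
line  = {'i', 'think', 'i', 'do'};           % 'i' occurs twice

% Presence: 1 if the word occurs in the line at all.
presence = ismember(vocab, line);

% Counting: adds 1 for every occurrence, as the double FOR loop does.
counts = zeros(1, numel(vocab));
for k = 1:numel(line)
    counts = counts + strcmp(line{k}, vocab);
end
```

Here `presence` is `[1 1 1 0]` while `counts` is `[1 2 1 0]`, so the two only agree when no line repeats a word.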

Just as a side note: your matrix idxsd can be visualized with

spy(idxsd) ;

which is interesting qualitatively speaking.

Blaise, 21 April 2013

Edited: Blaise, 21 April 2013

I have just run all your code on 100 files, and I get a different result for the first method (sdmean2), but all the others give the same mean, even the 5th method. idxsd2 is a 2012-by-3956 logical array, whereas the others are 1-by-6073 doubles.

When I run your code on a small example, though, I do see the difference between the 5th method and the rest. I don't think there are any multiple occurrences in the first 100 files, which is why I couldn't see the difference. Is there any way to make the 5th method behave like the others? It takes much less time.

I want the code to ignore multiple occurrences: if the string is in the sentence, it should output 1, and the next time it reads the same string in that sentence it shouldn't do anything.

And thank you for taking the time to show me all these methods.
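One possible way to make the double-loop method ignore repeated words is to deduplicate each line with `unique` before counting. This is a sketch on assumed toy inputs standing in for the thread's `wordssd` and `linessd`; it is not taken from the thread, and whether it keeps the speed advantage on the real data has not been measured here.

```matlab
% Assumed toy inputs standing in for the thread's wordssd and linessd.
wordssd = {'do', 'i', 'sd', 'think', 'you'};
linessd = { {'sd','i','think','i','do'}; {'sd','you','think'} };

counts = zeros(1, numel(wordssd));
for j = 1:numel(linessd)
    lineWords = unique(linessd{j});          % each word at most once per line
    for k = 1:numel(lineWords)
        counts = counts + strcmp(lineWords{k}, wordssd);
    end
end
sdmean5 = counts / numel(linessd);

% Reference: the ISMEMBER-based mean, for comparison.
ref = zeros(1, numel(wordssd));
for j = 1:numel(linessd)
    ref = ref + ismember(wordssd, linessd{j});
end
ref = ref / numel(linessd);
```

On this toy input `sdmean5` equals `ref`, even though the first line contains 'i' twice.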

Answer 2

Blaise, 20 April 2013

Direct link to this answer:

https://kr.mathworks.com/matlabcentral/answers/72819-how-do-i-keep-updating-and-accumulating-my-arrays-as-i-read-multiple-files-one-after-the-other#answer_82941

    nFiles  = 1154 ;
    match   = 'sd' ;
    wordssd = {} ;
    linessd = {} ;
    for k = 1 : nFiles
        fid       = fopen(sprintf('sw000%d.m', k), 'r') ;
        contentsd = textscan(fid, '%s', 'Delimiter', '\n') ;
        fclose(fid) ;
        contentsd = regexp(lower(contentsd{1}), '\w*', 'match') ;
        wordssd   = unique([wordssd, [contentsd{:}]]) ;
        id        = cellfun(@(line) any(strcmpi(match, line)), contentsd) ;
        linessd   = [linessd; contentsd(id)] ;
    end
    for j = 1 : length(linessd)
        idxsd(j,:) = ismember(wordssd, linessd{j}) ;
        Asd(j,:)   = idxsd(j,:) ;
    end
    sdmean = mean(Asd) ;