I need to identify the matching lines between two text files, mwithrm21.txt and virgomrmdist.txt, based on column 7 of each file. These matches should then be exported to a new text file, and the matched lines removed from mwithrm21.txt.
I have attached the text files.
I drafted the code below:
content1 = fileread( 'mwithrm21.txt' ) ;
content2_rows = strsplit( fileread( 'virgomrmdist.txt' ), sprintf( '\n' )) ;
found = cellfun( @(s)~isempty(strfind(content1, s)), content2_rows ) ;
output_rows = content2_rows(found) ;
fId = fopen( 'similarvclf.txt', 'w' ) ;
fprintf( fId, '%s\n', output_rows{:} ) ;
fclose( fId ) ;
output_rows = content2_rows(~found) ;
fId = fopen( 'mwithrm21_new.txt', 'w' ) ; % Remove the '_new' for overwriting original.
fprintf( fId, '%s\n', output_rows{:} ) ;
fclose( fId ) ;
However, I do not know how to restrict the search to column 7 only and then export the entire matched line to a new text file.

6 Comments

Cedric on 15 Aug 2015
Did you understand my last comment in this thread?
jgillis16 on 15 Aug 2015
Yes, I worked through it. But, how would I specify a specific column to look for matches in? Because in your example, it works off of finding a '~' from a specific column then extracting. Now, I need to find matches in column 7, which are name matches.
Cedric on 15 Aug 2015
Actually it works by finding the positions of the separators '|'. This allows us to build an array of separator positions for each row of the file. We do this precisely because you needed to target a specific column in your question. Based on this array, we can then test a specific column (whichever we want, as we have the positions of all separators). In the example, we test whether column 4 contains '~' by testing the characters that directly follow the 3rd separator of each row.
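The separator-position idea can be sketched as follows. This is a minimal per-row illustration, not the code from the other thread: the sample rows and the names `rows`, `colToTest`, and `hasTilde` are hypothetical, and a vectorized version over the whole file content is equally possible.

```matlab
% Hypothetical rows; column 4 is '~' in the first row only.
rows = {'1.0|2.0|3.0|~|NAME1', '4.0|5.0|6.0|7.0|NAME2'};
colToTest = 4;
hasTilde = false(size(rows));
for k = 1:numel(rows)
    % Positions of all separators in this row.
    sepPos = strfind(rows{k}, '|');
    % Character that directly follows the (colToTest-1)-th separator,
    % i.e. the first character of column colToTest.
    firstChar = rows{k}(sepPos(colToTest - 1) + 1);
    hasTilde(k) = (firstChar == '~');
end
```

Because `sepPos` holds every separator position, changing `colToTest` is enough to target another column (column 1 would need special handling, since no separator precedes it).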
Cedric on 15 Aug 2015
I'll answer your question in about 10 minutes, but I have been insisting on this thread because the file content processing you are doing is becoming increasingly complicated, so at some point you will need to understand quite well all the approaches involved in the answers to your questions so far.
Cedric on 15 Aug 2015
Edited: Cedric on 15 Aug 2015
Actually, you say that you want to match rows based on column 7 only, but when they match the other columns don't always match. What do you want to have in the output?
For example:
file1 : 188.83785|27.56214|-14.4|18.931|0.398|~|SDSSJ123521.05+273343.6
file2 : 188.83785|27.56214|18.931|0.398|-14.4|~|SDSSJ123521.05+273343.6
Should we export both?
jgillis16 on 15 Aug 2015
I understand, which is why I worked through it (thanks for the reminder email!!).
It doesn't matter honestly, since the output material in the lines is the same, except in different order. But, since my main focus is to export lines matching in mwithrm21 to a new text file while removing the matched lines from the original mwithrm21 text file, I would like the exported lines to come from mwithrm21.txt


Accepted Answer

Cedric on 15 Aug 2015
Edited: Cedric on 16 Aug 2015

Votes: 1

Here is a first draft. Test it and let me know if anything is unclear or doesn't work.
% - Read file contents as strings.
content1 = fileread( 'mwithrm21.txt' ) ;
content2 = fileread( 'virgomrmdist.txt' ) ;
% - Extract last column of each content.
codes1 = regexp( content1, '[^|]+(?=(\s|$))', 'match' ) ;
codes2 = regexp( content2, '[^|]+(?=(\s|$))', 'match' ) ;
% - Match codes.
[isMatch_1in2, match_posIn2] = ismember( codes1, codes2 ) ;
% - Split content. Careful, whatever generates these files still uses
% carriage returns (\r) only.
rows1 = strsplit( content1, char(13) ) ;
rows2 = strsplit( content2, char(13) ) ;
% - Output matches (version mwithrm21.txt). Use new line chars (\n) as
% joint instead of carriage returns, change if you prefer \r.
fId = fopen( 'matches.txt', 'w' ) ;
fwrite( fId, strjoin( rows1(isMatch_1in2), '\n' )) ;
fclose( fId ) ;
% - Output non-matching rows of file 1.
fId = fopen( 'mwithrm21_reduced.txt', 'w' ) ;
fwrite( fId, strjoin( rows1(~isMatch_1in2), '\n' )) ;
fclose( fId ) ;
% - Output non-matching rows of file 2. Eliminate matching rows first.
rows2(nonzeros( match_posIn2 )) = [] ;
fId = fopen( 'virgomrmdist_reduced.txt', 'w' ) ;
fwrite( fId, strjoin( rows2, '\n' )) ;
fclose( fId ) ;
EDITs :
  • Replaced match_posIn2(match_posIn2~=0) with nonzeros( match_posIn2 ) after reading an answer by Matt J in another thread that mentions NONZEROS.

7 Comments

To understand what we do with ISMEMBER, study a small example, e.g.
>> [isMatch_1in2, match_posIn2] = ismember( {'a', 'c', 'd', 'e', 'f'}, {'a', 'b', 'f'} )
isMatch_1in2 =
1 0 0 0 1
match_posIn2 =
1 0 0 0 3
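To connect this small example with the answer's code, here is a sketch of how the two outputs of ISMEMBER are then used for indexing. The rows and codes below are hypothetical stand-ins for the real file content.

```matlab
% Hypothetical rows of each file and their last-column codes.
rows1  = {'r1|a', 'r2|c', 'r3|f'};
rows2  = {'x1|a', 'x2|b', 'x3|f'};
codes1 = {'a', 'c', 'f'};
codes2 = {'a', 'b', 'f'};
[isMatch_1in2, match_posIn2] = ismember(codes1, codes2);
% Logical indexing: rows of file 1 whose code exists in file 2.
matches  = rows1(isMatch_1in2);
% Complement: rows of file 1 with no match in file 2.
reduced1 = rows1(~isMatch_1in2);
% Positions in file 2 of the matched codes; zeros (no match) are dropped.
rows2(nonzeros(match_posIn2)) = [];
```

Here `matches` holds the first and third rows of file 1, `reduced1` the second, and `rows2` is left with its second row only.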
Index exceeds matrix dimensions.
Error in mwithrm21virgomrmdist (line 16)
fwrite( fId, strjoin( rows1(isMatch_1in2), '\n' )) ;
Cedric on 16 Aug 2015
Edited: Cedric on 16 Aug 2015
It doesn't happen on my system with the files that you provided. Are you using the same files? What happens when you evaluate
>> size( rows1 )
ans =
1 4734
>> size( isMatch_1in2 )
ans =
1 4734
which one is different (or both)?
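One possible cause of such a size mismatch, which is an assumption and not something confirmed in this thread, is a file whose line endings are not bare carriage returns: splitting on char(13) then yields a different number of rows than the number of codes found by REGEXP. A minimal sketch that is insensitive to the line-ending style, using hypothetical sample content:

```matlab
% Hypothetical content mixing \r\n, \n and \r line endings.
content = sprintf('1|2|A\r\n3|4|B\n5|6|C\r');
% Split on any of the three line-ending styles.
rows = regexp(content, '\r\n|\r|\n', 'split');
% Drop empty rows, e.g. the one created by a trailing line break.
rows = rows(~cellfun(@isempty, rows));
% Extract the text after the last '|' of each row, so that codes and
% rows always have the same number of elements.
codes = regexp(rows, '[^|]*$', 'match', 'once');
```

Extracting the codes row by row, rather than from the whole content string, keeps `codes` and `rows` aligned by construction.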
Cedric on 16 Aug 2015
PS: in this other thread that I mentioned above, does my answer answer your question?
jgillis16 on 16 Aug 2015
No, it was just my text files! Thanks!
Quick question, what if I needed to search for matches only restricted to the first column? I am using different text files, but this time I need to look for matches in the first column and then do the same thing we were doing in this thread.
Cedric on 17 Aug 2015
Edited: Cedric on 17 Aug 2015
Just replace the patterns in the calls to REGEXP with:
'(?<=(\s|^))[^|]+'
This means: match
  • One or more characters (as many as possible) different from '|'.
  • Preceded by either a white space (which, in your case, is the carriage return of the previous line) or the beginning of the string.
PS1: if execution speed is important, you should profile Per's solution and mine, to see which one is the fastest in your specific case.
PS2: the previous patterns
'[^|]+(?=(\s|$))'
meant: match
  • One or more characters (as many as possible) different from '|'.
  • Followed by either a white space (which, in your case, is the carriage return after column 7) or by the end of the string.
PS3:
  • [abc] is a set of characters to match (not a string, it means 'a' or 'b' or 'c').
  • [^abc] matches any character that is not in the set.
  • + means "one or more (as many as possible) of what precedes".
  • (?<=...) is a positive lookbehind for whatever is between = and ).
  • (?=...) is a positive lookahead for whatever is between = and ).
  • (..|..) is an OR operator.
  • \s means "white space", which includes spaces, tabs, carriage returns, and newline characters.
  • ^ matches the beginning of the string.
  • $ matches the end of the string.
With that, you can decipher the patterns in principle.
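Both patterns can be tried on a small string before running them on the real files. The sample content below is hypothetical: two rows separated by a carriage return, with no white space inside any field.

```matlab
% Hypothetical two-row content, rows separated by a carriage return.
content = sprintf('188.8|27.5|~|NAME1\r189.1|28.0|~|NAME2');
% Last column: non-'|' characters followed by white space or end of string.
lastCol  = regexp(content, '[^|]+(?=(\s|$))', 'match');
% First column: non-'|' characters preceded by white space or start of string.
firstCol = regexp(content, '(?<=(\s|^))[^|]+', 'match');
```

With this input, `lastCol` contains 'NAME1' and 'NAME2', and `firstCol` contains '188.8' and '189.1'. Note that both patterns rely on fields containing no white space.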
jgillis16 on 17 Aug 2015
OK! That was very helpful!!! Thanks!
Next time, I'll have a better idea of what I need to code, and I might get it done myself without bugging you guys :)


More Answers (2)

per isakson on 16 Aug 2015
Edited: per isakson on 16 Aug 2015

Votes: 2

Here is an example of a different approach to the task. The two output files, mwithrm21_reduced.txt and matches.txt, are identical to those produced by Cedric's answer, apart from the newline characters.
function et = cssm()
% et(1) = cssm_1();
et(2) = cssm_2();
end
function et = cssm_2()
tic
fid = fopen( 'mwithrm21.txt', 'rt' );
rows1 = textscan( fid, '%s', 'Delimiter','\n' );
fseek( fid, 0, 'bof' );
codes1 = textscan( fid, '%*s%*s%*s%*s%*s%*s%s', 'Delimiter','|' );
fclose( fid );
%
fid = fopen( 'virgomrmdist.txt', 'rt' );
codes2 = textscan( fid, '%*s%*s%*s%*s%*s%*s%s', 'Delimiter','|' );
fclose( fid );
%
ism = ismember( codes1{1}, codes2{1} );
%
fid = fopen( 'matches.txt', 'wt' );
fprintf( fid, '%s\n', rows1{1}{ism} );
fclose( fid ) ;
%
fid = fopen( 'mwithrm21_reduced.txt', 'wt' );
fprintf( fid, '%s\n', rows1{1}{not(ism)} );
fclose( fid );
et = toc;
end
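In the format strings above, each `%*s` tells TEXTSCAN to read a field and discard it, so six `%*s` followed by one `%s` keep only column 7. A minimal sketch on a hypothetical three-column text (TEXTSCAN also accepts text in place of a file identifier):

```matlab
% Hypothetical sample text with three '|'-separated columns per line.
txt = sprintf('a|b|c\nd|e|f');
% Skip columns 1 and 2, keep column 3.
C = textscan(txt, '%*s%*s%s', 'Delimiter', '|');
% C is a 1-by-1 cell; its content is a column cell array of strings.
col3 = C{1};
```

To target the first column instead, a format such as `'%s%*[^\n]'` (keep one field, skip the rest of the line) is one option.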

1 Comment

jgillis16 on 17 Aug 2015
Hey thanks! I appreciate the alternative approach!


r r on 11 May 2021

Votes: 0

I have two files whose first columns contain similar numbers. I want to print the lines that match and the lines that differ between the two files, based on the number in the first column:
%%%%%%%%%%%%%%%%%%%%%%% File 1
fid1 = fopen( 'E1.txt', 'rt' );
T1 = textscan( fid1, '%s', 'Delimiter', '\n' );
fclose( fid1 );
%%%%%%%%%%%%%%%%%%%%%%% File 2
fid2 = fopen( 'G1.txt', 'rt' );
T2 = textscan( fid2, '%s', 'Delimiter', '\n' );
fclose( fid2 );
%%%%%%%%%%%%%%%%%%%%%%%
% Work on cell arrays of lines; CHAR would pad the lines to a common width
% per file, and the 'rows' option fails when the two widths differ.
lines1 = T1{1};
lines2 = T2{1};
% Note: this compares whole lines, not only the first column.
% Lines present in both files:
C = intersect( lines1, lines2 );
% Lines of file 1 that are absent from file 2. This needs SETDIFF; VISDIFF
% expects file names and has no 'rows' syntax.
B = setdiff( lines1, lines2 );
%%%%%%%%%%%%%%%%%%%% Print output
fid = fopen( 'Similar.txt', 'wt' );   % all lines common to both files
fprintf( fid, '%s\n', C{:} );
fclose( fid );
fid = fopen( 'Different.txt', 'wt' ); % lines found only in file 1
fprintf( fid, '%s\n', B{:} );
fclose( fid );


Asked: 15 Aug 2015

Answered: r r on 11 May 2021
