Hi guys, I want to read a text file line by line and remove the lines that contain NA and the duplicated columns


chocho on 15 Feb 2017


Edited: Walter Roberson on 20 Feb 2017

Accepted Answer: dpb

COADREAD_methylation.txt


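d = fopen('COADREAD_methylation.txt','r');
lines = {};                      % renamed from "all", which shadows the built-in function
this_line = fgetl(d);
while ischar(this_line)          % fgetl returns numeric -1 at end of file
 % C= textscan( d, '%f%s'  ) ;
    lines = [lines; {this_line}];
    this_line = fgetl(d);
end
fclose(d);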


Stephen23 on 17 Feb 2017

Edited: Stephen23 on 17 Feb 2017

Accepted Answer

dpb on 15 Feb 2017


Edited: dpb on 16 Feb 2017


Well, 'NA' is easy, not sure what defines the repeated columns; not enough time at present to try to parse that input file to figure out what is/isn't unique without a description being supplied...

fid = fopen('COADREAD_methylation.txt','r');
data = {};
while ~feof(fid)
  l = fgetl(fid);                                        % read the next line
  if isempty(strfind(l,'NA')), data = [data; {l}]; end   % keep only lines without 'NA'
end
fid = fclose(fid);

If the presence of 'NA' is all that's needed to get all the offending records, then you're done; otherwise need more details on how to tell so folks here don't have to try to work it out on their own.
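One thing to keep in mind with the substring test is that strfind(l,'NA') also flags lines where 'NA' only appears inside a longer token. The following is a minimal sketch of a stricter, whole-field test, assuming the fields are tab-delimited and a missing value is written exactly as NA. The two sample lines are made up for illustration and are not taken from the attached file.

```matlab
tab = char(9);

% Two hypothetical tab-delimited lines (illustration only).
% Line 1 has 'NA' only inside a gene symbol; line 2 has a real missing value.
lines = cell(2,1);
lines{1} = ['cg00000001' tab '0.5' tab 'NAT1'];
lines{2} = ['cg00000002' tab 'NA'  tab 'SLMAP'];

% Substring test, as in the loop above: flags both lines.
hasSubstr = ~cellfun(@isempty, strfind(lines, 'NA'));

% Whole-field test: flags a line only if some field is exactly 'NA'.
isNAField = @(s) any(strcmp(strsplit(s, tab, 'CollapseDelimiters', false), 'NA'));
hasField  = cellfun(isNAField, lines);

kept = lines(~hasField);   % keeps line 1, drops line 2
```

Whether the stricter test is needed depends on the file; if 'NA' never occurs inside identifiers or gene symbols, the simple strfind check is enough.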


dpb on 16 Feb 2017

Edited: dpb on 17 Feb 2017


Are such rows the only ones remaining that have a semi-colon in them? If so, finding them is the same as finding the 'NA' string, except that you now want to process the lines that contain the target rather than skip them.

Breaking up the lines is again some pretty simple string processing; simply locate the semi-colon and the last tab-delimiter between the fields to retain the same input format.

>> l=sprintf('cg00008493\t0.987979722052904\t"COX8C;KIAA1409"');  % tab-delimited sample line
>> l=strrep(l,'"','');  % remove the superfluous "
>> ix=strfind(l,';');  % locate the semi-colon separator
>> itab=find(l==char(9),1,'last');  % and the last \t before 
>> l=[{l(1:ix-1)}; {[l(1:itab) l(ix+1:end)]}];  % build two lines from one
>> l
l = 
  'cg00008493  0.987979722052904  COX8C'
  'cg00008493  0.987979722052904  KIAA1409'
>>

NB: the result is a cellstr array because the lengths of the two substrings aren't the same; this is the same "trick" used earlier in concatenating the lines while removing the unwanted lines.

chocho on 20 Feb 2017

Edited: Walter Roberson on 20 Feb 2017


Hi friend, I want to make the code follow this format.

Note: I want to get every line and check whether it contains NA; if it does, remove it and move on to the next line. If not, check the columns of this line, find which column contains ';', split that column, and make 2 rows.

fid = fopen('COADREAD_methylation.txt','r');
data = {};
while ~feof(fid)
  l = fgetl(fid);                            % get the next line
  if ~ischar(l), break, end                  % fgetl returns -1 at end of file
  if isempty(strfind(l,'NA'))                % only process lines without NA
    cols  = regexp(strrep(l,'"',''),'\t','split');  % strip quotes, split the line into columns
    parts = regexp(cols,';','split');        % split every column on ';'
    nrows = max(cellfun(@numel,parts));      % number of rows this line expands into
    for k = 1:nrows
      row = cols;
      for i = 1:numel(cols)
        if numel(parts{i}) > 1               % this column contained ';'
          row{i} = parts{i}{min(k,numel(parts{i}))};  % take the k-th piece
        end
      end
      data = [data; {strjoin(row,char(9))}]; % save this line: no NA, ';' columns split
    end
  end
end
fid = fclose(fid);

chocho on 20 Feb 2017

Edited: Walter Roberson on 20 Feb 2017


inputs:

Hybridization REF      TCGA-A6-2672-11A-01D-1551-05  TCGA-A6-2672-11A-01D-1551-05  TCGA-A6-2672-11A-01D-1551-05
Composite Element REF  Beta_value         Gene_Symbol       Chromosome  Genomic_Coordinate  Beta_value         Gene_Symbol
cg00000292             0.511852232819811  ATP2A1            16          28890100            0.787687855895422  ATP2A1
cg00002426             0.519102187746053  SLMAP             3           57743543            0.932889308560864  SLMAP
cg00006414             NA                 "ZNF425;ZNF398"   7           148822837           NA                 "ZNF425;ZNF398"
cg00008493             0.987979722052904  "COX8C;KIAA1409"  14          93813777            0.986128428295584  "COX8C;KIAA1409"
cg00011459             0.922491239231445  "TMEM186;PMM2"    16          8890425             0.961124285303233  "TMEM186;PMM2"

outputs:

Hybridization REF  TCGA-A6-2672-11A-01D-1551-05  TCGA-A6-2672-11A-01D-1551-05  TCGA-A6-2672-11A-01D-1551-05
cg00000292  0.511852232819811  ATP2A1    0.787687855895422
cg00002426  0.519102187746053  SLMAP     0.932889308560864
cg00008493  0.987979722052904  COX8C     0.986128428295584
cg00008493  0.987979722052904  KIAA1409  0.986128428295584
cg00011459  0.922491239231445  TMEM186   0.961124285303233
cg00011459  0.922491239231445  PMM2      0.961124285303233
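The mapping from the inputs to the outputs above could be sketched roughly as follows. This is only a sketch under assumptions read off the example: fields are tab-delimited, a missing value is exactly NA, the columns to keep are 1, 2, 3 and 6 (the variable keep is my own name for that guess), and the gene symbol ends up in the third kept column. It works on three of the sample data lines held in memory and ignores the two header lines.

```matlab
tab = char(9);

% Three data lines from the example input, rebuilt with tab delimiters.
in = cell(3,1);
in{1} = strjoin({'cg00000292','0.511852232819811','ATP2A1','16','28890100','0.787687855895422','ATP2A1'}, tab);
in{2} = strjoin({'cg00006414','NA','"ZNF425;ZNF398"','7','148822837','NA','"ZNF425;ZNF398"'}, tab);
in{3} = strjoin({'cg00008493','0.987979722052904','"COX8C;KIAA1409"','14','93813777','0.986128428295584','"COX8C;KIAA1409"'}, tab);

keep = [1 2 3 6];   % assumption: columns to retain, inferred from the example output
out  = {};
for n = 1:numel(in)
    % strip the quotes and split the line into fields
    cols = strsplit(strrep(in{n}, '"', ''), tab, 'CollapseDelimiters', false);
    if any(strcmp(cols, 'NA')), continue, end   % drop lines with a missing value
    cols  = cols(keep);                          % drop the unwanted/duplicated columns
    genes = strsplit(cols{3}, ';');              % one output row per gene symbol
    for k = 1:numel(genes)
        cols{3} = genes{k};
        out = [out; {strjoin(cols, tab)}];
    end
end
```

For the real file, the in cell array would be replaced by the lines collected with fgetl, and the header lines would need to be copied through or skipped separately.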