Remove intermittent text when reading in a table from a .dat file

Question

L. Borealis on 19 Feb 2021

https://kr.mathworks.com/matlabcentral/answers/750659-remove-intermittent-text-when-reading-in-a-table-from-a-dat-file

Commented: L. Borealis on 25 Feb 2021

Hi,

I am trying to use readtable to read in a .dat file. The file looks like this, where each block can hold anywhere from one to very many rows; only the row starting with "1" is shown here.

# NetMHCIIpan version 4.0
# Input is in PEPTIDE format
# Prediction Mode: EL+BA
# Threshold for Strong binding peptides (%Rank)	2%
# Threshold for Weak binding peptides (%Rank)	10%
# Allele: HLA-DPA10103-DPB10101
--------------------------------------------------------------------------------------------------------------------------------------------
 Pos                     MHC              Peptide   Of        Core  Core_Rel        Identity      Score_EL %Rank_EL Exp_Bind      Score_BA  Affinity(nM) %Rank_BA  BindLevel
--------------------------------------------------------------------------------------------------------------------------------------------
   1   HLA-DPA10103-DPB10101      AAAAAAAAAAAAAAA    3   AAAAAAAAA     0.380        Sequence      0.020745    81.44       NA      0.366182        951.24    32.45       
   
--------------------------------------------------------------------------------------------------------------------------------------------
Number of strong binders: 2 Number of weak binders: 0
--------------------------------------------------------------------------------------------------------------------------------------------
# Allele: HLA-DPA10103-DPB10201
--------------------------------------------------------------------------------------------------------------------------------------------
 Pos                     MHC              Peptide   Of        Core  Core_Rel        Identity      Score_EL %Rank_EL Exp_Bind      Score_BA  Affinity(nM) %Rank_BA  BindLevel
--------------------------------------------------------------------------------------------------------------------------------------------
 
   1   HLA-DPA10103-DPB10201      BBBBBBBBBBBBBBBB    2   BBBBBBBBB     0.960        Sequence      0.491911     1.02       NA      0.712020         22.55     0.27 <=SB
     
--------------------------------------------------------------------------------------------------------------------------------------------
Number of strong binders: 2 Number of weak binders: 0
--------------------------------------------------------------------------------------------------------------------------------------------
# Allele: HLA-DPA10103-DPB10202
--------------------------------------------------------------------------------------------------------------------------------------------
 Pos                     MHC              Peptide   Of        Core  Core_Rel        Identity      Score_EL %Rank_EL Exp_Bind      Score_BA  Affinity(nM) %Rank_BA  BindLevel
--------------------------------------------------------------------------------------------------------------------------------------------
   1[.......]

The further rows would then start with 2, 3, 4, [...]. I successfully use

opts = detectImportOptions('filename.dat'); 
opts.DataLines = [16 Inf];
opts.VariableNamesLine = 14;
readtable(fullfile('path','filename.dat'), opts, 'ReadVariableNames', true);

for files with a large number of rows between the dashed lines, e.g.

   # Allele: HLA-DPA10103-DPB10101
--------------------------------------------------------------------------------------------------------------------------------------------
 Pos                     MHC              Peptide   Of        Core  Core_Rel        Identity      Score_EL %Rank_EL Exp_Bind      Score_BA  Affinity(nM) %Rank_BA  BindLevel
--------------------------------------------------------------------------------------------------------------------------------------------
   1   HLA-DPA10103-DPB10101      AAAAAAAAAAAAAAA    3   AAAAAAAAA     0.380        Sequence      0.020745    81.44       NA      0.366182        951.24    32.45       
   2   HLA-....
   3   ....
   ....
   ....
   50  HLA....
--------------------------------------------------------------------------------------------------------------------------------------------
Number of strong binders: 2 Number of weak binders: 0
--------------------------------------------------------------------------------------------------------------------------------------------

However, this does not work for short "fillings" (blocks with only a few rows), and my code very much depends on being robust in either scenario.

I tried playing with the opts but did not get it to work. I would be very grateful for any advice! Maybe a method other than readtable (readtext?) is needed and then a conversion to a table? In the end I will need a table like this:

Thank you very much for your advice! I have spent a long time developing the code around this and this is the final part that keeps breaking...

Answer 1

Vimal Rathod on 22 Feb 2021

https://kr.mathworks.com/matlabcentral/answers/750659-remove-intermittent-text-when-reading-in-a-table-from-a-dat-file#answer_630409

Hi,

Please refer to the following similar question which could be helpful to you.

How do I read data (from a .dat file) seperated by lines of text into individual vectors - MATLAB Answers - MATLAB Central (mathworks.com)

Comments: 1

L. Borealis on 25 Feb 2021

Thanks, Vimal!

I had actually seen that question, but even its description was not particularly clear to me, so I had left it. Thanks for pointing me back to it. I did use part of it in the end to come up with a working (if not elegant) solution. Maybe it will be useful to someone in the future:

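% Read the whole file and split it into lines
S = regexp(fileread('S:\scratch\cdr1pool2pep1\out_00.dat'), '\r?\n', 'split');
% Drop empty and whitespace-only lines; this also removes the empty element
% that the split leaves at the end when the file ends in \n, which is common
S = S(~cellfun('isempty', strtrim(S)));
% Data lines are those that do not start with #, -, P (column header) or N (binder summary)
nonheader = cellfun(@isempty, regexp(S, '^\s*#|^\s*-|^\s*P|^\s*N'));  % permit space before #
% First and last line index of each run of consecutive data lines
starts = strfind([false nonheader], [false true]);
stops = strfind([nonheader false], [true false]);
num_blocks = length(starts);
lenRows = stops(1) - starts(1) + 1;   % assumes every block has the same number of rows
S_temp = cell(num_blocks, lenRows);
for K = 1 : num_blocks
    S_temp(K,:) = S(starts(K):stops(K));
end
% Flatten to one column (column-major, so rows end up grouped by position rather than by block)
S = reshape(S_temp, [num_blocks*lenRows, 1]);
% Write the cleaned lines to a temporary file and let readtable parse them
writecell(S, 'data.dat')
opts = detectImportOptions('data.dat');
tbl = readtable('data.dat', opts);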
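
For anyone who needs blocks of different lengths, or would rather skip the temporary file, here is a minimal sketch of the same idea done in memory. It is only an illustration under a few assumptions: the function name parseNetMHCIIpanText and the variable names (Rank_EL, Affinity_nM, ...) are made up here, data rows are taken to be the lines that start with an integer, no field contains spaces, and BindLevel is the only column that may be missing. Unlike the code above, it keeps the rows in file order.

```matlab
function tbl = parseNetMHCIIpanText(txt)
% Hypothetical helper (not part of MATLAB or NetMHCIIpan): collect the data
% rows from NetMHCIIpan-style text and return them as a single table.
lines = regexp(txt, '\r?\n', 'split');

% Assumption: a data row starts with an integer position followed by more text.
% Comment, dashed, header, summary and blank lines do not match this pattern.
isData = ~cellfun('isempty', regexp(lines, '^\s*\d+\s+\S', 'once'));
rows = lines(isData);

% Assumed column names, adapted from the header line to valid identifiers
names = {'Pos','MHC','Peptide','Of','Core','Core_Rel','Identity', ...
         'Score_EL','Rank_EL','Exp_Bind','Score_BA','Affinity_nM', ...
         'Rank_BA','BindLevel'};

% Split each row on whitespace; a missing trailing BindLevel stays ''
C = repmat({''}, numel(rows), numel(names));
for k = 1:numel(rows)
    tok = strsplit(strtrim(rows{k}));
    C(k, 1:numel(tok)) = tok;
end
tbl = cell2table(C, 'VariableNames', names);

% Convert the columns that hold numbers; the others stay as text
numericCols = {'Pos','Of','Core_Rel','Score_EL','Rank_EL', ...
               'Score_BA','Affinity_nM','Rank_BA'};
for k = 1:numel(numericCols)
    tbl.(numericCols{k}) = str2double(tbl.(numericCols{k}));
end
end
```

Saved as parseNetMHCIIpanText.m, it could be called as tbl = parseNetMHCIIpanText(fileread('S:\scratch\cdr1pool2pep1\out_00.dat')). Because rows are picked by pattern rather than by line number, the number of rows per block does not matter.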
