Read text file lines and analyze

Lmm3 on 24 Jul 2017
Answered: OCDER on 9 Sep 2017
I would appreciate help with reading and analyzing a text file. The text file (rosalind_gc1.txt) is in this format:
>Rosalind_4949
ACTTCTATGTAGCGCGCTATTTCAAGGGATCGGCCAATAGTACGACGTGTTTCATCTAGT
GCGACAAATGTATATACCGTTTTCATTACGTACCACGATAAGTTGAAGCCCGTATTC
AGACGCGGGAGCCGTCTGCTGGACAAGTACTAGCTGGTCCATCCTCCCCACCAAAGGGAA
>Rosalind_7490
AACTGGGAATTTCTATATTGGGCGGTAAGCTCGGGGCAATCTATTAGTTGAATGCAACAG
TAACAAACTTGCCGTCGGTCGCTGTTCGCGCAGCATTAATAATAACTCTGGCGAGTAGAT
>Rosalind_8337
CCTTGTTGTCTACCCACCAAGTCAGATAGACAGTTGGCTGTCTCCAACGCAGATTTTCTA
CGCTTCATGCTCTTGCGACTCATGTCGCCTGGGTTTATTGCTTCTCTACGGGATAACCGC
CCGGGCTCACTCTACCCGCGGGAAGGCCGCCCTCTCTCCCGTGTGCCTACATAA
I would like to determine the %GC for the data sets between each “>Rosalind” heading. For example, in the example above there are 3 data sets. The %GC for the text between “>Rosalind_4949” and “>Rosalind_7490” is 48.5876% and between “>Rosalind_7490” and “>Rosalind_8337” is 45.000%.
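For reference, %GC here means the number of G and C characters divided by the total sequence length, times 100. A minimal sketch for the first record only, with its three sequence lines pasted in and concatenated by hand (the variable names seq, nGC and PercGC are just illustrative):

```matlab
% Sequence lines of >Rosalind_4949, joined into one char vector
seq = ['ACTTCTATGTAGCGCGCTATTTCAAGGGATCGGCCAATAGTACGACGTGTTTCATCTAGT', ...
       'GCGACAAATGTATATACCGTTTTCATTACGTACCACGATAAGTTGAAGCCCGTATTC', ...
       'AGACGCGGGAGCCGTCTGCTGGACAAGTACTAGCTGGTCCATCCTCCCCACCAAAGGGAA'];

nGC    = sum(seq == 'G' | seq == 'C');   % count of G and C characters
PercGC = 100 * nGC / numel(seq);         % 86 of 177 characters
fprintf('%.4f\n', PercGC)                % prints 48.5876
```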
I’m trying to use the following code but I don’t know how to read the lines as blocks between each “>” and I don’t know how to concatenate the lines as I read them. I would appreciate any help.
fid = fopen('rosalind_gc1.txt');
while ~feof(fid)
    templine = fgetl(fid);
    a = strcmp(templine, '>');
    if a == 0
        G = length(strfind(templine,'G'));
        C = length(strfind(templine,'C'));
        z = length(templine);
        %Per = (G+C)*100/z
    end
end
Per = (G+C)*100/z

Accepted Answer

Lmm3 on 9 Sep 2017
The following code is what I used to read from the data file and determine %GC:
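fid = fopen('rosalind_gc.txt');
G = 0;        % running count of G in the current record
C = 0;        % running count of C in the current record
z = 0;        % running length of the current record
header = '';  % header line of the current record
while ~feof(fid)
    templine = fgetl(fid);
    if ~ischar(templine) || isempty(templine)
        continue   % skip blank lines
    end
    if templine(1) == '>'
        % New header: report the previous record (if any), then reset
        if z > 0
            Per = (G + C)*100/z;
            fprintf('%s: %.4f\n', header, Per);
        end
        header = templine;
        G = 0;
        C = 0;
        z = 0;
    else
        % Sequence line: add its counts to the running totals
        G = G + length(strfind(templine, 'G'));
        C = C + length(strfind(templine, 'C'));
        z = z + length(templine);
    end
end
fclose(fid);
% Report the last record, which has no header after it
if z > 0
    Per = (G + C)*100/z;
    fprintf('%s: %.4f\n', header, Per);
end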

More Answers (2)

KSSV on 24 Jul 2017
Edited: KSSV on 24 Jul 2017
Let data.txt be your text file. You can count the number of G characters in your file as below:
fid = fopen('data.txt') ;
S = textscan(fid,'%s','delimiter','\n') ;
fclose(fid) ;
S = S{1} ;
N = 0 ;
for i = 1:length(S)
    N = N+length(strfind(S{i}, 'G'));
end
Without a loop:
fid = fopen('data.txt') ;
S = textscan(fid,'%s','delimiter','\n') ;
fclose(fid) ;
S = S{1} ;
Ni = strfind(S,'G') ;
N = sum(cellfun(@numel,Ni)) ;
1 Comment
Lmm3 on 25 Jul 2017
KSSV, thank you for your response. Could you explain what the line S = S{1} is doing? The code returns the total number of "G" occurrences in the data file, but do you have a suggestion for how to get the "G" occurrences between each of the headers that begin with ">Rosalind"? For example, in the data set above, I would like to get 3 values: the number of G occurrences between “>Rosalind_4949” and “>Rosalind_7490”, between “>Rosalind_7490” and “>Rosalind_8337”, and below “>Rosalind_8337”.
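On the S = S{1} question: textscan returns a cell array with one cell per format specifier, so with a single '%s' the result is a 1-by-1 cell, and S{1} unwraps it into the inner cell array that holds one line of text per element. For per-record counts, one possible sketch (not KSSV's code) flags the header lines and uses a running sum of those flags as a record number. The cell array S below is made-up data standing in for what textscan would return, and isHeader, recId, nG and perRecord are illustrative names:

```matlab
% Made-up lines standing in for S = S{1} after textscan
S = {'>Rosalind_1'; 'GGCA'; 'TTG'; '>Rosalind_2'; 'ACGT'};

isHeader  = strncmp(S, '>', 1);              % true on header lines
recId     = cumsum(isHeader);                % record number for every line
nG        = cellfun(@(s) sum(s == 'G'), S);  % G count on each line
nG(isHeader) = 0;                            % do not count letters in headers
perRecord = accumarray(recId, nG);           % sum the line counts per record

disp(perRecord)                              % 3 for the first record, 1 for the second
```

This assumes the first line of the file is a header, so that every line gets a record number of at least 1.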

OCDER on 9 Sep 2017
If you deal with a lot of FASTA files, look into fastaread (MATLAB Bioinformatics Toolbox) or readFasta (a code I made for another project).
Also, cellfun and regexp are pretty handy tools.
To get GC %:
[Header, Seq] = readFasta('Seq.txt');
PercGC = cellfun(@(S)length(regexpi(S, 'G|C'))/length(S)*100, Seq);
PercGC =
48.5876
45.0000
55.1724
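readFasta is not a built-in function, so the snippet above needs that file on the path. As a rough stand-in (a hypothetical sketch, not OCDER's actual implementation), the same Header and Seq cell arrays can be built by splitting the file text on '>'. The text in txt is made-up data so the example runs on its own; for a real file it would presumably come from fileread('Seq.txt'):

```matlab
% Made-up FASTA text; for a real file use txt = fileread('Seq.txt');
txt = sprintf('>rec_1\nGGCC\nAATT\n>rec_2\nGCGC\nGA\n');

blocks = strsplit(txt, '>');                    % one piece per record
blocks = blocks(~cellfun(@isempty, blocks));    % drop the empty piece before the first '>'

Header = cell(numel(blocks), 1);
Seq    = cell(numel(blocks), 1);
for k = 1:numel(blocks)
    lines     = strsplit(strtrim(blocks{k}), sprintf('\n'));
    Header{k} = strtrim(lines{1});              % first line is the header text
    % Join the remaining lines and strip any leftover whitespace
    Seq{k}    = regexprep(strjoin(lines(2:end), ''), '\s', '');
end

% Same idea as the cellfun line above, counting G and C case-insensitively
PercGC = cellfun(@(S) 100*sum(upper(S) == 'G' | upper(S) == 'C')/length(S), Seq);
```

With the Bioinformatics Toolbox installed, [Header, Seq] = fastaread('Seq.txt') should give comparable output without any hand-written parsing.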
