I've looked through the posts on StackOverflow and on MATLAB Answers and can't seem to find the answer I am looking for. I have a large CSV file (450 MB) with hex data that looks like this:
63C000CF,6000002F,603000AF,6000C06F,617300EF,6C7C001F,6000009F,0%,63C000CF...
That is a very truncated example, but basically I have approximately 78 different hex values separated by commas, then there will be the '0%', then 78 more hex values. This will continue for a very long time. I've been using textscan like this:
data = textscan(fid, '%s', 1, 'delimiter', '%');
data = textscan(data{1}{1}, '%s', 'delimiter', ',');
data = data{1};
count = size(data);
outstring = ['%', sprintf('\n')];
for idx = 1:count(1)
    string = data{idx};
    stringSize = size(string);
    if stringSize(2) > 1
        outstring = [outstring, string, sprintf('\n')];
    end
end
fprintf(output_fid, '%s', outstring)
This let me reformat the CSV file so that I could use fgetl() to check whether I was looking at the data I needed. Because the data repeats, I can use fseek() to jump to the next occurrence before calling fgetl() again.
What I need is a way to skip to the ending. I want to just be able to use something like fgetl() but have it only return the first hex value it encounters. I will know how many bytes to shift through the file. Then I need to make sure I can read other hex values. Is what I'm asking possible? My code using textscan above takes far too long on a csv file that is 90 MB let alone 450 MB.
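One way to read a single hex value at a known byte position is fseek() followed by fread() of exactly eight characters. The following is only a minimal sketch: the sample data, the starting offset and the 30-byte stride are assumptions chosen for this tiny file, not the real spacing in the recorded data.

```matlab
% Build a tiny sample file so the sketch is self-contained (hypothetical data).
sampleName = [tempname, '.csv'];
fid = fopen(sampleName, 'w');
fprintf(fid, '%s', '63C000CF,6000002F,603000AF,0%,63C111CF,6000112F,603011AF,0%,');
fclose(fid);

offset = 0;     % byte position of the first value of interest (assumed known)
stride = 30;    % bytes between successive occurrences in this sample (assumed fixed)

fid = fopen(sampleName, 'r');
values = {};
while fseek(fid, offset, 'bof') == 0
    hexStr = fread(fid, 8, '*char').';   % read exactly one 8-character value
    if numel(hexStr) < 8
        break;                           % reached the end of the file
    end
    values{end+1} = hexStr;              %#ok<AGROW>
    offset = offset + stride;
end
fclose(fid);
delete(sampleName);
```

Each pass reads only 8 bytes, so the cost depends on the number of occurrences rather than on the size of the file.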

6 Comments

dpb on 4 Jun 2014
... have approximately 78 different hex values separated by commas, then there will be the '0%', then 78 more hex values
The "approximately" in there is a kicker--is the spacing consistent within a file but different files may have some other number or within a file? Could make all the difference...
Also, what, specifically, are you trying to return--a set of values a fixed offset apart or what?
Awaiting those answers, you may want to look at memmapfile.
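As a rough illustration of the memmapfile idea, the file can be mapped as raw bytes and indexed at fixed strides without reading it all into memory. This is a sketch under assumptions: the sample data, the 1-based starting offset and the 21-byte stride are made up for this small example.

```matlab
% Build a tiny sample file so the sketch is self-contained (hypothetical data).
sampleName = [tempname, '.csv'];
fid = fopen(sampleName, 'w');
fprintf(fid, '%s', '63C000CF,6000002F,0%,63C111CF,6000112F,0%,');
fclose(fid);

m = memmapfile(sampleName, 'Format', 'uint8');   % map the file as bytes

firstOffset = 1;    % 1-based position of the first value wanted (assumed known)
stride = 21;        % bytes per group in this sample (assumed constant)
starts = firstOffset : stride : numel(m.Data) - 7;

idx = bsxfun(@plus, starts(:), 0:7);             % 8 byte positions per value
block = reshape(char(m.Data(idx)), size(idx));   % one 8-character row per value

clear m;                                         % release the mapping before deleting
delete(sampleName);
```

Only the bytes that are actually indexed get touched, which is what makes this attractive for a file of several hundred megabytes.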
Adam Kaas on 4 Jun 2014
I said "approximately" only because I wasn't sure whether the count was exactly 78; whatever the number is, it will be the same every time. What I am doing is taking each hex value, converting it to binary, and analyzing the bits to interpret the data. Each set of hex values represents a label of information, and the labels are always in the same order (773 bytes apart, I believe), but sometimes the information changes. The information comes from a recording of data.
Hopefully this will help come to a good solution. I'll take a look at that function.
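The hex-to-binary step described here can be sketched with hex2dec, dec2bin and the bit functions. Which bits carry which meaning is not stated in the thread, so the split below (low byte as label, upper 24 bits as payload) is only an assumption based on the later description of labels.

```matlab
hexStr = '63C000CF';                     % sample value from the question
value  = hex2dec(hexStr);                % numeric value as a double
bits   = dec2bin(value, 32);             % 32 characters of '0'/'1', MSB first

word      = uint32(value);
labelByte = bitand(word, uint32(255));   % low 8 bits (hex CF)
payload   = bitshift(word, -8);          % remaining upper 24 bits (hex 63C000)
lowestBit = bitget(word, 1);             % a single bit; bit 1 is the least significant
```

Working on the numeric value with bitand, bitshift and bitget avoids repeated string handling when many values have to be decoded.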
dpb on 4 Jun 2014
Still not clear...which, precisely are the values you're after? How do you define those of interest? Need the logic behind the process of retrieval, not the rationale of what to do with them when get 'em or how/where they came from...
Adam Kaas on 4 Jun 2014
The user will select which labels they want to analyze the data from. It can be one label, it can be all of them. I would analyze each label one at a time until all chosen labels have been analyzed.
dpb on 4 Jun 2014
I know you know what you're after, but we can only go by what is revealed here. Don't be so terse; over-explain rather than under-...
...Each set of hex values represents a label
What's a "set" in this context? A single value or all the values of a given offset relative to the beginning/the flag value? Or is it the entire group between the flag values?
Is "one label" above a single 16-bit hex value, or again all of the same offset, or the group at the location of the indicated flag value? We need a precise definition of what it really is you're after.
How is/are the one(s) wanted identified?
What is the function of the indicator?
Adam Kaas on 4 Jun 2014
Edited: Adam Kaas on 4 Jun 2014
I apologize for not being thorough in my explanation.
A set of hex values represents an 8 character hex value (the values separated by commas), i.e. 63C000CF. One label is one set. We define them by the last two characters in the hex value, i.e. 63C000CF is label CF.
The labels chosen by the user are selected from a list of all available labels. This list is populated into a GUI in a separate function. Using the values from my example that list would be labels CF, 2F, AF, 6F, EF, 1F, and 9F. The user can, just as an example, select labels CF and AF and then I would need to go through the CSV file, find my first CF label and store the data contained, then move to the next CF (which will be a set number of bytes away in the file) and record that data until the end of the file is reached. Then I would repeat the process for the AF label.
If it is relevant, we do have names associated with the labels and don't actually refer to them as label CF. The label number is calculated in a strange way due to the way the data is transmitted, but essentially label CF would be label 363 (change CF to binary, flip it, that is the octal label). The user will know what kind of data is represented by that label.
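The label calculation described above (hex to binary, reverse the bits, read the result as octal) can be checked with a few lines. This is just a sketch of that arithmetic for the CF example, not the code used in the actual application.

```matlab
hexLabel   = 'CF';
bits       = dec2bin(hex2dec(hexLabel), 8);    % '11001111'
flipped    = fliplr(bits);                     % '11110011'
octalLabel = dec2base(bin2dec(flipped), 8);    % '363'
```

Padding to 8 bits in dec2bin matters: without it, labels whose leading bits are zero would be reversed incorrectly.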


Accepted Answer

Cedric on 4 Jun 2014
Edited: Cedric on 4 Jun 2014

3 votes

NEW solution
Here is a more efficient solution. I am using a 122 MB file, to give you an idea of the timing:
% One line for reading the whole file. To perform once only.
tic ;
content = fileread( 'adam_1.txt' ) ;
fprintf( 'Time for reading the file : %.2fs\n', toc ) ;
% One line for defining an extraction function. To perform once only.
extract = @(label) content(bsxfun( @plus, ...
                                   strfind( content, [label,','] ).' - 6, ...
                                   0 : 5 )) ;
% Then it is one call per label to extract data.
tic ;
data = extract( 'CF' ) ;
fprintf( 'Time for extracting one label: %.2fs\n', toc ) ;
Running this, I obtain
Time for reading the file : 0.52s
Time for extracting one label: 0.62s
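To see what the one-line extraction function does, here is the same indexing unrolled on a short sample string (the string is made up for illustration). strfind locates every occurrence of the label followed by a comma, and bsxfun builds one row of six indices per match, pointing at the six characters that precede the label.

```matlab
content = '63C000CF,6000002F,603000AF,0%,63C111CF,6000112F,603011AF,0%,';
label   = 'CF';

labelPos = strfind(content, [label, ',']);    % start index of each 'CF,'
idx      = bsxfun(@plus, labelPos.' - 6, 0:5); % N-by-6 matrix of character indices
data     = content(idx);                       % N-by-6 char array, one row per match
values   = hex2dec(data);                      % optional: numeric value of each row
```

Note that the pattern requires a trailing comma, so a final value at the very end of the file with no comma after it would not be matched.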
FORMER solution
Would the following work for you?
% Read file content. To do once only.
content = fileread( 'myFile.txt' ) ;
% Define regexp-based extraction function. To do once only.
getByLabel = @(label) regexp( content, sprintf( '\\w{6}(?=%s)', label ), ...
                              'match' ) ;
% Get all entries for e.g. label 'CF'.
entries_CF = getByLabel( 'CF' ) ;
% Get all entries for e.g. label '6F'.
entries_6F = getByLabel( '6F' ) ;
I am not completely clear on what you ultimately need to achieve. If I had to design a GUI where users choose a label and get the corresponding data, I would process the data much further during the init phase, e.g. by grouping them by label in a cell array. Regexp is probably not the most efficient approach in this case, but the principle would be:
labels = {'CF', '6F', 'AF'} ;      % .. and so on for the remaining labels
nLabels = numel( labels ) ;
data = cell( 1, nLabels ) ;        % preallocate one cell per label
for lId = 1 : nLabels
    data{lId} = getByLabel( labels{lId} ) ;
end
and then, when a user selects 'CF':
lId = strcmpi( label, labels ) ;
dataForThisLabel = data{lId} ;
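A variation on the same grouping idea is to key the pre-extracted data by label with a containers.Map, so the lookup at selection time is a direct access by name. This is only a sketch: the sample string and the label list are invented, and containers.Map is a substitute for the cell array plus strcmpi used above, not part of the original answer.

```matlab
content = '63C000CF,6000002F,603000AF,0%,63C111CF,6000112F,603011AF,0%,';
labels  = {'CF', '2F', 'AF'};            % hypothetical list of available labels

dataByLabel = containers.Map();          % char keys, values of any type
for k = 1:numel(labels)
    pattern = ['\w{6}(?=', labels{k}, ')'];
    dataByLabel(labels{k}) = regexp(content, pattern, 'match');
end

selected = dataByLabel('AF');            % what a GUI callback would fetch
```

The extraction cost is paid once during initialization; each later selection is then only a map lookup.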

6 Comments

Adam Kaas on 4 Jun 2014
Cedric,
I understand the thought process in the lower half of your post. It is similar to what I was doing initially, in that it looks at every hex value in the CSV file. I just feel it would take too long, much as the textscan() approach did. I'll look into regexp and see if I can make it work for my needs.
dpb on 4 Jun 2014
...in regards to the fact that it will look at every hex value in the CSV file...
That's where I think memmapfile can help.
Also, can you get the file produced as a binary stream rather than text CSV? That would cut the size down significantly, although your searching would then be on a bit mask rather than on characters.
I don't have time at the moment to play but could you attach a relatively short sample of the data file that could be used for testing ideas?
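If the recording could be written as a binary stream, each value would take 4 bytes instead of 9 characters (8 hex digits plus a comma), and label selection becomes a bit mask. The sketch below assumes a hypothetical layout of consecutive uint32 words with the label in the low byte; the real binary format, if one exists, is not described in this thread.

```matlab
% Write a tiny hypothetical binary file: one uint32 word per value.
hexValues  = {'63C000CF', '6000002F', '603000AF', '63C111CF'};
sampleName = [tempname, '.bin'];
fid = fopen(sampleName, 'w');
fwrite(fid, hex2dec(hexValues), 'uint32');
fclose(fid);

% Read all words back as uint32.
fid = fopen(sampleName, 'r');
words = fread(fid, Inf, 'uint32=>uint32');
fclose(fid);
delete(sampleName);

labelMask = uint32(255);                              % low byte holds the label (assumed)
isCF      = bitand(words, labelMask) == hex2dec('CF'); % logical mask of CF words
payloadCF = bitshift(words(isCF), -8);                % upper 24 bits of each CF word
```

The mask comparison is vectorized over the whole array, so no string search is needed at all.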
Cedric on 4 Jun 2014
Edited: Cedric on 4 Jun 2014
The regexp should work directly (no adaptation), so it is easy to time it. You just take my code, update the file name, and run it within tic; toc:
tic ;
content = fileread( 'myFile.txt' ) ;
fprintf( 'Time for reading the file: %.2fs\n', toc ) ;
getByLabel = @(label) regexp( content, sprintf( '\\w{6}(?=%s)', label ), ...
                              'match' ) ;
tic ;
entries_CF = getByLabel( 'CF' ) ;
fprintf( 'Time for extracting one label: %.2fs\n', toc ) ;
Cedric on 4 Jun 2014
Please see new solution in the main answer.
Adam Kaas on 5 Jun 2014
Thanks Cedric! I've been playing with the regexp and it has been proving to be faster. I'll work on implementing your new solution. I appreciate your help!
Cedric on 5 Jun 2014
My pleasure!


Asked: 4 Jun 2014
Commented: 5 Jun 2014
