Data is not saving to the workspace
Aaron Smith
10 Feb 2017
I have a large text file composed of a single row of 52480000 numbers separated by semicolons. I'm attempting to organize the data into 51250 rows of 1024 numbers and then separate this into distinct blocks of 1025 x 1024. The numbers need to stay in the same order as in the original file (with every 1025th number being the start of a new row). I have tried using a while loop with an if statement:
R = 51250;
C = 1024;
fid = fopen( 'TEST_A.asc');
k = 0;
while ~feof(fid)
    z = textscan( fid, '%d', R*C, 'EndOfLine', ';');
    if ~isempty(z{1})
        k = k + 1;
        s = fprintf( 'TEST_A.asc', ';');
        dlmwrite( s, reshape( z{1}, 1025, []), ';')
    end
end
fclose(fid);
This code does not create an initial cell of 52480000 numbers, which means that none of the subsequent variables (s and z) are created in the workspace. The problem is that if I textscan the whole file into MATLAB before formatting it, I get an out-of-memory error. Does anyone notice anything that I don't about this code or have any pointers?
26 Comments
José-Luis
10 Feb 2017
Edited: José-Luis
10 Feb 2017
What is the size of that file? If the numbers had been stored in a binary file in double precision, that would still be more than 400 MB. A text file is bound to be much larger, and despite impressive progress, GB-sized files are a pain to process.
There are several ways of tackling this. Off the top of my head:
- Don't use a text file, go binary.
- Split your text file into manageable chunks beforehand.
- Use a database instead.
There are other ways but I can't be more specific without knowing what you are trying to achieve.
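As a rough illustration of the "go binary" suggestion, here is a minimal sketch that is not taken from the thread. It assumes the values fit in int32, uses a throwaway file name from tempname, and uses a tiny 2 x 3 block size as a stand-in for 1025 x 1024. With fixed-width binary values, any block can be read directly with fseek and fread, without parsing the data before it.

```matlab
% Hypothetical demo data: 24 integers stored as 4-byte int32 values.
vals = int32(1:24);
fnm = [tempname '.bin'];          % throwaway file name (assumption)
fid = fopen(fnm,'w');
fwrite(fid,vals,'int32');
fclose(fid);

R = 2;  % rows per block (stand-in for 1025)
C = 3;  % numbers per row (stand-in for 1024)
k = 3;  % index of the block to read

fid = fopen(fnm,'r');
% Skip the first k-1 blocks: each value occupies 4 bytes.
fseek(fid,(k-1)*R*C*4,'bof');
% fread fills column by column, so read C x R and transpose to get
% rows in file order.
blk = fread(fid,[C,R],'int32=>int32').';
fclose(fid);
delete(fnm);
disp(blk)
```

The transpose is needed for the same reason as with reshape: MATLAB stores and fills arrays column-major, while the file holds the numbers row by row.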
Stephen23
10 Feb 2017
Edited: Stephen23
10 Feb 2017
See earlier question:
"I'm attempting to organize the data into 51250 rows of 1024 numbers and then separate this into distinct blocks of 1025 x 1024"
Why do you need this intermediate step?
My answer showed you how to process exactly those blocks of 1025*1024, avoiding that intermediate matrix entirely. What do you gain by creating that huge matrix that you don't even want? My code shows how you can go directly to the smaller matrices (which seems to be your aim) without having to read the whole file into MATLAB and without the intermediate step of rearranging all of the data into one pointlessly huge matrix.
Why not just read the blocks you need (1025*1024) instead of wasting time and memory with that huge matrix?
"The numbers need to stay in the same order they were in in the original file (with every 1025th number being the start of a new row) "
Yes, and that is what my answer does. Change R = 51250; back to R = 1025; and this code will work too.
Aaron Smith
10 Feb 2017
Like I said, using your code, there is no output data. z and s do not appear in the workspace, and when I made alterations that did give s and z in the workspace, they were empty cells.
Aaron Smith
10 Feb 2017
The same problem occurs with that R value. I changed the values, hedging my bets, but it didn't make a difference to the result.
Stephen23
10 Feb 2017
Edited: Stephen23
10 Feb 2017
@Aaron Smith: when my code works properly then the contents of Z will be empty at the end of all iterations. What were you expecting?
>> size(Z)
ans =
1 1
>> size(Z{1})
ans =
0 1
Much more interesting would be the value of k: please tell me what value k has.
Aaron Smith
10 Feb 2017
z is 1 in my workspace and k is also 1. There is now an error occurring with reshape:
Error using reshape
Product of known dimensions, 1025, not divisible into total number of elements, 1.
Stephen23
10 Feb 2017
Edited: Stephen23
10 Feb 2017
"z is 1"
z is actually a cell array, so it cannot be equal to one. What do you really mean?
textscan is not reading the data file. Possibly the format is not as expected. Do the numbers have decimal digits, or exponent notation? Please run this and tell me exactly what values out has (it will be slow):
fid = fopen('file.txt','rt');
out = [];
while ~feof(fid)
    tmp = unique(fgets(fid,1e5));
    out = union(out,double(tmp));
end
fclose(fid);
disp(out)
And also show exactly what this displays:
fid = fopen('file.txt','rt');
str = fgets(fid,60)
fclose(fid);
Aaron Smith
10 Feb 2017
>> fid = fopen( 'TEST_A.asc', 'rt');
>> out = [];
>> while ~feof(fid)
tmp = unique(fgets(fid, 1e5));
out = union(out, double(tmp));
end
>> fclose(fid);
>> tmp
tmp =
067
>> out
out =
10 48 49 50 51 52 53 54 55 56 57
The data are all integers between 0 and 1000, though some may be over 1000. I just haven't been able to spot any numbers over 800. The file does have over 50 million numbers though.
fid = fopen( 'TEST_A.asc', 'rt' );
>> str = fgets(fid, 60)
str =
1
>> fclose(fid);
Stephen23
10 Feb 2017
Edited: Stephen23
10 Feb 2017
@Aaron Smith: the file contains newline characters (char 10), which means your original description of the file format, "I have a very large text file composed of, in essence one row of numbers.", is incorrect. Also, your original question had code where you used textscan with a semicolon delimiter, but there is not one single semicolon in the whole file.
As a result, that code tells textscan to read a file with a particular format, but it is not the format that the file actually has, because I wrote that code based on what you told me.
You can either experiment with textscan's options (e.g. EndOfLine, Delimiter, etc) yourself, or you can tell us exactly what format the file really has. If you want help then please upload a sample text file (the first two thousand numbers or so) in a new comment.
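To make the reasoning above concrete, here is a small sketch (not from the thread) that decodes the character codes reported in out. It only uses the values posted earlier; the variable names hasSemicolon, hasNewline and digitsFound are illustrative.

```matlab
% Character codes reported by the diagnostic loop earlier in the thread.
out = [10 48 49 50 51 52 53 54 55 56 57];

hasSemicolon = ismember(double(';'), out);   % ';' is character code 59
hasNewline   = ismember(10, out);            % char 10 is the newline
digitsFound  = char(out(out >= 48 & out <= 57));

fprintf('semicolon present: %d\n', hasSemicolon);
fprintf('newline present:   %d\n', hasNewline);
fprintf('digits present:    %s\n', digitsFound);
```

Codes 48 to 57 are the digits '0' to '9', so this inventory says the file that was inspected held only digits and newlines, with no delimiter matching the textscan call.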
Aaron Smith
10 Feb 2017
The file did not unzip correctly today so the file was not correct. I downloaded it again and unzipped it again
>> fid = fopen( 'TEST_A.asc', 'rt' );
str = fgets(fid, 60)
fclose(fid);
str =
1;658;671;661;686;672;662;645;654;669;675;650;688;666;664;66
This is the other test code you wrote
>> fid = fopen( 'TEST_A.asc', 'rt');
out = [];
while ~feof(fid)
tmp = unique(fgets(fid, 1e5));
out = union(out, double(tmp));
end
fclose(fid);
tmp
out
When I tried the original code with the newly properly unzipped file
R = 1025;
C = 1024;
fid = fopen('TEST_A.asc');
k = 0;
while ~feof(fid)
z = textscan( fid, '%d', R*C, 'EndOfLine', ';');
if ~isempty(z{1})
k = k + 1;
s = sprintf( 'TEST_A.asc', ';');
dlmwrite( s, reshape( z{1}, R, []), ';')
end
end
fclose(fid);
This gave an output for z which was a 1025 x 1 cell. This cell is the first row.
Aaron Smith
10 Feb 2017
R = 1025;
C = 1024;
opt = { 'EndofLine', ';', 'CollectOutput', true};
fid = fopen('TEST_A.asc');
k = 0;
while ~feof(fid)
z = textscan( fid, '%d', R*C, opt{:});
if ~isempty(z{1})
k = k + 1;
s = sprintf( 'TEST_A.asc', ';');
dlmwrite( s, reshape( z{1}, R, []), ';')
end
end
Error using reshape
Product of known dimensions, 1025, not divisible into total number of elements, 1.
I tried it again and got a different error
Stephen23
10 Feb 2017
Edited: Stephen23
10 Feb 2017
@Aaron Smith: What is k's value when you get that error?
You have been asked twice to upload a sample file. It will be difficult to help you further without it.
I know my code works: I tested it. I even gave you the code that I used to generate the fake data file. If there is any problem then it is because your data file does not match the expected format somehow. So we need to see it.
Could it be that the number of values in the file is not divisible by 50*1025 ? If so then you might need a special case to handle the last matrix. Again, knowing the value of k and a sample file would be helpful.
Stephen23
10 Feb 2017
Edited: Stephen23
10 Feb 2017
@Aaron Smith: Try this. It saves each block of 1025x1024 values in its own file, and if there are any values left over at the end it saves them as one row in a new file:
sbd = 'tempDir';  % subdirectory that holds the input and output files
R = 1025;
C = 1024;
opt = {'EndOfLine',';', 'CollectOutput',true};
fid = fopen(fullfile(sbd,'temp0.txt'),'rt');
k = 0;
while ~feof(fid)
    k = k+1;
    Z = textscan(fid,'%d', R*C, opt{:});  % read at most one block of values
    S = fullfile(sbd,sprintf('temp0_%02d.txt',k));
    if rem(numel(Z{1}),R)==0
        dlmwrite(S,reshape(Z{1},[],R).',';')  % complete block: R rows
    else
        dlmwrite(S,Z{1},';')  % leftover values: written as one row
    end
end
fclose(fid);
Note that I also added a transpose to get the data in the correct order.
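The effect of the transpose can be checked on a toy vector. This is a small sketch, not part of the original answer, with R = 2 standing in for 1025; the variable names are illustrative.

```matlab
v = 1:6;   % values in file order
R = 2;     % number of rows wanted (stand-in for 1025)

% reshape fills column by column, so reshaping straight to R rows
% interleaves the values:
interleaved = reshape(v, R, []);     % [1 3 5; 2 4 6]

% Reshaping to R columns and then transposing keeps each run of
% consecutive values together in one row:
inOrder = reshape(v, [], R).';       % [1 2 3; 4 5 6]

disp(interleaved)
disp(inOrder)
```

Because MATLAB arrays are column-major, the second form is the one that preserves the order of the numbers along each row.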
Aaron Smith
13 Feb 2017
Edited: Aaron Smith
13 Feb 2017
Thanks so much Stephen. I got an error with the code, but I think it might be a problem with the file itself or the save path:
sbd = 'tempDir';
R = 1025;
C = 1024;
opt = {'EndOfLine', ';', 'CollectOutput', true};
fid = fopen(fullfile( sbd, 'TEST_A.asc' ), 'rt');
k = 0;
while ~feof(fid)
k = k+1;
Z = textscan(fid, '%d', R*C, opt{:});
S = fullfile( sbd, sprintf( 'TEST_A_A.asc', k ));
if rem(numel( z{1}), R)==0
dlmwrite(S, reshape( z{1}, [], R).', ';')
else
dlmwrite( S, z{1}, ';')
end
end
fclose(fid);
Error using feof
Invalid file identifier. Use fopen to generate a valid file identifier
I'm sure I'll be able to fix that. What does sbd do? Is it system build which builds the blocks or does it make fullfile create separate files for the blocks rather than build a full file from parts the way fullfile usually does or is it just the temporary name of the files?
Walter Roberson
13 Feb 2017
sbd is the name of the subdirectory to save the individual files into. You can set it to '' if you do not want to use a subdirectory to store them.
Stephen23
13 Feb 2017
Edited: Stephen23
13 Feb 2017
sbd = 'tempDir';
is a subdirectory of the current directory. I put all of the files into this subdirectory because I did not want them cluttering up my current directory. You can make the subdir '' if you want to use the current directory, or (even better) learn to use directory paths and put your data in its own subdirectory.
Aaron Smith
13 Feb 2017
Yeah, I worked that out from reading pages on MATLAB and by writing a description of the code. Thanks guys. Any idea what the problem with the file identifier might be? It came up before and seemed to just go away after a few times typing it out. That hasn't worked this time. It isn't the save path or the file name that is causing the problem, as far as I know.
Stephen23
13 Feb 2017
@Aaron Smith: get the second output from fopen:
[fileID,errmsg] = fopen(...)
and read the error message. It always turns out to be a spelling mistake, folder permissions, or the file not being in the location that you are looking in.
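A minimal sketch of that check, not from the thread: the path below is deliberately one that does not exist (built with tempname), so the failure branch is exercised. In real code the path would be your own.

```matlab
% Hypothetical path that does not exist, to trigger the failure case.
fnm = fullfile(tempname, 'TEST_A.asc');

[fid, errmsg] = fopen(fnm, 'rt');
if fid < 0
    % fopen returns -1 on failure; errmsg holds the system's explanation.
    fprintf('Could not open "%s": %s\n', fnm, errmsg);
else
    fclose(fid);
end
```

Testing fid straight after fopen reports the problem at its source, rather than later as "Invalid file identifier" from feof.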
Aaron Smith
14 Feb 2017
Edited: Aaron Smith
14 Feb 2017
When using fopen outside of the code itself, it works fine and doesn't create an error. The only thing I can think it could be is the fullfile and sbd in the fopen command. I tried taking it out and moving it, but that creates errors with the code. Is there a way to put the fullfile(sbd, ...) part in a separate line?
sbd = 'tempdir';
R = 1025;
C = 1024;
opt = { 'EndOfLine', ';', 'CollectOutput', true };
>> fid = fopen(fullfile(sbd,'TEST_A.asc'),'rt');
>> k = 0;
while ~feof(fid)
k = k + 1;
Z = textscan( fid, '%d', R*C, opt{:});
S = fullfile( sbd, sprintf( 'TEST_ASA.asc', k ));
if rem( numel( Z{1}), R)==0
dlmwrite( S, reshape( Z{1}, [], R).', ';')
else
dlmwrite( S, Z{1}, ';')
end
end
Error using feof
Invalid file identifier. Use fopen to generate a valid file identifier.
>> [fid, errmsg] = fopen( 'TEST_A.asc' )
fid =
9
errmsg =
''
I was thinking, looking at the fullfile page on MathWorks: should I set up a folder to be a destination for the file?
f = fullfile('myfolder','mysubfolder','myfile.m')
I'm thinking it may be the subdirectory (sbd) that is causing the error
Stephen23
14 Feb 2017
@Aaron Smith: just get rid of the fullfile if you don't want it.
However I would recommend learning to use filepaths to access data files, as it makes your code faster and more reliable (e.g. compared to cd or other buggy ideas). Note that the file path I used is relative to the current directory, and that this may be different for the command window and the code that is being called: that path needs to exist relative to where the code runs from. One simple resolution is to always specify an absolute path. The internet is full of help on understanding relative/absolute paths, but you might as well start here:
"Is there a way to put the fullfile(sbd, ...) part in a separate line"
Sure, it is just a function, you can put it wherever you want to.
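As a sketch of both points (not from the thread): fullfile can sit on its own line, and combining it with pwd shows exactly which location a relative path resolves to. The folder name 'tempDir' and file name 'TEST_A.asc' are the ones used earlier; the variables absdir, folderExists and fileExists are illustrative.

```matlab
sbd = 'tempDir';                       % relative subdirectory name
absdir = fullfile(pwd, sbd);           % where that relative name points right now
fnm = fullfile(absdir, 'TEST_A.asc');  % path built on its own line

fprintf('MATLAB will look for: %s\n', fnm);

% exist returns 7 for a folder and 2 for a file.
folderExists = exist(absdir, 'dir') == 7;
fileExists   = exist(fnm, 'file') == 2;
fprintf('folder exists: %d, file exists: %d\n', folderExists, fileExists);

% fid = fopen(fnm, 'rt');   % open only once the path is confirmed
```

If either check prints 0, the path is what needs fixing, not the reading loop.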
Aaron Smith
15 Feb 2017
Is there a way for me to share my data file with you so that you can try your code with the actual data? The file is approximately 200 MB.
Stephen23
15 Feb 2017
Edited: Stephen23
15 Feb 2017
You could register with dropbox, mediafire, google drive, or one of the many other file sharing websites, and send me the link of the file (via my profile page: please also include a link to this thread otherwise the email will get deleted automatically).
Accepted Answer
Stephen23
15 Feb 2017
Edited: Stephen23
15 Feb 2017
Thank you for the file. What did I learn from the actual data file: that it is not "composed of a single row", but in fact there are 51200 rows in the file that I received.
Why is this important? Because computers are stupid, and they do exactly what they are told to do. Knowing how to read a file correctly requires knowing what format the file has. In this case it is also quite handy for us, because it is trivial to read and write lines without much processing.
The code below worked correctly for me, reading the 200 MB file, and creating 50 smaller files with the rows following the same order as the original file.
sbd = 'temp';  % subdirectory that holds the input and output files
f2d = fopen(fullfile(sbd,'temp_01.asc'),'wt');
f1d = fopen(fullfile(sbd,'TEST_A.asc'),'rt');
k = 0;
while ~feof(f1d)
    str = fgetl(f1d);
    if sscanf(str,'%d')==1  % the line counter is 1: a new block begins
        k = k+1;
        fclose(f2d);
        fnm = fullfile(sbd,sprintf('temp_%02d.asc',k));
        f2d = fopen(fnm,'wt');
    end
    fprintf(f2d,'%s\n',str);  % copy the line unchanged
end
fclose(f1d);
fclose(f2d);
Note that:
- the size of the output matrices is 1024x1025 (because there are 1025 numbers per line). This is correct because the first number of each line is simply a line count (check the files and you will see).
- the lines are exactly the same as the original file.
- MATLAB holds one line at a time: the lines are simply read from the large file and written directly to a new file.
- as a result: no matrix, no converting from string to numeric and back to string.
- it is slow because the file is large... reading and writing 51200 lines of 1025 numbers each will take some time.
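The same line-by-line approach can be tried on a tiny synthetic file before running it on the 200 MB one. This is a sketch under stated assumptions, not the code that was run on the real data: the scratch folder comes from tempname, the file names demo*.asc are made up, and each block has 3 lines of 3 numbers instead of 1024 lines of 1025.

```matlab
sbd = tempname;                 % hypothetical scratch folder
mkdir(sbd);
src = fullfile(sbd,'demo.asc');

% Write two blocks; the first number on each line is a counter that
% restarts at 1 for every block, as in the real file.
fid = fopen(src,'wt');
for b = 1:2
    for r = 1:3
        fprintf(fid,'%d;%d;%d;\n', r, 10*b+r, 20*b+r);
    end
end
fclose(fid);

% Copy lines to a new output file each time the counter reads 1.
f1d = fopen(src,'rt');
f2d = -1;
k = 0;
while true
    str = fgetl(f1d);
    if ~ischar(str), break; end          % fgetl returns -1 at end of file
    if isequal(sscanf(str,'%d',1), 1)
        if f2d > 0, fclose(f2d); end
        k = k+1;
        f2d = fopen(fullfile(sbd,sprintf('demo_%02d.asc',k)),'wt');
    end
    fprintf(f2d,'%s\n',str);
end
fclose(f1d);
if f2d > 0, fclose(f2d); end

% Read back the first line of the second block as a check.
fid = fopen(fullfile(sbd,'demo_02.asc'),'rt');
firstLine = fgetl(fid);
fclose(fid);
rmdir(sbd,'s');                 % remove the scratch folder
fprintf('%d files written, block 2 starts with "%s"\n', k, firstLine);
```

This variant tests the return value of fgetl instead of feof and opens the first output file only when the first counter is seen; both are design choices of the sketch rather than requirements.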
7 Comments
Aaron Smith
16 Feb 2017
Thanks Stephen. I knew about the line count number; I was just attributing it to columns rather than rows. I did think it was all one single row of data. Anyway, thanks so much for your continued help. There is an error message showing up but I'm not sure if there is a fix for it.
>> sbd = 'temp';
>> fid2 = fopen(fullfile( sbd, 'temp_01.asc'), 'w');
>> fid1 = fopen(fullfile( sbd, 'TEST_A.asc' ), 'r');
>> k = 0;
>> while ~feof(fid1)
str = fgetl(fid1);
if sscanf( str, '%d' )==1
k = k + 1;
fclose(fid2);
fnm = fullfile( sbd, sprintf( 'temp_%02d.asc', k));
fid2 = fopen( fnm, 'w');
end
fprintf(fid2, '%s\n', str);
end
Error using feof
Invalid file identifier. Use fopen to generate a valid file identifier.
[fid1, errmsg] = fopen( 'TEST_A.asc' )
fid1 =
6
errmsg =
''
>> [fid2, errmsg] = fopen( 'test_01.asc', 'w')
fid2 =
7
errmsg =
''
Stephen23
16 Feb 2017
Edited: Stephen23
17 Feb 2017
"i'm not sure if there is a fix for it."
You need to provide the correct filepath for your files. I put all of my files into one sub-directory of the current path named "temp". That worked for me. Do you see "temp" at the start of my code?
Imagine that you tell MATLAB (or any other programming language that has ever existed) to open this file: 'C:\Temp\myfile.txt'. But what should happen if there is no such file in that location? The programming language cannot read your mind: it cannot guess that you actually meant another location, e.g. 'C:\Temp\testfiles\myfile.txt', or that the file is actually called 'my_mistake.csv'. YOU are the one who has to know where your files are, and YOU have to provide the correct path to fopen (via fullfile if used).
So look at my code: I used a sub-directory named "temp". My files were all in that sub-directory. So I told MATLAB to look in that sub-directory. But when you test for those files like this:
[fid1, errmsg] = fopen( 'TEST_A.asc' )
Where is it looking?: ONLY IN THE CURRENT DIRECTORY. You did not tell fopen to look in any sub-directory, or in any other directories anywhere in your computer, or even anywhere else in the known universe. Just the current directory. Let me ask a question: is the file 'TEST_A.asc' in the current directory? If the answer is no, then why are you telling MATLAB to look for it in the current directory?
fopen failures are most commonly caused by one thing: users not giving the correct path (which includes spelling mistakes of the name).
"i'm not sure if there is a fix for it."
The fix is that you provide fopen with the correct path.
PS: [fid2, errmsg] = fopen( 'test_01.asc', 'w') is a pointless test because it just creates that file wherever you tell it to: see the "w" option? That creates a file. It does not care where.
PPS: Why did you get rid of the t option? You should keep it (unless you plan on doing strange things with EOL characters). Removing random things is not a good way of making code work.
Walter Roberson
16 Feb 2017
"fopen failures are caused by one thing: users not giving the correct path"
Well, that and permission errors. And networked file access to a server that is not accessible. And bugs in file sharing applications like DropBox. And bugs in using UNC paths. And VPN setup. And encryption certificate problems. And full disks. ...
Aaron Smith
20 Feb 2017
Thanks Stephen. I did eventually get the code working. The problem was with the save path. I had to specify destinations with the entire path (C\ files\ folder\ folder). You mentioned the first number on each line, the line number (1, 2, 3, 4 etc). Is there a way to remove or ignore this the way the headerlines command in the textscan function does?
Stephen23
20 Feb 2017
Edited: Stephen23
21 Feb 2017
You could use the sscanf call to get an index, e.g.:
>> str = '10;123;456;789;0;123;';
>> [row,~,~,idx] = sscanf(str,'%d')
row =
10
idx =
3
Or in my answer (untested):
sbd = 'temp';
f2d = fopen(fullfile(sbd,'temp_01.asc'),'wt');
f1d = fopen(fullfile(sbd,'TEST_A.asc'),'rt');
k = 0;
while ~feof(f1d)
    str = fgetl(f1d);
    [row,~,~,idx] = sscanf(str,'%d');
    if row==1
        k = k+1;
        fclose(f2d);
        fnm = fullfile(sbd,sprintf('temp_%02d.asc',k));
        f2d = fopen(fnm,'wt');
    end
    fprintf(f2d,'%s\n',str(idx+1:end));
end
fclose(f1d);
fclose(f2d);
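For reference, a short sketch (not from the thread) of what the four outputs of sscanf are and how the index is used to drop the line counter. The names count, errmsg and rest are illustrative; in the code above the second and third outputs are not needed, so the tilde (~) is used as a placeholder that discards them.

```matlab
str = '10;123;456;789;0;123;';

% Outputs: values read, number of values read, error message,
% and the index of the next character to be scanned.
[row, count, errmsg, idx] = sscanf(str, '%d');

% '%d' stops at the first ';', so row is 10 and idx points at that ';'.
% Skipping one more character removes the counter and its delimiter.
rest = str(idx+1:end);

fprintf('row = %d, idx = %d, rest = %s\n', row, idx, rest);
```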
Aaron Smith
21 Feb 2017
Thanks Stephen, that code works as far as I can see. What may I ask are the two ~ in the code doing?
Stephen23
21 Feb 2017
More Answers (0)