How would I 'trim-the-fat' off of individual text files that are part of a loop?
Hello,
I'm working on a script that's going to read single-column, no-header .txt files. In a perfect world each file would be an exact multiple of 36,000,000 lines of data; however, the data gets stored with an additional 1 to 5,000,000 lines that I do not need.
What I'm currently using is a file splitter on the Linux command line that splits the data into 36,000,000-line chunks and removes anything less than that. Here's what that looks like:
clear
echo "Hello Human. Please enter the date of the data to be analyzed [mmddyyyy]"
echo
read DataAnalysis
echo
echo "Would you like to analyze DEF, LFM, or SUM?"
echo
read DataType
echo
echo Thank you Human, please wait.......
echo
cd "$DataAnalysis"
split -d -l 36000000 *Live*$DataType* x0
split -d -l 36000000 *Dead*$DataType* x1
#Below, this removes anything with a length less than the bin time. This removes excess data
find . -name 'x*' | xargs -i bash -c 'if [ $(wc -l {}|cut -d" " -f1) -lt 36000000 ] ; then rm -f {}; fi'
mkdir Chopped
mv -S .txt x0* Chopped
mv -S .txt x1* Chopped
#Below, this turns all files into .txt files by adding the .txt suffix
find . -name 'x*' -print0 | xargs -0 -i% mv % %.txt
echo
echo
echo "*****Data Chop Complete Human*****"
echo
echo
Now this script depends on there being a single "LIVE" file and a single "DEAD" file, which isn't always going to be the case. I'm going to have multiple files with arbitrary names that need to be analyzed and concatenated in a specific order. What I currently have for file selection in MATLAB is the following:
%% Populate filenames for LINUX command line operation
clear
close all
clc
[FileNames PathNames]=uigetfile('Y:\Data\*.txt', 'Choose files to load:','MultiSelect','on'); %It opens the window for file selection
prompt = 'Enter save-name according to: file_mmddyyyy_signal ';
Filenamesave = input(prompt,'s');
Filenamesave = strcat(PathNames,Filenamesave,'.mat');
PathNames=strrep(PathNames,'L:','LabData');
PathNames=strrep(PathNames,'\','/');
PathNamesSave=strcat('/',PathNames);
save(Filenamesave,'FileNames','PathNames','PathNamesSave');
When I load the .mat file produced by this script, how would I write a script that scans every selected file and ignores the excess data points beyond each full 36,000,000-line block?
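Roughly, this is the kind of loop I'm picturing (just a sketch with placeholder calls like readmatrix, not tested):
% Rough sketch only -- readmatrix and the block handling are placeholders
BlockSize=36000000;                       % lines in one complete data block
if ~iscell(FileNames)                     % uigetfile returns a char for a single selection
    FileNames={FileNames};
end
for k=1:numel(FileNames)
    thisFile=fullfile(PathNames,FileNames{k});
    raw=readmatrix(thisFile);             % single numeric column (R2019a or later)
    nKeep=floor(numel(raw)/BlockSize)*BlockSize;   % number of lines in complete blocks
    data=raw(1:nKeep);                    % excess tail lines ignored
    % ...analysis/concatenation in the required order would go here...
end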
1 Comment
dpb on 27 Sep 2019
Edited: dpb on 27 Sep 2019
Have pointed this out numerous times but will try yet again...use fullfile() to build file names from the pieces-parts instead of string catenation operations and you won't have to mess with what the file separator character is--ML will take care of it automagically at runtime.
A corollary of the above: don't store a system-specific separator character in the default base names either; build them at runtime from the name strings with fullfile so they'll also match the OS you're running on.
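For example, a sketch of the file-selection snippet above rewritten with fullfile() (same variable names, the Linux-path remapping lines left out, untested):
% Sketch only: build names with fullfile() so the separator matches the OS at runtime
[FileNames,PathNames]=uigetfile(fullfile('Y:','Data','*.txt'),'Choose files to load:','MultiSelect','on');
Filenamesave=input('Enter save-name according to: file_mmddyyyy_signal ','s');
Filenamesave=fullfile(PathNames,[Filenamesave '.mat']);   % no hard-coded '\' or '/'
save(Filenamesave,'FileNames','PathNames');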
Accepted Answer
dpb on 27 Sep 2019
Edited: dpb on 28 Sep 2019
" I'm going to have multiple files with arbitrary names, that need to be analyzed and concat[e]nated "
Presuming this is related to the previous topic, I'd (yet again) suggest it's probably not necessary (or even desirable) to generate all the arbitrary intermediate files...
N=yourbignumber;
fid=fopen('yourreallyreallyreallybigfile.txt','r');
while ~feof(fid)
  [data,nread]=fscanf(fid,'%f',N);
  if nread<N
    % anything you want to do with the short section results goes here
  else
    % whatever you want to do with the full section results goes here
  end
end
fid=fclose(fid);
Inside that full-section clause can go the other loop we just went through, the one that uses the second magic number of 400K records to process.
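As a sketch, assuming the 400K figure from the other thread and that N is an exact multiple of it, that inner loop could look like:
M=400000;                      % the second magic number (assumed block size)
for i=1:N/M                    % assumes N is an exact multiple of M
  block=data((i-1)*M+1:i*M);   % i-th 400K-record section of the full read
  % ...per-block processing goes here...
end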
2 Comments
dpb on 30 Sep 2019
"each file which will ahve a tail end of data I don't need will simply be ignored"
Depends. The above will read up to N records -- there could be fewer records in the file, there could be an error inside the file, or there could be N or more records but an out-of-memory problem reading the full N.
In the above, you'll read however many sets there are in the file before the loop quits but you'll know how many records were read each time and can take action accordingly.
If you want only a fixed number of total records (some multiple of N), then you would need to use a counter to keep track of how many sets you've read and break when that's done.
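A sketch of that variant (with Nsets as a placeholder for however many full sections are wanted, and fid already opened as above):
Nsets=yourwantednumberofsets;       % placeholder -- how many full sections you want in total
nsets=0;                            % counter of full sections read so far
while ~feof(fid)
  [data,nread]=fscanf(fid,'%f',N);
  if nread<N, break, end            % short/empty tail section -- nothing more wanted
  % ...process the full section here...
  nsets=nsets+1;
  if nsets==Nsets, break, end       % read the wanted number of sets -- quit
end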
In the other thread, it is presumed that N is the total number of records wanted, and there's no need in that case for the while loop. This would be how to read the N=400K blocks if you don't read the whole wanted set in one go.
How you do this is up to you in the end; I was just trying to get you past the original postings of some time ago that were to break up the big file into a zillion little ones.
More Answers (1)
Guillaume on 26 Sep 2019
Edited: Guillaume on 26 Sep 2019
If I understood correctly:
opt = detectImportOptions(yourtextfile);
opt.DataLines = [1 36e6]; %only read the first 36000000 lines if there are more
data = readtable(yourtextfile, opt); %R2019a or later, use readmatrix instead if you want a plain matrix
If the files are guaranteed to have at least 36,000,000 lines then this would work as well:
data = csvread(yourtextfile, 0, 0, [0 0 36e6-1 0]); %range is zero-based: rows 0 through 35,999,999 of column 0
but this will error if there are fewer than 36,000,000 lines, unlike the 1st option, which will read whatever is there.
1 Comment
dpb on 27 Sep 2019
Edited: dpb on 27 Sep 2019
One can always put the read in a try...catch block to handle the short file section case.
N=yourbignumber;
fid=fopen(yourtextfile,'r');
try
  data=fscanf(fid,'%f',N);
catch ME
  % anything you want to do with the short section results goes here
end
fid=fclose(fid);
The above also will not error no matter the file size (well, it might, but you've anticipated it and have a way to handle it gracefully).
The other advantage of this way is that you get a 1D double array directly; the readtable option above will return the data in a MATLAB table object which, for just one variable, doesn't have much benefit.