Error reading data from combined datastore

Kartik Chandran on 12 Sep 2022
Answered: Walter Roberson on 12 Sep 2022
Dear all,
I'm working on a bioinformatics script and am trying to read in two large sequence (.fastq) text files and process them in parallel. Each entry in File 1 has a corresponding entry in File 2 and they need to be processed together. See attached for an example set of files (renamed .txt so they can be uploaded here).
To do this, I first created two separate datastores, one with each file, as follows:
ds_R1 = tabularTextDatastore(reads_folder_R1,"FileExtensions",[".fastq",".fastq.gz"]);
ds_R2 = tabularTextDatastore(reads_folder_R2,"FileExtensions",[".fastq",".fastq.gz"]);
I also set ReadSize on both datastores to pull 100,000 lines at a time:
ds_R1.ReadSize = 100000;
ds_R2.ReadSize = 100000;
To ensure concurrent handling of each file pair, I combined the two datastores (each containing one file in the pair as above):
% Combine the two datastores so that each read returns both files of the pair
ds_R1_R2 = combine(ds_R1, ds_R2);
I created a while loop to pull data from the combined datastore ds_R1_R2 into a cell array 'reads', do operations on that cell array, and write the output to file.
while hasdata(ds_R1_R2)
    [reads, info] = read(ds_R1_R2);
    % Convert reads table to cell array
    reads = table2cell(reads);
    reads_R1 = reads(:,1);
    reads_R2 = reads(:,2);
    % do stuff to reads_R1 and reads_R2
end
I tested this code out and it works fine for a number of iterations of the while loop. However, it always fails with the following error message after the same number of iterations for a given pair of files (the exact iteration depends on which file pair it is processing).
Error using matlab.io.datastore.CombinedDatastore/read (line 144)
All tables being horizontally concatenated must have the same number of rows.
I've checked and confirmed that the number of lines in each file is exactly the same. The error is also thrown pretty early on and there is plenty of data remaining, so it's not because the end of the files is approached.
I'm quite puzzled and would greatly appreciate any input.
Thanks!
Kartik
2 Comments
Walter Roberson on 12 Sep 2022
If you set dbstop if caught error, run until the error, and save() the values to a file, then dbquit, load the file contents back, and try the [] operation manually, does it succeed?
I suspect that there is a try/catch, and I wonder whether a different error is being caught but reported as if it were a problem with differing numbers of rows.
Kartik Chandran on 12 Sep 2022
Hi Walter,
I went ahead and placed a breakpoint in CombinedDatastore.m at the step where it is retrieving data from the underlying datastores.
numDatastores = numel(ds.UnderlyingDatastores);
data = cell(1, numDatastores);
info = cell(1, numDatastores);
for ii = 1:numel(ds.UnderlyingDatastores)
    [data{ii}, info{ii}] = read(ds.UnderlyingDatastores{ii});
    data{ii} = iMakeUniform(data{ii}, ds.UnderlyingDatastores{ii});
end
data = horzcat(data{:});
The last datasets from each datastore (data{1} and data{2}) prior to failure indeed have a different number of rows: 53,861 and 60,516. There is way more data there so it's not clear why 100,000 lines were not retrieved this time as in previous iterations. Also, parity between the datastores was not maintained.
I've attached these files as well. The following data look no different, so I'm really at a loss to understand why different numbers of rows would be retrieved (and <100,000).
Thanks,
kc
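The check described in this comment can also be run without a breakpoint inside CombinedDatastore.m. The following is a minimal sketch, not part of the original thread: it reads two datastores side by side and reports the first read at which the returned tables differ in height. The function name firstMismatch and its variables are hypothetical; it assumes both inputs are datastores whose read returns a table, as tabularTextDatastore does.

```matlab
function iter = firstMismatch(dsA, dsB)
% firstMismatch  (hypothetical helper) Index of the first read at which the
% two datastores return tables of different heights, or 0 if none is found.
    reset(dsA);                       % start both datastores from the beginning
    reset(dsB);
    iter = 0;
    k = 0;
    while hasdata(dsA) && hasdata(dsB)
        k = k + 1;
        tA = read(dsA);               % each call returns at most ReadSize rows
        tB = read(dsB);
        if height(tA) ~= height(tB)
            fprintf('Read %d: %d rows vs %d rows\n', k, height(tA), height(tB));
            iter = k;
            return
        end
    end
end
```

With the datastores from the question, this would be called as firstMismatch(ds_R1, ds_R2) before combining them.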


Answers (1)

Walter Roberson on 12 Sep 2022
ReadSize defines a maximum number of rows to read at one time, but a read is permitted to return fewer rows. In particular, the datastore has some kind of internal buffer and avoids overfilling it. If two different datastores have substantially different numbers of columns (or different widths for each column), then the buffer could fill up after fewer rows for the datastore that has more (or wider) columns.
You could reduce the ReadSize to the point where each chunk of the wider datastore fits within the buffer.
The size of the buffer does not appear to be documented.
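Because the buffer size is undocumented, a safe ReadSize can only be found by trial. As an alternative that does not depend on it, the sketch below (not part of the answer above) skips combine and reads the two datastores separately, carrying leftover rows forward so that rows stay paired even when the reads return different numbers of rows. The function name processPaired and the callback chunkFcn are hypothetical; the sketch assumes both datastores return tables and contain the same total number of rows.

```matlab
function nPaired = processPaired(dsA, dsB, chunkFcn)
% processPaired  (hypothetical helper) Pass row-aligned chunks from two
% datastores to chunkFcn(chunkA, chunkB); returns the number of rows paired.
    reset(dsA);
    reset(dsB);
    bufA = table();                   % rows read from dsA but not yet paired
    bufB = table();                   % rows read from dsB but not yet paired
    nPaired = 0;
    while true
        % Refill a buffer only when it is empty, so memory use stays bounded.
        if isempty(bufA) && hasdata(dsA), bufA = read(dsA); end
        if isempty(bufB) && hasdata(dsB), bufB = read(dsB); end
        n = min(height(bufA), height(bufB));
        if n == 0
            break                     % at least one datastore is exhausted
        end
        chunkFcn(bufA(1:n, :), bufB(1:n, :));
        bufA(1:n, :) = [];            % keep only the unpaired remainder
        bufB(1:n, :) = [];
        nPaired = nPaired + n;
    end
end
```

With the datastores from the question, the body of the while loop would move into a function passed as chunkFcn, for example processPaired(ds_R1, ds_R2, @myChunkHandler), where myChunkHandler is a placeholder name. Chunk sizes will vary between calls, but row i of the first chunk always corresponds to row i of the second.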

Release

R2021b
