How to remove columns in very large matrices.

james ingham on 15 Feb 2025
Answered: Walter Roberson on 15 Feb 2025
Hi all,
I have run into a problem: it is taking a huge amount of time to remove columns from a large matrix.
I am using the standard approach of:
matrix(:,badcols) = [];
This works, but I am currently handling large matrices (~1,750,000 x 2,000), and it is taking much longer than expected to simply remove a few columns. Is this just a practical limitation of working with large datasets, or is there a way I can better handle the data to process this more efficiently?
Thanks for the help!
J
Comments: 1
dpb on 15 Feb 2025
>> x=zeros(1750000,2000);
Error using zeros
Requested 1750000x2000 (26.1GB) array exceeds maximum array size preference (15.9GB). This might cause MATLAB to become unresponsive.
You're quite possibly running into disk thrashing owing to exceeding the memory available...
It probably won't make any difference, but you could try the alternative syntax (with badcols as a logical mask) of
matrix=matrix(:,~badcols);
Other than that, breaking it down to process the data in smaller chunks either by row or column depending on what the operations are is the classic approach, yes.
Or, check into the tools for large datasets under the "Large Files and Big Data" section under "Data Import and Analysis".
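The `~badcols` form above only selects the intended columns when `badcols` is a logical mask; negating a vector of numeric column indices gives an all-false mask. A minimal sketch of the conversion, assuming `badcols` holds numeric indices (the small `magic(6)` array and the indices are illustrative stand-ins, not the poster's data):

```matlab
% Hypothetical small example; sizes and indices are illustrative only.
matrix  = magic(6);          % 6x6 stand-in for the large array
badcols = [2 5];             % numeric column indices to drop

% Convert the index list to a logical mask over the columns.
mask = false(1, size(matrix, 2));
mask(badcols) = true;

kept = matrix(:, ~mask);     % keep every column not flagged as bad

% Cross-check against deletion by index.
ref = matrix;
ref(:, badcols) = [];
sameResult = isequal(kept, ref);
disp(sameResult)
```

Both forms still build a new array, so this is about correctness of the syntax rather than speed.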


Answers (3)

John D'Errico on 15 Feb 2025
GET MORE RAM!!!!!!!!!
RAM IS CHEAP!!!!!!!!
Need I say it again? You have a quite large array.
(1.75e6*2000*8)/1e9
ans = 28
That array, in double precision form, uses approximately 28 gigabytes of RAM. When you remove some random columns, you force MATLAB to completely reallocate the entire array and copy all 28 gigabytes over to the new addresses. That means you need roughly 56 gigabytes of free memory, because temporarily there will be two versions of your data.
What can you do? The simplest thing is to work in single precision. If your array is integer, int16 or int8 will be even better in terms of memory required. But single might be sufficient, since it cuts the memory load by half, and it is still a floating point form.
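As a rough illustration of the arithmetic above, the following sketch tabulates the footprint of a 1,750,000 x 2,000 array for each element class mentioned. It uses decimal gigabytes, as in the calculation above, and counts only the raw data, not any temporary copies.

```matlab
% Rough in-memory footprint of a 1,750,000 x 2,000 array by element class.
nRows = 1.75e6;
nCols = 2000;
classes      = {'double', 'single', 'int16', 'int8'};
bytesPerElem = [8, 4, 2, 1];          % storage size of one element

gb = nRows * nCols * bytesPerElem / 1e9;   % decimal gigabytes per class

for k = 1:numel(classes)
    fprintf('%-6s : %4.1f GB\n', classes{k}, gb(k));
end
```

This works out to 28, 14, 7 and 3.5 GB respectively, and the space needed during a column deletion is roughly double each figure.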
What else can you do? Learn to use tall arrays! A tall array is designed to work on huge arrays like this. By way of comparison, done on my own computer with only 40 GB of RAM, I see these times:
A = rand(1e6,1000);
tic,A(:,3) = [];toc
Elapsed time is 9.746878 seconds.
A = rand(1e6,1000,'single');
tic,A(:,3) = [];toc
Elapsed time is 2.885848 seconds.
A = int8(20*randn(1e6,1000));
tic,A(:,3) = [];toc
Elapsed time is 0.825612 seconds.
A = tall(rand(1e6,1000));
tic,A(:,3) = [];toc
Elapsed time is 0.006552 seconds.
As you can see, by use of single, or int8 where possible, I was able to seriously cut the time required to remove the specified column. That is entirely due to the smaller footprint of the array itself. However, the tall array put them all to shame, and it did not force me to reduce the precision in any way. Of course, tall arrays do take some additional effort to learn and to use.
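One caveat worth checking before relying on the tall timing: tall array operations are generally deferred until `gather` is called, so the 0.0066 s figure may reflect only queuing the deletion rather than performing it. A minimal sketch, on a deliberately small array (the size is an assumption chosen so it runs anywhere), that times the two steps separately:

```matlab
% Small stand-in array; the size is illustrative, not the poster's data.
A = tall(rand(1e4, 100));

tic
A(:, 3) = [];            % deletion is queued on the tall array
tQueue = toc;

tic
B = gather(A);           % forces evaluation and returns an in-memory array
tGather = toc;

fprintf('queue: %.4f s, gather: %.4f s\n', tQueue, tGather);
disp(size(B))
```

If the gather step dominates, the comparison with the in-memory timings should be made on the sum of the two.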

Matt J on 15 Feb 2025
Edited: Matt J on 15 Feb 2025
Use a sparse matrix, if applicable.
matrix=sprand(1750000 , 2000, 1000/1750000);
whos matrix
  Name             Size                  Bytes  Class     Attributes

  matrix      1750000x2000            32006920  double    sparse
tic;
matrix(:,1:3:end)=[];
toc
Elapsed time is 0.004864 seconds.
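Whether sparse storage is "applicable" depends on how many entries are nonzero. A minimal sketch for checking that before converting, using a small mostly-zero stand-in array; the 0.05 density cutoff is an arbitrary illustrative threshold, not a MATLAB rule:

```matlab
% Hypothetical stand-in data: a mostly-zero 2000x300 array.
rng(0)
A = zeros(2000, 300);
A(randperm(numel(A), 500)) = rand(1, 500);   % 500 nonzero entries

density = nnz(A) / numel(A);      % fraction of nonzero elements
fprintf('density = %.4g\n', density);

% The 0.05 cutoff is an assumption for illustration only.
if density < 0.05
    S = sparse(A);                % store only the nonzeros
    S(:, 1:3:end) = [];           % same deletion syntax as for full arrays
    whos A S                      % compare memory footprints
end
```

For dense data, sparse storage takes more memory than a full array, so the check is worth doing first.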

Walter Roberson on 15 Feb 2025
Working by selecting columns to save is marginally slower than working with columns to delete, on average. The timing overlaps -- the slowest select-to-delete was worse than the slowest select-to-save.
Meanwhile, punting through a simple function took twice as long (!!). This is surprising as punting through a function should have invoked potential in-place modification.
rng(655321)
x = rand(17500,2000);
badcols = randi(size(x,2), 1, 20);
tic
x = x(:,setdiff(1:size(x,2),badcols));
toc
Elapsed time is 0.144821 seconds.
rng(655321)
x = rand(17500,2000);
badcols = randi(size(x,2), 1, 20);
tic
x(:,badcols) = [];
toc
Elapsed time is 0.119520 seconds.
rng(655321)
x = rand(17500,2000);
badcols = randi(size(x,2), 1, 20);
tic
x = DeleteColumns(x, badcols);
toc
Elapsed time is 0.270778 seconds.
rng(655321)
x = rand(17500,2000);
badcols = randi(size(x,2), 1, 20);
tic
x = NullColumns(x, badcols);
toc
Elapsed time is 0.151895 seconds.
function x = DeleteColumns(x, badcols)
x(:,badcols) = [];
end
function x = NullColumns(x, badcols)
x(:,badcols) = 0;
end
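Single tic/toc runs like those above are noisy. As a hedged alternative, the comparison can be repeated with `timeit`, which calls each function several times and returns the median of the measurements. The helper names `deleteCols` and `keepCols` are hypothetical and not part of the test above, and `unique` is added because `randi` can return repeated column indices.

```matlab
% Sketch: time both approaches with timeit on the same array size as above.
rng(655321)
x = rand(17500, 2000);
badcols = unique(randi(size(x, 2), 1, 20));   % randi can repeat indices

tDelete = timeit(@() deleteCols(x, badcols));
tKeep   = timeit(@() keepCols(x, badcols));
fprintf('delete: %.4f s, keep: %.4f s\n', tDelete, tKeep);

% Local helpers (hypothetical names, for illustration only).
function y = deleteCols(x, badcols)
y = x;
y(:, badcols) = [];          % delete the listed columns
end

function y = keepCols(x, badcols)
y = x(:, setdiff(1:size(x, 2), badcols));   % keep the remaining columns
end
```

Because `x` is reused across calls, neither helper can modify it in place, so both timings include the cost of building the reduced array.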
