Find duplicate elements and remove the rows that have similar values in one column

Question

Hamid on 18 Oct 2022


https://kr.mathworks.com/matlabcentral/answers/1829548-find-duplicate-elements-and-remove-the-rows-that-has-similar-values-in-one-column

Commented: Jan on 19 Oct 2022

Dear Matlab experts,

I am using the following function to find and remove the rows that share the same value in their 9th column. The calculation is very slow because the data set is large. Do you have any suggestions for modifying my code to make it faster, or any other way to achieve this?

Thank you in advance.

function in1=dup_remove(out2)
b=[];
for i=1:size(out2,1)
    [r,c]=find(out2(:,9)==out2(i,9));
    if(length(r)==1)
        b=[b;out2(i,:)];
    end
end
if (~isempty(b))
    in1=b;
end
end


Hamid on 19 Oct 2022

@KSSV Thank you for your suggestion and the support.

Jan on 19 Oct 2022

@KSSV: How? I've tried it without success. The only way I've found with standard MATLAB functions uses unique to get the list of occurring values and histcounts to identify the elements that occur only once. This was much slower than sorting the input, comparing neighbors with diff, removing the duplicates, and restoring the original order.
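For readers who want to see what the unique/histcounts route might look like, here is a minimal sketch on toy data. It is one possible reading of the approach described above, not necessarily the code that was actually timed; the variable names (`grp`, `cnt`, `keep`) and the toy matrix are illustrative assumptions, and the input is assumed to be non-empty.

```matlab
% Toy data (assumption): 5 rows, 9 columns; column 1 holds a row label,
% column 9 holds the key used for the duplicate test.
x  = [(1:5).', zeros(5, 7), [3; 1; 3; 2; 1]];
x9 = x(:, 9);

% grp(i) is the position of x9(i) in the sorted list of unique values u.
[u, ~, grp] = unique(x9);

% Count how often each unique value occurs. The edges 1:(numel(u)+1)
% give one bin per unique value.
cnt = histcounts(grp, 1:(numel(u) + 1));

% Keep only the rows whose key occurs exactly once.
keep = (cnt(grp) == 1);
y    = x(keep, :);
```

On this toy input only the row with key 2 survives, because the keys 1 and 3 each occur twice.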


Answer 1

Jan on 18 Oct 2022


https://kr.mathworks.com/matlabcentral/answers/1829548-find-duplicate-elements-and-remove-the-rows-that-has-similar-values-in-one-column#answer_1078058

Edited: Jan on 19 Oct 2022


Avoid growing arrays iteratively, because this is extremely expensive. See:

x = [];
for k = 1:1e6
    x(k) = rand;
end

This creates a new vector x in each iteration and copies the former contents into it, so MATLAB reserves and copies sum(1:1e6)*8 bytes in total, which is more than 4 TB!

Pre-allocation solves the problem:

x = zeros(1, 1e6);
for k = 1:1e6
    x(k) = rand;
end

This reserves only 8 MB and copies just the scalar elements.
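The two figures quoted above can be checked with a few lines of arithmetic. This is only a sketch of the bookkeeping, under the assumption that every iteration of the growing loop reallocates and copies the whole vector; the variable names are illustrative.

```matlab
n              = 1e6;   % number of loop iterations
bytesPerDouble = 8;     % size of one double in bytes

% Growing: iteration k allocates a vector of k doubles.
grownBytes = sum(1:n) * bytesPerDouble;

% Pre-allocation: a single vector of n doubles.
preallocBytes = n * bytesPerDouble;

fprintf('Growing:       %.7g TB\n', grownBytes / 1e12);
fprintf('Pre-allocated: %.7g MB\n', preallocBytes / 1e6);
```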

In your case:

function y = dup_remove(x)
x9    = x(:, 9);   % Slightly faster than indexing each time
n     = size(x,1);
match = false(n, 1);
for i = 1:n
    [r, c]   = find(x9 == x9(i));
    match(i) = (numel(r) == 1);
end
y = x(match, :);
end

It is confusing to call the input "out2" and the output "in1".

A smarter method:

function y = dup_remove(x)
x9       = x(:, 9);     % Slightly faster than indexing each time
T        = true(numel(x9), 1);
[S, idx] = sort(x9(:).');
m        = [true, diff(S) ~= 0];
ini      = strfind(m, [true, false]);
m(ini)   = false;       % Also unmark the 1st occurrence of repeated values
T(idx)   = m;           % Restore original order
y = x(T, :);
end

Sorting avoids comparing each element with all the others: only one comparison with the neighbor is required.
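To make the steps of the sorting method easier to follow, here is a trace on a small key vector. The vector `x9` is a made-up example; the intermediate values in the comments were worked out by hand for this input.

```matlab
x9 = [3; 1; 3; 2; 1];               % toy keys (assumption)

[S, idx] = sort(x9(:).');           % S   = [1 1 2 3 3], idx = [2 5 4 1 3]
m        = [true, diff(S) ~= 0];    % m   = [1 0 1 1 0], start of each run
ini      = strfind(m, [true, false]);   % ini = [1 4], runs longer than 1
m(ini)   = false;                   % m   = [0 0 1 0 0], unique keys only

T        = true(numel(x9), 1);
T(idx)   = m;                       % T   = [0; 0; 0; 1; 0], original order
```

Only the fourth element (key 2) occurs once, so `T` selects just that row.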


Jan on 18 Oct 2022

Edited: Jan on 18 Oct 2022


Some timings:

x = randi([0, 65535], 1e4, 9);
n = 10;  % Repeat loops for accurate timings
tic
for k = 1:n
    y0 = dup_remove(x);
end
toc  % Original:
Elapsed time is 2.469073 seconds.
tic
for k = 1:n
    y1 = dup_remove1(x);
end
toc  % Avoid iterative growing:
Elapsed time is 1.181532 seconds.
tic
for k = 1:n
    y11 = dup_remove11(x);
end
toc  % Without FIND:
Elapsed time is 0.974451 seconds.
tic
for k = 1:n
    y2 = dup_remove2(x);
end
toc  % Using SORT and comparison of neighbors:
Elapsed time is 0.011396 seconds.
function in1=dup_remove(out2)
b=[];
for i=1:size(out2,1)
    [r,c]=find(out2(:,9)==out2(i,9));
    if(length(r)==1)
        b=[b;out2(i,:)];
    end
end
if (~isempty(b))
    in1=b;
end
end
function y = dup_remove1(x)
x9 = x(:, 9);   % Slightly faster than indexing each time
n  = size(x,1);
m  = false(n, 1);
for i = 1:n
    [r, c] = find(x9 == x9(i));
    m(i)   = (numel(r) == 1);
end
y = x(m, :);
end
function y = dup_remove11(x)
x9 = x(:, 9);   % Slightly faster than indexing each time
n  = size(x,1);
m  = false(n, 1);
for i = 1:n
    m(i) = (sum(x9 == x9(i)) == 1);
end
y = x(m, :);
end
function y = dup_remove2(x)
x9       = x(:, 9);     % Slightly faster than indexing each time
T        = true(numel(x9), 1);
[S, idx] = sort(x9(:).');
m        = [true, diff(S) ~= 0];
ini      = strfind(m, [true, false]);
m(ini)   = false;       % Also unmark the 1st occurrence of repeated values
T(idx)   = m;           % Restore original order
y = x(T, :);
end
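When comparing timings it is worth confirming that the fast variant selects the same rows as the slow one. The following sketch does this for the row masks only, on random integer keys; the data size, value range, and the names `ref` and `T` are assumptions chosen for illustration.

```matlab
rng(0);                               % reproducible toy data
x9 = randi([0, 50], 200, 1);          % many repeated keys on purpose
n  = numel(x9);

% Reference mask: compare each key with all others (loop version).
ref = false(n, 1);
for i = 1:n
    ref(i) = (sum(x9 == x9(i)) == 1);
end

% Sort-based mask: compare each key only with its sorted neighbor.
[S, idx] = sort(x9.');
m        = [true, diff(S) ~= 0];
m(strfind(m, [true, false])) = false;
T        = true(n, 1);
T(idx)   = m;

sameResult = isequal(ref, T);         % both masks should agree
```

This check covers finite numeric keys; inputs containing NaN would need separate consideration, since NaN never compares equal to itself.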

Hamid on 19 Oct 2022

Edited: Hamid on 19 Oct 2022

@Jan Thank you so much. I learned a lot studying your answer.

