How to delete similar data from big data file?

Question

M.Shaarawy 2020년 1월 9일

0
링크

이 질문에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/499680-how-to-delete-similar-data-from-big-data-file

댓글: Sindar 2020년 1월 13일

I have a big data file consists of 100000 rows and 5 columns ..and within this file there rows that are similar to each other,so I need to delete the similar ones and let one of them and delete the other similar ones ( fir example, if I have 5 rows have the same data ..I need a line code to choose the maximum value of the five rows and delete the other 4 rows and if all similar ...delete 4 and makes one only present.

댓글 수: 6
이전 댓글 4개 표시이전 댓글 4개 숨기기

per isakson 2020년 1월 13일

MATLAB Online에서 열기

You received two good answers. However, one question: does the order of the rows in the result matter?

Your sample data is, in more readable form

%%
data = [     
        1970 30   30   4 
        1971 30.5 30.5 4.2 
        1970 31   32   4.2 
        1972 32   33   5 
        1972 32   33   5 
        1973 32.1 32.2 5 
        1974 34   35   4.5
        ];

Sindar 2020년 1월 13일

Ah, yes, I forgot to mention that both my answers will sort by the first three columns (year, then x, then y).

댓글을 달려면 로그인하십시오.

이 질문에 답변하려면 로그인하십시오.

Answer 1

Sindar 2020년 1월 12일

0
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/499680-how-to-delete-similar-data-from-big-data-file#answer_409803

MATLAB Online에서 열기

Alright, here's my solution. Its speed depends on how often you expect duplicates (controlled in test by test_vals). I ran a test with very few duplicates (test_vals = 200, only 600/1e5 duplicates) and it took 8s. More duplicates runs faster. There are probably ways to speed up the idx search, but it seems fast enough.

% setting up some test data, skip for your real data
test_rows = 1e5;
test_vals = 5;
% make a test matrix with random integers 1-test_vals in first three columns, random number in last
data = [randi(test_vals,test_rows,3) rand(test_rows,1)];
% find duplicate data in the first three columns
% ia lists the row # in data of the first row in each duplicate
% ic lists the duplicate-group of each row in data
[~,ia,ic] = unique(data(:,1:3),'rows');
% loop over the duplicate-groups
for ind=1:length(ia)
    % skip ahead if no duplicates (comment out if you expect few unique rows)
    if nnz(ic==ind)==1
        idx(ind)=ia(ind);
    else
        % data(:,4).*(ic==ind) takes the z-column, and zeros it except for the current duplicate-group
        % then, stores the index of the max z row in idx
        [~,idx(ind)] = max(data(:,4).*(ic==ind));
    end
end
% select only the max-z rows for each duplicate-group
clean_data = data(idx,:);

댓글 수: 3
이전 댓글 1개 표시이전 댓글 1개 숨기기

Sindar 2020년 1월 13일

actually, it can't. We need the index from the whole array, not the subset that logical indexing will give you. There may be a better way than mine, but this isn't it.

per isakson 2020년 1월 13일

편집: per isakson 2020년 1월 13일

MATLAB Online에서 열기

Yes indeed, that's wrong. Isn't it even syntactically incorrect and ought to have thrown an error.

I should have written

data( ic==ind, 4 )

댓글을 달려면 로그인하십시오.

Answer 2

Sindar 2020년 1월 12일

0
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/499680-how-to-delete-similar-data-from-big-data-file#answer_409808

편집: per isakson 2020년 1월 13일

MATLAB Online에서 열기

Looking into it a bit (I have a similar problem), it seems like there is a way using the "Split-Apply-Combine Workflow". It is faster for many duplicate groups, but slower for few (for my in-memory tests), but I think it should scale much better to distributed memory and/or parallel computation.

% lists the duplicate-group of each row in data, equal to ic, above
G = findgroups(data(:,1),data(:,2),data(:,3));
% for each group, apply the maxrow function
% this takes the 4th column of data (your z's), finds the index of the max, then 
% returns that full row repeated for each duplicate-group
splitapply(@(x1) maxrow(x1(:,4),x1),data,G)
% return the row of full_data corresponding to the index of the maximum of sort_vec
function x = maxrow(sort_vec,full_data)
    [~,idx] = max(A);
    x = B(idx,:);
end