필터 지우기
필터 지우기

How to delete similar data from big data file?

조회 수: 1 (최근 30일)
M.Shaarawy
M.Shaarawy 2020년 1월 9일
댓글: Sindar 2020년 1월 13일
I have a big data file consists of 100000 rows and 5 columns ..and within this file there rows that are similar to each other,so I need to delete the similar ones and let one of them and delete the other similar ones ( fir example, if I have 5 rows have the same data ..I need a line code to choose the maximum value of the five rows and delete the other 4 rows and if all similar ...delete 4 and makes one only present.
  댓글 수: 6
per isakson
per isakson 2020년 1월 13일
You received two good answers. However, one question: does the order of the rows in the result matter?
Your sample data is, in more readable form
%%
data = [
1970 30 30 4
1971 30.5 30.5 4.2
1970 31 32 4.2
1972 32 33 5
1972 32 33 5
1973 32.1 32.2 5
1974 34 35 4.5
];
Sindar
Sindar 2020년 1월 13일
Ah, yes, I forgot to mention that both my answers will sort by the first three columns (year, then x, then y).

댓글을 달려면 로그인하십시오.

답변 (2개)

Sindar
Sindar 2020년 1월 12일
Alright, here's my solution. Its speed depends on how often you expect duplicates (controlled in test by test_vals). I ran a test with very few duplicates (test_vals = 200, only 600/1e5 duplicates) and it took 8s. More duplicates runs faster. There are probably ways to speed up the idx search, but it seems fast enough.
% setting up some test data, skip for your real data
test_rows = 1e5;
test_vals = 5;
% make a test matrix with random integers 1-test_vals in first three columns, random number in last
data = [randi(test_vals,test_rows,3) rand(test_rows,1)];
% find duplicate data in the first three columns
% ia lists the row # in data of the first row in each duplicate
% ic lists the duplicate-group of each row in data
[~,ia,ic] = unique(data(:,1:3),'rows');
% loop over the duplicate-groups
for ind=1:length(ia)
% skip ahead if no duplicates (comment out if you expect few unique rows)
if nnz(ic==ind)==1
idx(ind)=ia(ind);
else
% data(:,4).*(ic==ind) takes the z-column, and zeros it except for the current duplicate-group
% then, stores the index of the max z row in idx
[~,idx(ind)] = max(data(:,4).*(ic==ind));
end
end
% select only the max-z rows for each duplicate-group
clean_data = data(idx,:);
  댓글 수: 3
Sindar
Sindar 2020년 1월 13일
actually, it can't. We need the index from the whole array, not the subset that logical indexing will give you. There may be a better way than mine, but this isn't it.
per isakson
per isakson 2020년 1월 13일
편집: per isakson 2020년 1월 13일
Yes indeed, that's wrong. Isn't it even syntactically incorrect and ought to have thrown an error.
I should have written
data( ic==ind, 4 )

댓글을 달려면 로그인하십시오.


Sindar
Sindar 2020년 1월 12일
편집: per isakson 2020년 1월 13일
Looking into it a bit (I have a similar problem), it seems like there is a way using the "Split-Apply-Combine Workflow". It is faster for many duplicate groups, but slower for few (for my in-memory tests), but I think it should scale much better to distributed memory and/or parallel computation.
% lists the duplicate-group of each row in data, equal to ic, above
G = findgroups(data(:,1),data(:,2),data(:,3));
% for each group, apply the maxrow function
% this takes the 4th column of data (your z's), finds the index of the max, then
% returns that full row repeated for each duplicate-group
splitapply(@(x1) maxrow(x1(:,4),x1),data,G)
% return the row of full_data corresponding to the index of the maximum of sort_vec
function x = maxrow(sort_vec,full_data)
[~,idx] = max(A);
x = B(idx,:);
end

카테고리

Help CenterFile Exchange에서 Logical에 대해 자세히 알아보기

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by