How do you get multiple minimum values in a column based on grouping and then isolate the corresponding rows for the whole table?

I've got a list of ages in a table, sorted, and grouped by 'names'. Something like this:
perName=findgroups(data.names);
b = splitapply(@(x1){sort(x1)}, data.age, perName);
How do I select the youngest 10 ages (smallest) for every name group (or largest for that matter)? Do I need to build a custom function then reference that function in the splitapply command?
Normally, I can sort the column and just trim out (or isolate) the top several using something like this (for example):
data(1:15,:) = [];
However, I've spent a long time without any luck to trim/isolate based on grouping. It was easy to get the min value using @min funciton, but I'm not sure how to get 'multiple' min or max values for each group
Also, once I've managed to find the top ten, how do I also isolate the corresponding rows in the table for different columns?
Any help is much appreciated!
(edit: I've attached the matlab table for reference. Note this is just made up 'dummy' data for practice. Once I figure it out I have to apply to a much larger dataset.)

댓글 수: 2

You forgot to attach your data so I don't think anybody is going to try anything until that happens.
Sure, I've editied the original post and attached the matlab table.

댓글을 달려면 로그인하십시오.

 채택된 답변

dpb
dpb 2020년 7월 28일
편집: dpb 2020년 7월 29일
[g,ig]=findgroups(data.names);
b=splitapply(@(x) {maxk(x,10)},data.age,g);
b contains the k oldest elements for each group...you'll have to use an m-file function to return the optional second index from maxk to locate the position of the N individuals in the group.
>> [cellstr(ig) b]
ans =
7×2 cell array
{'Ben' } {10×1 double}
{'Billy'} { 4×1 double}
{'Happy'} { 4×1 double}
{'Jenny'} {10×1 double}
{'Joe' } { 8×1 double}
{'Kate' } {10×1 double}
{'Smith'} {10×1 double}
>>
ADDENDUM: The indexing would look something like--
function [mx,ix]=getmaxk(x,n)
% wrapper for maxk cuz can't return second output w/ anonymous function
[mx,ix]=maxk(x,n);
mx={mx};
ix={ix};
end
fnmaxk=@(x) getmaxk(x,10);
[b,ib]=splitapply(fnmaxk,data.age,g);
ADDENDUM SECOND:
OK, it's not so bad after all -- have to do it explicitly right now; seems like a good candidate for an enhancement to splitapply() to have available as an auxiliary variable.
Start with
function sq=getmaxkgrp(x,v)
% returns the group variable sequence associated with the n maxk values of x
n=10;
[mx,ix]=maxk(x,n);
sq=v(ix);
end
which has the 10 value hardcoded; you can use the indirect method illustrated above to be able to change the number for which you're looking.
Then, as suggested, augment the table with
data.seq=[1:height(data)].'; % add the sequence number in table of each row
ix=splitapply(@(x1,x2) {getmaxkgrp(x1,x2)},data.age,data.seq,g); % and return that seq number for the top n
tmaxk=(cellfun(@(ix) data(ix,1:end-1),ix,'uni',0)); % retrieve those records from full table
tmaxk=vertcat(tmaxk{:}); % and combine into a table
Above results in--
>> tmaxk =
56×3 table
names age pet
_____ ___ ____
Ben 31 fish
Ben 31 bird
Ben 31 bird
Ben 31 bird
Ben 31 fish
Ben 21 dog
Ben 21 fish
Ben 21 dog
Ben 21 fish
Ben 17 dog
Billy 30 fish
Billy 30 bird
Billy 30 fish
Billy 30 bird
Happy 31 bird
Happy 31 cat
Happy 31 bird
Happy 31 cat
Jenny 30 cat
Jenny 30 dog
Jenny 30 dog
Jenny 30 fish
Jenny 30 dog
Jenny 30 fish
Jenny 30 dog
Jenny 30 fish
Jenny 30 cat
Jenny 30 dog
Joe 79 dog
Joe 79 dog
Joe 21 dog
Joe 20 cat
Joe 19 cat
Joe 18 bird
Joe 15 cat
Joe 15 cat
Kate 60 fish
Kate 60 fish
Kate 60 fish
Kate 51 fish
Kate 51 fish
Kate 51 fish
Kate 51 fish
Kate 50 bird
Kate 50 dog
Kate 50 bird
Smith 38 fish
Smith 38 bird
Smith 38 fish
Smith 38 bird
Smith 38 bird
Smith 38 fish
Smith 30 bird
Smith 29 bird
Smith 29 cat
Smith 29 cat
>>
I didn't test it extensively but I believe it will have returned the correct values. Using a smaller value for N would make that part much easier; left as "Exercise for Student!" :)

댓글 수: 4

Hi dpb, thanks for your reply! I'm trying also to figure how to put the results in a table. If I had extra columns, such as for 'pet', how would I collect them and add them for each of the corresponding names and top ten ages (see updated data.mat file attached with third column for 'pet')? So ideally I might get something like this with only the top 'k' results:
names age pet
___________ ___ _______
Ben 30 "bird "
Ben 29 "dog"
Ben 20 "cat"
%etc %etc %etc
Kate 50 "dog"
Kate 40 "dog"
Kate 39 "bird "
%etc %etc %etc
% In the end result I basically want the original data table
% with all the rows other than those containing the top ten ages (for e.g.)
% to be removed.
% and I only want the relevent top results
What is purpose of the 'ib' in your function above (ie what is it?) and what do you mean by "optional second index". I thought it might reference the cell number in the original data table but it doesn't appear to do so when I'm running the wrapper function. (sorry if this is a basic question).
Using the wrapper function fnmaxk allows splitapply to return multiple outputs; the second optional argument from maxk is the index of the locations associated with the values.
That's what the ib array is in the example code above; to just output the values alone don't need it so I just used the maxk function by itself in the other example.
I think you'll have to wrap all the variables into the function to do at one time; it appears to be a hole afaict that there isn't a provided link between the positions returned by the lookup function into the group positions back to the position in the original table--that information is not available to the functions inside splitapply because it is just passed the group of data for the specific group, each in turn.
I'd have to think about it; somehow one should be able to return the grouping variable indices -- probably need to build it into the table and operate on it as one of the variables is the thought that comes to mind.
I didn't spearmint (for doublemint, either, for that matter! :) ) to see; it's not too much of a hassle to build the extra variable and it has the advantage of being a numeric variable and so usable as index as is. One MIGHT be able to make use of the internal 'rownames' property to avoid having to create another table variable.
But, it may not be passable as a parameter, either, so maybe not...above seems the most direct route available at the moment.
Thanks! I've made some minor edits to suit the actual data and its working quiet well to find the min/max values. I was resorting to doing this manually (over 1000times) as I hadn't figure anything out yet after thinking about it for half a week... so this has saved me allot of time- don't have much experience with Matlab & similar handling large data. Cheers.

댓글을 달려면 로그인하십시오.

추가 답변 (0개)

카테고리

도움말 센터File Exchange에서 2-D and 3-D Plots에 대해 자세히 알아보기

질문:

2020년 7월 28일

댓글:

2020년 7월 30일

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by