How do you get multiple minimum values in a column based on grouping and then isolate the corresponding rows for the whole table?

Question

0 개 추천

data.mat

I've got a list of ages in a table, sorted, and grouped by 'names'. Something like this:

perName=findgroups(data.names);
b = splitapply(@(x1){sort(x1)}, data.age, perName);

How do I select the youngest 10 ages (smallest) for every name group (or largest for that matter)? Do I need to build a custom function then reference that function in the splitapply command?

Normally, I can sort the column and just trim out (or isolate) the top several using something like this (for example):

data(1:15,:) = [];

However, I've spent a long time without any luck to trim/isolate based on grouping. It was easy to get the min value using @min funciton, but I'm not sure how to get 'multiple' min or max values for each group

Also, once I've managed to find the top ten, how do I also isolate the corresponding rows in the table for different columns?

Any help is much appreciated!

(edit: I've attached the matlab table for reference. Note this is just made up 'dummy' data for practice. Once I figure it out I have to apply to a much larger dataset.)

댓글 수: 2
없음 표시 없음 숨기기

Image Analyst 2020년 7월 28일

You forgot to attach your data so I don't think anybody is going to try anything until that happens.

Joshua Murray 2020년 7월 28일

Sure, I've editied the original post and attached the matlab table.

댓글을 달려면 로그인하십시오.

이 질문에 답변하려면 로그인하십시오.

Follow Question

Answer 1

dpb 2020년 7월 28일

편집: dpb 2020년 7월 29일

MATLAB Online에서 열기

0 개 추천

[g,ig]=findgroups(data.names);
b=splitapply(@(x) {maxk(x,10)},data.age,g);

b contains the k oldest elements for each group...you'll have to use an m-file function to return the optional second index from maxk to locate the position of the N individuals in the group.

>> [cellstr(ig) b]
ans =
  7×2 cell array
    {'Ben'  }    {10×1 double}
    {'Billy'}    { 4×1 double}
    {'Happy'}    { 4×1 double}
    {'Jenny'}    {10×1 double}
    {'Joe'  }    { 8×1 double}
    {'Kate' }    {10×1 double}
    {'Smith'}    {10×1 double}
>> 

ADDENDUM: The indexing would look something like--

function [mx,ix]=getmaxk(x,n)
% wrapper for maxk cuz can't return second output w/ anonymous function
  [mx,ix]=maxk(x,n);
  mx={mx};
  ix={ix};
end
fnmaxk=@(x) getmaxk(x,10);
[b,ib]=splitapply(fnmaxk,data.age,g);

ADDENDUM SECOND:

OK, it's not so bad after all -- have to do it explicitly right now; seems like a good candidate for an enhancement to splitapply() to have available as an auxiliary variable.

Start with

function sq=getmaxkgrp(x,v)
% returns the group variable sequence associated with the n maxk values of x
  n=10;
  [mx,ix]=maxk(x,n);
  sq=v(ix);
end

which has the 10 value hardcoded; you can use the indirect method illustrated above to be able to change the number for which you're looking.

Then, as suggested, augment the table with

data.seq=[1:height(data)].';        % add the sequence number in table of each row
ix=splitapply(@(x1,x2) {getmaxkgrp(x1,x2)},data.age,data.seq,g);  % and return that seq number for the top n
tmaxk=(cellfun(@(ix) data(ix,1:end-1),ix,'uni',0));  % retrieve those records from full table
tmaxk=vertcat(tmaxk{:});                             % and combine into a table

Above results in--

>> tmaxk =
  56×3 table
    names    age    pet 
    _____    ___    ____
    Ben      31     fish
    Ben      31     bird
    Ben      31     bird
    Ben      31     bird
    Ben      31     fish
    Ben      21     dog 
    Ben      21     fish
    Ben      21     dog 
    Ben      21     fish
    Ben      17     dog 
    Billy    30     fish
    Billy    30     bird
    Billy    30     fish
    Billy    30     bird
    Happy    31     bird
    Happy    31     cat 
    Happy    31     bird
    Happy    31     cat 
    Jenny    30     cat 
    Jenny    30     dog 
    Jenny    30     dog 
    Jenny    30     fish
    Jenny    30     dog 
    Jenny    30     fish
    Jenny    30     dog 
    Jenny    30     fish
    Jenny    30     cat 
    Jenny    30     dog 
    Joe      79     dog 
    Joe      79     dog 
    Joe      21     dog 
    Joe      20     cat 
    Joe      19     cat 
    Joe      18     bird
    Joe      15     cat 
    Joe      15     cat 
    Kate     60     fish
    Kate     60     fish
    Kate     60     fish
    Kate     51     fish
    Kate     51     fish
    Kate     51     fish
    Kate     51     fish
    Kate     50     bird
    Kate     50     dog 
    Kate     50     bird
    Smith    38     fish
    Smith    38     bird
    Smith    38     fish
    Smith    38     bird
    Smith    38     bird
    Smith    38     fish
    Smith    30     bird
    Smith    29     bird
    Smith    29     cat 
    Smith    29     cat 
>> 

I didn't test it extensively but I believe it will have returned the correct values. Using a smaller value for N would make that part much easier; left as "Exercise for Student!" :)

댓글 수: 4
이전 댓글 2개 표시 이전 댓글 2개 숨기기

Joshua Murray 2020년 7월 29일

편집: Joshua Murray 2020년 7월 29일

MATLAB Online에서 열기

data.mat

Hi dpb, thanks for your reply! I'm trying also to figure how to put the results in a table. If I had extra columns, such as for 'pet', how would I collect them and add them for each of the corresponding names and top ten ages (see updated data.mat file attached with third column for 'pet')? So ideally I might get something like this with only the top 'k' results:

       names       age      pet  
    ___________    ___    _______
    Ben             30    "bird "
    Ben             29    "dog"  
    Ben             20    "cat"  
    %etc            %etc  %etc   
    Kate            50    "dog"  
    Kate            40    "dog"  
    Kate            39    "bird "
    %etc            %etc  %etc
% In the end result I basically want the original data table 
% with all the rows other than those containing the top ten ages (for e.g.)
% to be removed.
% and I only want the relevent top results

What is purpose of the 'ib' in your function above (ie what is it?) and what do you mean by "optional second index". I thought it might reference the cell number in the original data table but it doesn't appear to do so when I'm running the wrapper function. (sorry if this is a basic question).

dpb 2020년 7월 29일

I didn't spearmint (for doublemint, either, for that matter! :) ) to see; it's not too much of a hassle to build the extra variable and it has the advantage of being a numeric variable and so usable as index as is. One MIGHT be able to make use of the internal 'rownames' property to avoid having to create another table variable.

But, it may not be passable as a parameter, either, so maybe not...above seems the most direct route available at the moment.

Joshua Murray 2020년 7월 30일

Thanks! I've made some minor edits to suit the actual data and its working quiet well to find the min/max values. I was resorting to doing this manually (over 1000times) as I hadn't figure anything out yet after thinking about it for half a week... so this has saved me allot of time- don't have much experience with Matlab & similar handling large data. Cheers.

댓글을 달려면 로그인하십시오.

How do you get multiple minimum values in a column based on grouping and then isolate the corresponding rows for the whole table?

댓글 수: 2
없음 표시 없음 숨기기

채택된 답변

댓글 수: 4
이전 댓글 2개 표시 이전 댓글 2개 숨기기

추가 답변 (0개)

카테고리

태그

Community Treasure Hunt

How do you get multiple minimum values in a column based on grouping and then isolate the corresponding rows for the whole table?

댓글 수: 2 없음 표시 없음 숨기기

채택된 답변

댓글 수: 4 이전 댓글 2개 표시 이전 댓글 2개 숨기기

추가 답변 (0개)

카테고리

태그

참고 항목

Community Treasure Hunt

댓글 수: 2
없음 표시 없음 숨기기

댓글 수: 4
이전 댓글 2개 표시 이전 댓글 2개 숨기기