Finding groups based on matching multiple values in a column
조회 수: 44 (최근 30일)
이전 댓글 표시
I'd like to find groups in my table based on the values of multiple columnns- but I'd like to have the groups allow multiple specific values in one of the columns.
For the "I'd like to find groups in my table based on the values of multiple columnns" part, I can do that:
>> T=table({'Home Depot';'Lowes';'Home Depot';'Menards';'Menards';'Home Depot'},{'USA';'China';'Canada';'France';'USA';'Canada'},{'Smith';'Johnson';'Jones';'Miller';'Williams';'Brown'},[123;456;789;1010;1112;1314],'VariableNames',{'Store' 'Country' 'Manager' 'Revenue'});
T =
6×4 table
Store Country Manager Revenue
______________ __________ ____________ _______
{'Home Depot'} {'USA' } {'Smith' } 123
{'Lowes' } {'China' } {'Johnson' } 456
{'Home Depot'} {'Canada'} {'Jones' } 789
{'Menards' } {'France'} {'Miller' } 1010
{'Menards' } {'USA' } {'Williams'} 1112
{'Home Depot'} {'Canada'} {'Brown' } 1314
>> [G,~,idx]=unique(T(:,[1 2]),'stable') %want 'stable' option so not using findgroups
G =
5×2 table
Store Country
______________ __________
{'Home Depot'} {'USA' }
{'Lowes' } {'China' }
{'Home Depot'} {'Canada'}
{'Menards' } {'France'}
{'Menards' } {'USA' }
idx =
1
2
3
4
5
3
But what I'd like to do is introduce an 'or' grouping rule to designate the region that Country is in. In this case, I want to group the Country by region as well. I can also do that... though not sure my converting to cell method is the best way, but it works.
>> NA=find(ismember(table2cell(T(:,2)),{'USA','Canada','Mexico'}))
NA =
1
3
5
6
But ultimately, I want to do this at the same time- show groups that match both the Store as well as the Region. I am not sure the best way to go about this- my actual table is very large. I thought of creating a new varibale for Region and then match off that- is that the way to go or can I combine these into one sort? I was also considering looping methods, but not sure the most efficient way to proceed. My desired result would be:
T =
6×4 table
Store Country Manager Revenue
______________ __________ ____________ _______
{'Home Depot'} {'USA' } {'Smith' } 123
{'Lowes' } {'China' } {'Johnson' } 456
{'Home Depot'} {'Canada'} {'Jones' } 789
{'Menards' } {'France'} {'Miller' } 1010
{'Menards' } {'USA' } {'Williams'} 1112
{'Home Depot'} {'Canada'} {'Brown' } 1314
idx =
1
2
1
3
4
1
Thanks for any insight.
댓글 수: 3
채택된 답변
Adam Danz
2021년 2월 6일
편집: Adam Danz
2021년 2월 6일
I recommend adding the region column to your tables. I doubt it will have an impact on computation time relative to using a separate column and it will keep the data tidy.
I prefer to work with catgegories over string when working with categorical data.
T=table(categorical({'Home Depot';'Lowes';'Home Depot';'Menards';'Menards';'Menards'}),...
categorical({'ISS';'Mexico';'Canada';'France';'USA';'Germany'}),...
categorical({'Smith';'Johnson';'Jones';'Miller';'Williams';'Brown'}),...
[123;456;789;1010;1112;1314],...
'VariableNames',{'Store' 'Country' 'Manager' 'Revenue'}) % modified example
Add region column for regions (assuming you have a key already)
% Column 1 is region name
% Column 2 is a list of countries in the region
regions = {
categorical({'North America'}), categorical({'Canada', 'USA', 'Mexico'});
categorical({'Europe'}), categorical({'France', 'Germany', 'Bulgaria'});
};
T.Region = repmat(categorical({'NA'}), height(T),1); % Default is NA
for i = 1:size(regions,1);
T.Region(ismember(T.Country, regions{i,2})) = regions{i,1};
end
disp(T)
Summary of Stores by region using
Tsummary = groupsummary(T,["Store","Region"],["sum","mean"], "Revenue")
Time test
I replicated the demo table many times to create a single table with 184,320 rows and times this solution 1000 times using tic/toc (including the construction of the regions array but not including the disp()). The median of the distribution of durations was 0.07 seconds.
댓글 수: 2
Adam Danz
2021년 2월 6일
> I believe I can use setoptions to make the appropriate variables categorical.
Perhaps (I haven't looked into it) but you could also convert from char to categorical the way I did.
> That may or may not save me time but it certianly seems nicer.
I think it will save you time.
> what is the reson for the (admittedly small) distribution
There will always be variation. Here are some reasons.
- Random fluxuation
- JIT compilation (see this answer)
- Other stuff my computer is doing that may limit resources.
- Other stuff I don't know about :D
추가 답변 (0개)
참고 항목
카테고리
Help Center 및 File Exchange에서 Data Preprocessing에 대해 자세히 알아보기
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!