use unstack to reshape table with dummy variable (edited: alternative crosstab method)

Question

Votes: 0

After sorting a table by a group variable, I would like to put the group value in a 'header' column and its members in the other columns of the same row.

Data for the sake of demonstration:

C = {'q1', 'q1', 'q2', 'q3', 'q3', 'q3';
    'apple', 'appl', 'banana', 'orange', 'orang', 'orange'}';
T = cell2table(C, 'VariableNames', {'code', 'fruit'});
[GroupsID, Groups] = findgroups(T.code);
unique_groupID = unique(GroupsID);
gT = table('Size', [10,4], 'VariableTypes', {'string', 'string', 'string', 'string'});

Method 1 (edited). A brute-force for-loop that I don't like; its result needs more processing to remove duplicates within each row.

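for k = 1:numel(unique_groupID)
    % extract the 'fruit' entries that belong to group k
    tmp = T.fruit(GroupsID==unique_groupID(k));
    n = numel(tmp);          % number of fruits in this group
    gT(k,1) = {Groups(k)};   % group label goes in the first column
    gT(k,2:n+1) = tmp';      % members fill the following columns
end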
rmmissing(gT, "MinNumMissing", 3)
ans = 3×4 table
    Var1      Var2        Var3         Var4   
    ____    ________    _________    _________

    "q1"    "apple"     "appl"       <missing>
    "q2"    "banana"    <missing>    <missing>
    "q3"    "orange"    "orang"      "orange" 

Method 2 using unstack

I created a dummy variable for this method in order to use unstack( ). The code is shorter but doesn't give me the result I want.

D = {'dm1', 'dm2', 'dm3', 'dm4', 'dm5', 'dm6';
    'q1', 'q1', 'q2', 'q3', 'q3', 'q3';
    'apple', 'appl', 'banana', 'orange', 'orang', 'orange'}';
T = cell2table(D, 'VariableNames', {'dummy', 'code', 'fruit'});
unstack(T, "fruit", "dummy")
ans = 3×7 table
     code        dm1           dm2           dm3           dm4           dm5           dm6    
    ______    __________    __________    __________    __________    __________    __________

    {'q1'}    {'apple' }    {'appl'  }    {0×0 char}    {0×0 char}    {0×0 char}    {0×0 char}
    {'q2'}    {0×0 char}    {0×0 char}    {'banana'}    {0×0 char}    {0×0 char}    {0×0 char}
    {'q3'}    {0×0 char}    {0×0 char}    {0×0 char}    {'orange'}    {'orang' }    {'orange'}

Edited. Method 3 using crosstab. This method works nicely, but I wish I didn't have to use a for-loop. The result from this method is exactly what I want.

[tb,~,~,lbs] = crosstab(T.code, T.fruit);

For-loop to create the intended table:

m = size(tb,1);
header = lbs(1:m,1);
fruits = lbs(:,2);
gT = table('Size',[6,4], 'VariableTypes', {'string','string','string','string'});
for i=1:m
    tmp = fruits(tb(i,:)>0)';
    l = size(tmp,2);
    gT(i,"Var1") = header(i);
    gT(i, 2:l+1) = tmp;
end
rmmissing(gT, "MinNumMissing",4)
ans = 3×4 table
    Var1      Var2        Var3         Var4   
    ____    ________    _________    _________

    "q1"    "apple"     "appl"       <missing>
    "q2"    "banana"    <missing>    <missing>
    "q3"    "orange"    "orang"      <missing>

Edited. After posting the code above, I realized that Method 3 might be made leaner.

ix = find(tb>0);
[rows,cols] = ind2sub(size(tb),ix);
% then for-loop through rows and cols to populate the final table.
% I still can't avoid for-loop.
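
For comparison, here is a minimal loop-free sketch of the table described in this question. It assumes the two-column T from the demonstration data at the top and uses grouptransform (R2018b or later) to number the fruits within each code. The names U, pos and out, and the out1, out2 column names that array2table generates, are illustrative and not part of the original code.

```matlab
% Demonstration data from the question (two-column version of T)
C = {'q1','q1','q2','q3','q3','q3';
     'apple','appl','banana','orange','orang','orange'}';
T = cell2table(C, 'VariableNames', {'code','fruit'});

U = unique(T, 'rows');                    % drop duplicate code/fruit pairs (sorted)
[G, codes] = findgroups(U.code);          % group number per row, one label per group
pos = grouptransform(ones(size(G)), G, @cumsum);   % 1,2,... within each group

% Start from an all-missing string array and fill it by (group, position)
out = repmat(string(missing), numel(codes), max(pos));
out(sub2ind(size(out), G, pos)) = string(U.fruit);

gT = [table(string(codes), 'VariableNames', {'code'}), array2table(out)]
```

Because unique sorts the rows, the fruits within each code appear in alphabetical order rather than in order of first appearance.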

Comments

the cyclist, 1 April 2023

I'm confused about the organizing principle of the output you want. Is the following correct?

Currently, you have one row in your table for each code/fruit pair.
Instead, you want one row for each code
The columns are "1st fruit", "2nd fruit", "3rd fruit"
In any row, you want the list of the fruits for that code
If the code doesn't have 3 fruits, have an empty entry in the table

Also, your Solution 1 and Solution 3 are different, but you say they are both correct. That's confusing to me.

If this is all correct, I'd be curious what your downstream step is. Your data are currently stored in what is known as tidy format, and that is almost always better for analysis.

Simon, 1 April 2023

Thanks for your quick response. The result from Solution 1 needs more processing to remove redundancy in each row. Solution 3 has the correct result.

My real data have hundreds of thousands of rows. The first column stores 'account codes', the second column 'account definition', and the third column financial numerical values. For example, '1XXXXX' is the code, and 'Assets' is the account definition. These two should have a perfect 1-1 relationship. However, owing to human error, the actual entry for 'account definition' for a given account code might vary slightly. For example, 'Assets' may be keyed in as 'Assest' or 'Aset'.

The downstream step is for me to visually check whether there is anything odd about the 'account codes'-'account definition' pairs. There are only around a hundred unique 'account codes', which is more manageable for human inspection than the original super-tall table.
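
The inspection step described here can be narrowed before any reshaping by listing only the codes that map to more than one distinct definition. A minimal sketch, assuming a table with cellstr variables named code and definition (both names, and the sample values, are made up for illustration):

```matlab
% Hypothetical sample data: one code has two spellings of its definition
code       = {'1XXXXX'; '1XXXXX'; '2XXXXX'; '1XXXXX'};
definition = {'Assets'; 'Assest'; 'Liabilities'; 'Assets'};
T = table(code, definition);

U = unique(T, 'rows');                        % distinct code/definition pairs
[G, codes] = findgroups(U.code);              % one group per account code
nDefs = splitapply(@numel, U.definition, G);  % distinct definitions per code
suspect = codes(nDefs > 1)                    % codes that break the 1-1 relationship
```

Only the rows of U whose code appears in suspect would then need to be reshaped and inspected by eye.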


Answer 1

Stephen23, 1 April 2023 (edited 27 April 2023)

Votes: 0

Using UNSTACK is quite a neat solution because it will automatically pad different-length data to the same number of columns, adding "missing" values as required. This is otherwise fiddly to replicate. But to use UNSTACK, we need to add a variable that tells UNSTACK which columns to move the data into:

C = {'q1','q1','q2','q3','q3','q3';'apple','appl','banana','orange','orang','orange'}.';
T = cell2table(C, 'VariableNames', {'code','fruit'})
T = 6×2 table
     code       fruit   
    ______    __________

    {'q1'}    {'apple' }
    {'q1'}    {'appl'  }
    {'q2'}    {'banana'}
    {'q3'}    {'orange'}
    {'q3'}    {'orang' }
    {'q3'}    {'orange'}
U = unique(T,'rows');
G = findgroups(U.code);  % note1
F = @(n)(1:nnz(n==G)).'; % note1
U.count = cell2mat(arrayfun(F,unique(G),'uni',0)) % note1
U = 5×3 table
     code       fruit       count
    ______    __________    _____

    {'q1'}    {'appl'  }      1  
    {'q1'}    {'apple' }      2  
    {'q2'}    {'banana'}      1  
    {'q3'}    {'orang' }      1  
    {'q3'}    {'orange'}      2  
U = unstack(U,"fruit","count", "VariableNamingRule","modify")
U = 3×3 table
     code         x1            x2    
    ______    __________    __________

    {'q1'}    {'appl'  }    {'apple' }
    {'q2'}    {'banana'}    {0×0 char}
    {'q3'}    {'orang' }    {'orange'}

note1: This just gives a unique index to each element of a group. Astonishingly, there does not seem to be an easy built-in way to achieve this. Does anyone have any tips? For example: [1,1,1,2,2,1] -> [1,2,3,1,2,4].

EDIT: I found a neater way:

G = findgroups(U.code);
U.count = grouptransform(ones(size(G)),G,@cumsum);

Comments

Simon, 4 April 2023

Thanks for pointing out the solution. The code works fine now. Testing with the less ordered data below, I see why the 'stable' option causes the error: when U is not sorted, cell2mat applied to the cell array of counts puts the counts in the wrong groups.

C = {'q1', 'q3','q1', 'q2', 'q3', 'q3', 'q1', 'q3';
    'apple', 'orange','appl', 'banana', 'orang','orange', 'apple', 'oranges'}';
T = cell2table(C, 'VariableNames', {'code', 'fruit'});
U = unique(T,'rows','stable');
G = findgroups(U.code);
F = @(n)(1:nnz(n==G))';
count = arrayfun(F,unique(G),'UniformOutput',false);
% count = {1;2};{1};{1;2;3} is correct.
    {2×1 double}
    {[       1]}
    {3×1 double}
U.count = cell2mat(count)
U = 6×3 table
     code        fruit       count
    ______    ___________    _____

    {'q1'}    {'apple'  }      1  
    {'q3'}    {'orange' }      2  
    {'q1'}    {'appl'   }      1  
    {'q2'}    {'banana' }      1  
    {'q3'}    {'orang'  }      2  
    {'q3'}    {'oranges'}      3  
% when U is not sorted and cell2mat( ) does 
% what it's supposed to do, error occurs. 
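
The mismatch described in this comment can be reproduced on a bare group vector. The sketch below uses an assumed unsorted G (group numbers 1, 3, 1, 2, 3, 3, matching the code order in the table above) and compares the arrayfun/cell2mat counter with grouptransform, which writes each running count back to the row it came from. The names byGroup and inPlace are illustrative.

```matlab
G = [1;3;1;2;3;3];        % unsorted group numbers, as in the 'stable' case

% arrayfun + cell2mat stacks the per-group counters group by group,
% so the result only lines up with the rows of G when G is sorted
F = @(n)(1:nnz(n==G)).';
byGroup = cell2mat(arrayfun(F, unique(G), 'UniformOutput', false))

% grouptransform returns each counter at the original row positions
inPlace = grouptransform(ones(size(G)), G, @cumsum)
```

Here byGroup is [1;2;1;1;2;3], which reproduces the count column shown above, while inPlace is [1;1;2;1;2;3], so every (group, count) pair is unique.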

Stephen23, 6 April 2023

Here is another UNSTACK-based approach, generating the group indices using ACCUMARRAY:

C = {'q1','q1','q2','q3','q3','q3';'apple','appl','banana','orange','orang','orange'}.';
T = cell2table(C, 'VariableNames', {'code','fruit'})
T = 6×2 table
     code       fruit   
    ______    __________

    {'q1'}    {'apple' }
    {'q1'}    {'appl'  }
    {'q2'}    {'banana'}
    {'q3'}    {'orange'}
    {'q3'}    {'orang' }
    {'q3'}    {'orange'}
U = unique(T,'rows');
G = findgroups(U.code);
F = @(a){cumsum(a)};
U.count = cell2mat(accumarray(G,ones(size(G)),[],F))
U = 5×3 table
     code       fruit       count
    ______    __________    _____

    {'q1'}    {'appl'  }      1  
    {'q1'}    {'apple' }      2  
    {'q2'}    {'banana'}      1  
    {'q3'}    {'orang' }      1  
    {'q3'}    {'orange'}      2  
U = unstack(U,"fruit","count", "VariableNamingRule","modify")
U = 3×3 table
     code         x1            x2    
    ______    __________    __________

    {'q1'}    {'appl'  }    {'apple' }
    {'q2'}    {'banana'}    {0×0 char}
    {'q3'}    {'orang' }    {'orange'}

Simon, 9 April 2023 (edited 9 April 2023)

I have tried three algorithms on my real data, one of my own and the other two by Stephen23, and organized them into three functions. So it would be more convenient for anyone to use them as practical solution or as learning materials.

The data is a table with about 160,000 rows and 1600 unique 'ifcode' values. The elapsed times were:

crosstab_forloop: 1.81 seconds

unstack_arrayfun: 1.47 seconds

unstack_accumarray: 1.50 seconds

Here are the three functions:

% T is a table, with one column called 'ifcode', 
% the other column called 'account'.
function gT = crosstab_forloop(T)
[tb,~,~,lbs] = crosstab(T.ifcode, T.account);
m = size(tb,1);
header = lbs(1:m,1);
accounts = lbs(:,2);
gT = table('Size',[4000,4], 'VariableTypes', {'string','string','string','string'}, ...
    'VariableNames',{'ifcode', 'x1', 'x2', 'x3'});
for i=1:m
    tmp = accounts(tb(i,:)>0)';
    l = size(tmp,2);
    gT(i,"ifcode") = header(i);
    gT(i, 2:l+1) = tmp;
end
gT = rmmissing(gT, "MinNumMissing",4);
end
function U = unstack_arrayfun(T)
U = unique(T, 'rows');
G = findgroups(U.ifcode);
F = @(n)(1:nnz(n==G))'; 
U.withindex = cell2mat(arrayfun(F,unique(G),'UniformOutput',false)); % note1
U = unstack(U,"account","withindex","VariableNamingRule","modify");
% note1: This just gives a unique index to each element of a group.
% algorithm credit belongs to Stephen23
end
 
function U = unstack_accumarray(T)
U = unique(T,'rows');
G = findgroups(U.ifcode);
F = @(a){cumsum(a)};
U.withindex = cell2mat(accumarray(G,ones(size(G)),[],F));
U = unstack(U,"account","withindex","VariableNamingRule","modify");
% algorithm credit belongs to Stephen23
end

Stephen23, 10 April 2023 (edited 11 April 2023)

"So it would be more convenient for anyone to use them as practical solution or as learning materials."

Most likely the CELL2MAT slows them down... you never wrote that you needed a particularly fast approach, so I did not consider that in my code (instead aiming for "reasonably compact", which is what most users on this forum seem to want). For a "reasonably fast" approach try replacing CELL2MAT with a comma-separated list.

Something like this might be even faster:

C = {'q1','q1','q2','q3','q3','q3';'apple','appl','banana','orange','orang','orange'}.';
T = cell2table(C, 'VariableNames', {'code','fruit'})
T = 6×2 table
     code       fruit   
    ______    __________

    {'q1'}    {'apple' }
    {'q1'}    {'appl'  }
    {'q2'}    {'banana'}
    {'q3'}    {'orange'}
    {'q3'}    {'orang' }
    {'q3'}    {'orange'}
[U,~,X] = unique(T.code);
for k = 1:numel(U)
    V = unique(T{k==X,'fruit'});
    U(k,2:1+numel(V)) = V;
end
W = cell2table(U)
W = 3×3 table
      U1          U2             U3     
    ______    __________    ____________

    {'q1'}    {'appl'  }    {'apple'   }
    {'q2'}    {'banana'}    {0×0 double}
    {'q3'}    {'orang' }    {'orange'  }
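
The comma-separated-list replacement for CELL2MAT suggested in this comment can be sketched as follows. This is a minimal illustration on an assumed sorted group vector G, not the code that was timed in this thread; parts and count are illustrative names.

```matlab
G = [1;1;2;3;3];                         % assumed sorted group numbers
F = @(n)(1:nnz(n==G)).';                 % running index 1..count for group n
parts = arrayfun(F, unique(G), 'UniformOutput', false);

% parts{:} expands to a comma-separated list of the cell contents,
% so vertcat stacks the per-group column vectors without calling cell2mat
count = vertcat(parts{:});
```

For this input, count equals cell2mat(parts). Whether it is noticeably faster on large data would need to be measured.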

Simon, 13 April 2023

When run over my real data, this solution took an elapsed time of 3.67 seconds. Your vectorized code has both elegance and speed superiority over this for-loop algorithm. Probably the unique( ) call inside the for-loop slows things down, or maybe it is the segmented array assignment (U(k,2:...) = V). In my experience, assigning values to a segment of an array, cell, or table causes sluggishness.

Many thanks for it nevertheless.

Stephen23, 27 April 2023 (edited 27 April 2023)

I thought of another approach based on GROUPTRANSFORM:

https://www.mathworks.com/help/matlab/ref/double.grouptransform.html

As mentioned in my answer, the desired transformation is [1,1,1,2,2,1] -> [1,2,3,1,2,4] .

G = [1;1;1;2;2;1]; % must be column vector
Y = grouptransform(ones(size(G)),G,@cumsum)
Y = 6×1
     1
     2
     3
     1
     2
     4

Nice, it seems to work as we want. However, in this case G luckily consists of the integers 1..N. In all other cases we need to use e.g. FINDGROUPS first. Let's try it with the fake data that I used in my answer:

C = {'q1','q1','q2','q3','q3','q3';'apple','appl','banana','orange','orang','orange'}.';
T = cell2table(C, 'VariableNames', {'code','fruit'})
T = 6×2 table
     code       fruit   
    ______    __________

    {'q1'}    {'apple' }
    {'q1'}    {'appl'  }
    {'q2'}    {'banana'}
    {'q3'}    {'orange'}
    {'q3'}    {'orang' }
    {'q3'}    {'orange'}
U = unique(T,'rows');
G = findgroups(U.code);
U.count = grouptransform(ones(size(G)),G,@cumsum)
U = 5×3 table
     code       fruit       count
    ______    __________    _____

    {'q1'}    {'appl'  }      1  
    {'q1'}    {'apple' }      2  
    {'q2'}    {'banana'}      1  
    {'q3'}    {'orang' }      1  
    {'q3'}    {'orange'}      2  
U = unstack(U,"fruit","count", "VariableNamingRule","modify")
U = 3×3 table
     code         x1            x2    
    ______    __________    __________

    {'q1'}    {'appl'  }    {'apple' }
    {'q2'}    {'banana'}    {0×0 char}
    {'q3'}    {'orang' }    {'orange'}


Answer 2

Peter Perkins, 5 April 2023

Votes: 1

I'm having trouble understanding the desired output, but others have created what is essentially a crosstabulation of counts, so here is pivot, new in R2023a:

>> T = cell2table(C, 'VariableNames', {'code', 'fruit'});
>> pivot(T,Rows="code",Columns="fruit")
ans =
  3×7 table
     code     appl    apple    banana    orang    orange    oranges
    ______    ____    _____    ______    _____    ______    _______
    {'q1'}     1        2        0         0        0          0   
    {'q2'}     0        0        1         0        0          0   
    {'q3'}     0        0        0         1        2          1   

As cyclist points out, there are a bunch of empty bins, so the original "tidy" format may be more useful. To me, this looks like a case of "fruit ought to be categorical, and you ought to apply mergecats to clean up those typos/different spellings".

Comments

Peter Perkins, 6 April 2023 (edited 6 April 2023)

I'm not saying that this is definitely better, just that it's worth considering. One of the purposes of categorical is to make it simpler to clean up data like this.

C = {'q1', 'q3','q1', 'q2', 'q3', 'q3', 'q1', 'q3';
     'apple', 'orange','appl', 'banana', 'orang','orange', 'apple', 'oranges'}';
T = cell2table(C, 'VariableNames', {'code', 'fruit'})
T = 8×2 table
     code        fruit   
    ______    ___________

    {'q1'}    {'apple'  }
    {'q3'}    {'orange' }
    {'q1'}    {'appl'   }
    {'q2'}    {'banana' }
    {'q3'}    {'orang'  }
    {'q3'}    {'orange' }
    {'q1'}    {'apple'  }
    {'q3'}    {'oranges'}
T = convertvars(T,["code" "fruit"],"categorical")
T = 8×2 table
    code     fruit 
    ____    _______

     q1     apple  
     q3     orange 
     q1     appl   
     q2     banana 
     q3     orang  
     q3     orange 
     q1     apple  
     q3     oranges
categories(T.fruit)
ans = 6×1 cell array
    {'appl'   }
    {'apple'  }
    {'banana' }
    {'orang'  }
    {'orange' }
    {'oranges'}
T.fruit = mergecats(T.fruit,["apple" "appl"]);
T.fruit = mergecats(T.fruit,["orange" "orang" "oranges"]);
T
T = 8×2 table
    code    fruit 
    ____    ______

     q1     apple 
     q3     orange
     q1     apple 
     q2     banana
     q3     orange
     q3     orange
     q1     apple 
     q3     orange
categories(T.fruit)
ans = 3×1 cell array
    {'apple' }
    {'banana'}
    {'orange'}
pivot(T,Rows="code",Columns="fruit")
ans = 3×4 table
    code    apple    banana    orange
    ____    _____    ______    ______

     q1       3        0         0   
     q2       0        1         0   
     q3       0        0         4   

Simon, 11 April 2023

Hi Peter, I think I agree with you. I did try categorical, with something like categorical(A) .* categorical(B). I had a hunch that it would work. Then the thought of using unstack( ) took hold of me, so I started pursuing that idea. I had used stack( ) and unstack( ) to solve other problems and really enjoyed using them. Then I ran into a wall, so I posted the problem here. Thankfully, a very neat solution built on unstack( ) was soon provided by Stephen23.

Simon, 13 May 2023

@Stephen23

> EDIT: I found a neater way:
> G = findgroups(U.code);
> U.count = grouptransform(ones(size(G)),G,@cumsum);

Sorry for the late response; I was overwhelmed by other things. This is indeed a very neat solution. I used to think grouptransform( ) was quite limited in what it could do, but when it is put to work on a dummy/extra variable, it can be quite versatile in problem solving.
