Merging tables using outerjoin doesn't work

Question

0 개 추천

Hi all,

I have two tables which I would like to merge/join. These tables have a series of shared columns and then one or more columns may exist in only one table.

I know I can use outerjoin to achieve what I want, like this:

a = rand(10,1);
b = rand(10,1);
c = rand(10,1);
d = rand(10,1);
e = rand(10,1);
f = repmat(mat2cell(a,10,1),10,1);
t1 = table(ones(10,1),ones(10,1),ones(10,1),a,b,c,d,e,f)
t2 = table(ones(10,1),ones(10,1),ones(10,1).*2,a,b,c)
t3 = outerjoin(t1,t2,'MergeKeys',true)

The problem is that this doesn't work on the tables I am using and I'm not sure why. I have attached two tables here as a sample. They have the exact same columns, but different rows, so the task is actually simpler than the above example. However, when I use outerjoin it makes new variables named after the tables instead of combining them vertically, and the resulting table has 2 rows but twice as many columns as the inputs:

t3 = outerjoin(t1,t2,'Keys',[1:3],'MergeKeys',true)

Is there some feature of these tables that I'm missing which prevents them from being joined in the way I want?

Thanks for any help,

R.

댓글 수: 5
이전 댓글 3개 표시 이전 댓글 3개 숨기기

Neuropragmatist 2019년 4월 17일

MATLAB Online에서 열기

Hi Peter,

Here is a more replicable example. This code:

a = rand(3,1);
b = rand(3,1);
c = rand(3,1);
d = rand(3,1);
e = rand(3,1);
f = repmat(mat2cell(a,3,1),3,1);
t1 = table(ones(3,1),a,b,c)
t2 = table(ones(3,1).*2,a,b)
t3 = outerjoin(t1,t2,'MergeKeys',true)

Produces this output:

t1 =
  3×4 table
    Var1            a                    b                    c        
    ____    _________________    _________________    _________________
     1      0.212268049890771    0.144940183355874    0.786990424482484
     1       0.51189873810689    0.628664443166539    0.515958765112436
     1      0.650024454445417    0.441312076439047    0.343697121302907
t2 =
  3×3 table
    Var1            a                    b        
    ____    _________________    _________________
     2      0.212268049890771    0.144940183355874
     2       0.51189873810689    0.628664443166539
     2      0.650024454445417    0.441312076439047
t3 =
  6×4 table
    Var1            a                    b                    c        
    ____    _________________    _________________    _________________
     1      0.212268049890771    0.144940183355874    0.786990424482484
     1       0.51189873810689    0.628664443166539    0.515958765112436
     1      0.650024454445417    0.441312076439047    0.343697121302907
     2      0.212268049890771    0.144940183355874                  NaN
     2       0.51189873810689    0.628664443166539                  NaN
     2      0.650024454445417    0.441312076439047                  NaN

So table t3 is the vertical concatenation of t1 and t2, but where one of the tables was missing a column (column c in t2), this column has been filled with NaNs.

I'm trying to do the same thing with two tables containing more complex/varied data, but outerjoin works differently. For instance:

a = rand(3,1);
b = repmat({a},3,1);
c = repmat({'hello'},3,1);
d = repmat({rand(10,10)},3,1);
t1 = table(ones(3,1),a,b,c,d)
t2 = table(ones(3,1).*2,a,b)
t3 = outerjoin(t1,t2,'Keys',1,'MergeKeys',true)

produces this output:

t1 =
  3×5 table
    Var1            a                 b             c             d       
    ____    _________________    ____________    _______    ______________
     1      0.749630891182043    [3×1 double]    'hello'    [10×10 double]
     1      0.175095755248745    [3×1 double]    'hello'    [10×10 double]
     1      0.172219045349798    [3×1 double]    'hello'    [10×10 double]
t2 =
  3×3 table
    Var1            a                 b      
    ____    _________________    ____________
     2      0.749630891182043    [3×1 double]
     2      0.175095755248745    [3×1 double]
     2      0.172219045349798    [3×1 double]
t3 =
  6×7 table
    Var1          a_t1               b_t1           c             d                 a_t2               b_t2    
    ____    _________________    ____________    _______    ______________    _________________    ____________
     1      0.749630891182043    [3×1 double]    'hello'    [10×10 double]                  NaN    []          
     1      0.175095755248745    [3×1 double]    'hello'    [10×10 double]                  NaN    []          
     1      0.172219045349798    [3×1 double]    'hello'    [10×10 double]                  NaN    []          
     2                    NaN    []              ''         []                0.749630891182043    [3×1 double]
     2                    NaN    []              ''         []                0.175095755248745    [3×1 double]
     2                    NaN    []              ''         []                0.172219045349798    [3×1 double]

So now t3 is the vertical concatenation of t1 and t2, but some of the columns are duplicated and renamed instead of being filled with NaNs etc. For instance, in the first example we had one column a in the final table, but in the second example we have two columns, a_t1 and a_t2.

I'm guessing this is because the columns contain different data types, and Matlab can't know what to fill the empty spaces with? but that seems really weird because Matlab has correctly filled the empty spaces (i.e. bottom half of a_t1).

How can I join the tables in the second example to have non-duplicated columns as in the first example?

Thanks for any help,

R.

Neuropragmatist 2019년 4월 17일

That's what I am doing now, I just made a small utility to fill the missing columns as I go. Then I can just concatenate them at the end.

But how come the first example works and does what I want, but the second doesn't? Is it really just the type of data contained in the tables? I would like to know for the future.

I also don't understand why there is not a function that does do what I want specifically - it seems that Matlab can determine the correct data to fill the excess space with, so it should be able to vertically concatenate based on columns, or at least it seems trivial to get to that from outerjoin.

Thanks,

R.

Peter Perkins 2019년 4월 18일

편집: Peter Perkins 2019년 4월 18일

MATLAB Online에서 열기

Again, outerjoin is not intended to do vertical concatenation. It matches up rows by key values and horizontally concatenates. It is not the tool you are looking for. I encourage you to read the documentation for all three join functions.

It's possible to hack outerjoin to do vertical concatenation in some sense, and you have done that either intentionally or unintentionally, I can't tell. The reason why your first example "works" is because you don't specify keys, so outerjoin uses all the matching var names, and your a and b vars happen to have the same values, and you merge keys. It happens to look like the right answer; that is completely by happenstance.

Your second example specifies the one key, and so outerjoin will horzcat the remianing vars: a,b,c,d from the first with a,b from the second. Duplicate names, so it adds suffixes. That's actually closer to a typical use of outerjoin, except that none of your key values match, and if no keys match, there's not any point in joining.

To do what you want, either add vars to make the two tables vertcat-able, or assign one off the end of the other.

>> t1 = array2table(rand(2,3))
t1 =
  2×3 table
     Var1       Var2       Var3  
    _______    _______    _______
    0.81472    0.12699    0.63236
    0.90579    0.91338    0.09754
>> t2 = array2table(rand(2,4))
t2 =
  2×4 table
     Var1       Var2       Var3       Var4  
    _______    _______    _______    _______
     0.2785    0.95751    0.15761    0.95717
    0.54688    0.96489    0.97059    0.48538
>> t2(3:4,t1.Properties.VariableNames) = t1
Warning: The assignment added rows to the table, but did not assign values to all of the table's existing variables. Those variables are
extended with rows containing default values. 
> In tabular/subsasgnParens (line 427)
  In tabular/subsasgn (line 64) 
t2 =
  4×4 table
     Var1       Var2       Var3       Var4  
    _______    _______    _______    _______
     0.2785    0.95751    0.15761    0.95717
    0.54688    0.96489    0.97059    0.48538
    0.81472    0.12699    0.63236          0
    0.90579    0.91338    0.09754          0
    

You will notice that that has padded Var4 with zeros; as with all numeric arrays anywhere in MATLAB, they are extended with zeros. So you'd have to assign NaN to those elements.

댓글을 달려면 로그인하십시오.

이 질문에 답변하려면 로그인하십시오.

Follow Question