Unstack aggegate function and preset fill value do not match

Question

0 개 추천

Hello,

I'm trying to add large amounts of files spoanning different periods, stations and variables. I read in the files (some are column based, others row-based) and then add then by creating tables, stacking them, adding the stacked tables and unstack them again and creating a structure again. I do it this way to preserve a unique column (here time but this may be another variable). My test code is

% create structures with different dates

StructAdd.fdate=datenum('1984-01-01');

StructAdd.PO4=0.75;

StructIn.fdate=[datenum('1985-01-01'); datenum('1986-01-01')];

StructIn.NH4=[0.08; 0.055];

%-----------------

TableIn= struct2table(StructIn);

TableAdd= struct2table(StructAdd);

colvec=1:width(TableIn); colvec(1)=[]; % remove time column

StackIn = stack(TableIn,colvec,'IndexVariableName','detcode','NewDataVariableName','value');

colvec=1:width(TableAdd); colvec(1)=[]; % remove time column

StackAdd = stack(TableAdd,colvec,'IndexVariableName','detcode','NewDataVariableName','value');

StackTotal=vertcat(StackIn,StackAdd);

TableOut=unstack(StackTotal,'value',pivotfield,'AggregationFunction',@mean);

StructOut=table2struct(TableOut,'ToScalar',true);

%-----------------

disp('==============')

disp('PO4 data after adding first PO4 file with 1984-01-01: 0.75 data')

for i=1:length(StructOut.PO4)

disp([datestr(StructOut.fdate(i)) ' PO4: ' num2str(StructOut.PO4(i))])

end

disp('==============')

% Now add new data for dates already added for another variable

% --> if data is already present for this date the mean should be taken

StructAdd.fdate=datenum('1985-01-01');

StructAdd.PO4=0.2;

%-----------------

TableIn= struct2table(StructOut);

TableAdd= struct2table(StructAdd);

colvec=1:width(TableIn); colvec(1)=[]; % remove time column

StackIn = stack(TableIn,colvec,'IndexVariableName','detcode','NewDataVariableName','value');

colvec=1:width(TableAdd); colvec(1)=[]; % remove time column

StackAdd = stack(TableAdd,colvec,'IndexVariableName','detcode','NewDataVariableName','value');

StackTotal=vertcat(StackIn,StackAdd);

TableOut=unstack(StackTotal,'value',pivotfield,'AggregationFunction',@mean);

StructOut=table2struct(TableOut,'ToScalar',true);

%-----------------

disp('==============')

disp('PO4 data after adding second file with 1985-01-01: 0.2 data')

for i=1:length(StructOut.PO4)

disp([datestr(StructOut.fdate(i)) ' PO4: ' num2str(StructOut.PO4(i))])

end

disp('==============')

If I use @mean as the aggregate function in unstack I get

==============

PO4 data after adding first PO4 file with 1984-01-01: 0.75 data

01-Jan-1985 PO4: NaN

01-Jan-1986 PO4: NaN

01-Jan-1984 PO4: 0.75

==============

PO4 data after adding second file with 1985-01-01: 0.2 data

01-Jan-1985 PO4: NaN

01-Jan-1986 PO4: NaN

01-Jan-1984 PO4: 0.75

==============

If I use @nanmean as the aggregate function in unstack I get

==============

PO4 data after adding first PO4 file with 1984-01-01: 0.75 data

01-Jan-1985 PO4: 0

01-Jan-1986 PO4: 0

01-Jan-1984 PO4: 0.75

==============

PO4 data after adding second file with 1985-01-01: 0.2 data

01-Jan-1985 PO4: 0.1

01-Jan-1986 PO4: 0

01-Jan-1984 PO4: 0.75

==============

But neither of these is the answer I am looking for, which should give

==============

PO4 data after adding first PO4 file with 1984-01-01: 0.75 data

01-Jan-1985 PO4: NaN

01-Jan-1986 PO4: NaN

01-Jan-1984 PO4: 0.75

==============

PO4 data after adding second file with 1985-01-01: 0.2 data

01-Jan-1985 PO4: 0.2

01-Jan-1986 PO4: NaN

01-Jan-1984 PO4: 0.75

==============

The Matlab guidance for unstack states that the aggregate function chosen sets the fillvalue: I don't see any options to set this separately. How can I have NaN fillvalues and yet use nanmean as the aggregate function??

댓글 수: 0
이전 댓글 -2개 표시 이전 댓글 -2개 숨기기

댓글을 달려면 로그인하십시오.

이 질문에 답변하려면 로그인하십시오.

Follow Question

Answer 1

Seth Furman 2021년 11월 17일

MATLAB Online에서 열기

0 개 추천

Thank you for clarifying.

You can customize the fill value used in unstack by writing a custom aggregation function.

"If there are no data values to aggregate, because there are no data values corresponding to a given indicator value in ivar after unstacking, then unstack must fill an empty element in the unstacked output table. In that case, unstack calls the aggregation function with an empty array as input. The value that unstack fills in depends on what the aggregation function returns when its input is an empty array."

https://www.mathworks.com/help/matlab/ref/unstack.html#btzaco8-1-AggregationFunction

For example,

t = table(categorical(["a";"b";"c"]), [4;5;6], [1;2;3])
t = 3×3 table
    Var1    Var2    Var3
    ____    ____    ____

     a       4       1  
     b       5       2  
     c       6       3  
unstack(t, "Var2", "Var1", "AggregationFunction", @nanmean)
ans = 3×4 table
    Var3     a      b      c 
    ____    ___    ___    ___

     1        4    NaN    NaN
     2      NaN      5    NaN
     3      NaN    NaN      6
unstack(t, "Var2", "Var1", "AggregationFunction", @myNanmean)
Warning: When a group has no rows for a given value of the indicator variable, UNSTACK calls the supplied aggregation function with an input of size 0-by-1 instead of automatically filling the value. Review the output to ensure desired result is obtained. This warning might be removed in a future release.
ans = 3×4 table
    Var3    a    b    c
    ____    _    _    _

     1      4    0    0
     2      0    5    0
     3      0    0    6
function x = myNanmean(x)
    if isempty(x) % return a fill-value
        x = 0;
    else
        x = nanmean(x);
    end
end

댓글 수: 1
이전 댓글 -1개 표시 이전 댓글 -1개 숨기기

Sonja van Leeuwen 2021년 11월 18일

This works great, thanks!

댓글을 달려면 로그인하십시오.

Answer 2

Seth Furman 2021년 11월 16일

MATLAB Online에서 열기

0 개 추천

1. We highly recommend that you use datetime instead of datenum. Please let me know if you can't use datetime for some reason as we'd like to understand why.

e.g.

datetime(["1985-01-01"; "1986-01-01"], "Format", "uuuu-MM-dd")
ans = 2×1 datetime array
   1985-01-01
   1986-01-01

2. When your table data has time as a variable, it is often more more convenient to use a timetable instead.

e.g.

StructIn.fdate = datetime(["1985-01-01"; "1986-01-01"], "Format", "uuuu-MM-dd");
StructIn.NH4 = [0.08; 0.055];
TableIn = struct2table(StructIn);
TableIn = table2timetable(TableIn)
TableIn = 2×1 timetable
      fdate        NH4 
    __________    _____

    1985-01-01     0.08
    1986-01-01    0.055

3. This workflow of calling stack, unstack, and table2struct is unnecessary. Use outerjoin instead.

StructAdd.fdate = datetime("1984-01-01", "Format", "uuuu-MM-dd");
StructAdd.PO4 = 0.75;
StructIn.fdate = datetime(["1985-01-01"; "1986-01-01"], "Format", "uuuu-MM-dd");
StructIn.NH4 = [0.08; 0.055];
TableIn = struct2table(StructIn);
TableIn = table2timetable(TableIn)
TableIn = 2×1 timetable
      fdate        NH4 
    __________    _____

    1985-01-01     0.08
    1986-01-01    0.055
TableAdd1 = struct2table(StructAdd);
TableAdd1 = table2timetable(TableAdd1)
TableAdd1 = timetable
      fdate       PO4 
    __________    ____

    1984-01-01    0.75
TableOut = outerjoin(TableIn, TableAdd1)
TableOut = 3×2 timetable
      fdate        NH4     PO4 
    __________    _____    ____

    1984-01-01      NaN    0.75
    1985-01-01     0.08     NaN
    1986-01-01    0.055     NaN
StructAdd.fdate = datetime("1985-01-01", "Format", "uuuu-MM-dd");
StructAdd.PO4=0.2;
TableAdd2 = struct2table(StructAdd);
TableAdd2 = table2timetable(TableAdd2)
TableAdd2 = timetable
      fdate       PO4
    __________    ___

    1985-01-01    0.2
TableOut = outerjoin(TableOut, TableAdd2)
TableOut = 3×3 timetable
      fdate        NH4     PO4_TableOut    PO4_TableAdd2
    __________    _____    ____________    _____________

    1984-01-01      NaN        0.75             NaN     
    1985-01-01     0.08         NaN             0.2     
    1986-01-01    0.055         NaN             NaN     

Alternatively, you could use synchronize, since you're combining your data by time.

TableOut = synchronize(TableIn, TableAdd1)
TableOut = 3×2 timetable
      fdate        NH4     PO4 
    __________    _____    ____

    1984-01-01      NaN    0.75
    1985-01-01     0.08     NaN
    1986-01-01    0.055     NaN
TableOut = synchronize(TableOut, TableAdd2)
TableOut = 3×3 timetable
      fdate        NH4     PO4_TableOut    PO4_TableAdd2
    __________    _____    ____________    _____________

    1984-01-01      NaN        0.75             NaN     
    1985-01-01     0.08         NaN             0.2     
    1986-01-01    0.055         NaN             NaN     

댓글 수: 1
이전 댓글 -1개 표시 이전 댓글 -1개 숨기기

Sonja van Leeuwen 2021년 11월 17일

The reason I use datenum is because of legacy code, but also because it is nice and simple and I don't require anything more. Datetime is quite fancy with a lot more options then I need, datenum has all the right functionality and nothing more. The reason I don't use timetable is because it is not always time in the pivotting column (i.e. time might not feature at all in the table).

But the whole point is that I want the final table / structure to have only 1 column / field for PO4 (with entries 0.75 0.2 NaN). This is why I use the unstack, vertcat and stack method: outerjoin and synchronise do not seem to do this. So my main question remains: how can I end up with only 1 PO4 column (0.75 0.2 NaN). And how can I set a fillvalue for unstack separately from the aggregate function?

댓글을 달려면 로그인하십시오.

Unstack aggegate function and preset fill value do not match

댓글 수: 0
이전 댓글 -2개 표시 이전 댓글 -2개 숨기기

채택된 답변

댓글 수: 1
이전 댓글 -1개 표시 이전 댓글 -1개 숨기기

추가 답변 (1개)

댓글 수: 1
이전 댓글 -1개 표시 이전 댓글 -1개 숨기기

카테고리

제품

릴리스

태그

Community Treasure Hunt

Unstack aggegate function and preset fill value do not match

댓글 수: 0 이전 댓글 -2개 표시 이전 댓글 -2개 숨기기

채택된 답변

댓글 수: 1 이전 댓글 -1개 표시 이전 댓글 -1개 숨기기

추가 답변 (1개)

댓글 수: 1 이전 댓글 -1개 표시 이전 댓글 -1개 숨기기

카테고리

제품

릴리스

태그

참고 항목

Community Treasure Hunt

댓글 수: 0
이전 댓글 -2개 표시 이전 댓글 -2개 숨기기

댓글 수: 1
이전 댓글 -1개 표시 이전 댓글 -1개 숨기기

댓글 수: 1
이전 댓글 -1개 표시 이전 댓글 -1개 숨기기