Fill a timetable outside the gaps

조회 수: 13 (최근 30일)
giannit
giannit 2021년 5월 14일
댓글: giannit 2021년 5월 22일
Consider the following timetable, which is just a toy to explain the problem
dates = [datetime(2020,1,1) datetime(2020,1,1) datetime(2020,1,2) datetime(2020,1,2) datetime(2020,1,3) datetime(2020,1,3) datetime(2020,1,4)];
t = timetable(dates',["a" "c" "a" "b" "a" "a" "c"]',[2 1 3 2 1 2 1]')
Time Var1 Var2
___________ ____ ____
01-Jan-2020 "a" 2
01-Jan-2020 "c" 1
02-Jan-2020 "a" 3
02-Jan-2020 "b" 2
03-Jan-2020 "a" 1
03-Jan-2020 "a" 2
04-Jan-2020 "c" 1
For each date and for each unique string in Var1 I would like to obtain the sum of the values in Var2. That is:
(1) if for a specific date a string does not appear, then 0 must be returned
(2) if for a specific date a string appears only once, then the corresponding value in Var2 must be returned
(3) if for a specific date a string appears multiple times, then the sum of the corresponding values in Var2 must be returned
The function retime satisfy (2) and (3) completely, but (1) only partially in the sense that it returns 0 only if the string appears in a previous date AND in a following date, that is retime only fills the gaps.
In the following the row with 04-Jan-2020 and 0 is missing
retime(t(t.Var1=="a","Var2"),'daily','sum')
Time Var2
___________ ____
01-Jan-2020 2
02-Jan-2020 3
03-Jan-2020 3
In the following the rows with dates 1, 3, 4 january (and 0 in Var2) are missing
retime(t(t.Var1=="b","Var2"),'daily','sum')
Time Var2
___________ ____
02-Jan-2020 2
The following is correct
retime(t(t.Var1=="c","Var2"),'daily','sum')
Time Var2
___________ ____
01-Jan-2020 1
02-Jan-2020 0
03-Jan-2020 0
04-Jan-2020 1
Is there an easy way to resolve the problem and without using for loops? The data I'm working with has thousands of rows and tens of columns, I tried with loops but it's time consuming.
In the toy example an easy workaround is to manually add the missing strings at the start and at the end of the timetable, in such a way we can extend the gaps to cover all the datetimes and then retime can fill all the missing values.
However, this is easy only for timetables with a simple structure, moreover it might have some unwanted side effects
t = [timetable(datetime(2020,1,1),"b",0) ; t ; timetable([datetime(2020,1,4);datetime(2020,1,4)],["a";"b"],[0;0])]
Time Var1 Var2
___________ ____ ____
01-Jan-2020 "b" 0
01-Jan-2020 "a" 2
01-Jan-2020 "c" 1
02-Jan-2020 "a" 3
02-Jan-2020 "b" 2
03-Jan-2020 "a" 1
03-Jan-2020 "a" 2
04-Jan-2020 "c" 1
04-Jan-2020 "a" 0
04-Jan-2020 "b" 0

채택된 답변

Siddharth Bhutiya
Siddharth Bhutiya 2021년 5월 19일
You could use groupsummary to do this.
In your case your grouping variables would be Time and Var1 and the aggregation method you want to use would be sum. Since you want all permutations of the grouping variables to show up in your output, you can specify the "IncludeEmptyGroups" as true and that should give you the desired output.
groupsummary(t,["Time","Var1"],"sum","Var2","IncludeEmptyGroups",true)
ans =
12×4 table
Time Var1 GroupCount sum_Var2
___________ ____ __________ ________
01-Jan-2020 "a" 1 2
01-Jan-2020 "b" 0 0
01-Jan-2020 "c" 1 1
02-Jan-2020 "a" 1 3
02-Jan-2020 "b" 1 2
02-Jan-2020 "c" 0 0
03-Jan-2020 "a" 2 3
03-Jan-2020 "b" 0 0
03-Jan-2020 "c" 0 0
04-Jan-2020 "a" 0 0
04-Jan-2020 "b" 0 0
04-Jan-2020 "c" 1 1
  댓글 수: 5
Siddharth Bhutiya
Siddharth Bhutiya 2021년 5월 20일
편집: Siddharth Bhutiya 2021년 5월 20일
@giannit When I first read your comment about missing date my thought was well for groupsummary Time is just a grouping variable, so as far as it is concerned there is no difference between 02-Jan-2020 or 5-Jan-2020 or even 20-May-2021, all these are just values for it so what should it consider as a missing value?
But then I thought about it some more and took another look at the problem and it made more sense. Technically 02-Jan-2020 is not missing but in your final output you want the Time variable to be in increments of 1 day starting from min(Time) to the max(Time). This seemed like a perfectly reasonable thing to do when someone is trying to summarize their data, so I went back to the documentation page for groupsummary and found the groupbins argument. So it seems that you can get the desired results by doing the following
T = timetable(datetime(2020,1,[1 1 1 3 3 4])',["a" "c" "b" "a" "a" "c"]',[2 1 2 1 2 1]');
groupsummary(T,["Time","Var1"],["day","none"],"sum","Var2","IncludeEmptyGroups",true)
ans =
12×4 table
day_Time Var1 GroupCount sum_Var2
___________ ____ __________ ________
01-Jan-2020 "a" 1 2
01-Jan-2020 "b" 1 2
01-Jan-2020 "c" 1 1
02-Jan-2020 "a" 0 0
02-Jan-2020 "b" 0 0
02-Jan-2020 "c" 0 0
03-Jan-2020 "a" 2 3
03-Jan-2020 "b" 0 0
03-Jan-2020 "c" 0 0
04-Jan-2020 "a" 0 0
04-Jan-2020 "b" 0 0
04-Jan-2020 "c" 1 1
This method should also give you the correct values for the GroupCounts as opposed to the ad hoc method in the previous comment. Hope this helps !!
giannit
giannit 2021년 5월 22일
Wow perfect! I hoped for a one line command but I was lost in the documentations, thank you very much for the big help!
To convert the table to a timetable, is it fine to do this?
T = timetable(datetime(2020,1,[1 1 1 3 3 4])',["a" "c" "b" "a" "a" "c"]',[2 1 2 1 2 1]');
t = groupsummary(T,["Time","Var1"],["day","none"],"sum","Var2","IncludeEmptyGroups",true);
t.day_Time = datetime(string(t.day_Time));
tt = table2timetable(t)
thanks again!

댓글을 달려면 로그인하십시오.

추가 답변 (2개)

dpb
dpb 2021년 5월 14일
편집: dpb 2021년 5월 14일
Illustrate the concept --
tt.Var1=categorical(tt.Var1); % convert Var1 to categorical as rightfully is
dRef=datetime(tt.Time(1):tt.Time(end)).'; % get the full date vector
vRef=unique(tt.Var1); % and the unique values of Var1
% build a reference timetable of all times/characteristics
ttRef=timetable(reshape(repmat(dRef.',size(vRef,1),1),[],1), repmat(vRef,size(dRef,1),1));
ttRef.Var2=zeros(height(ttRef),1); % add Var2 column of zeros
[~,ib]=ismember(tt(:,{'Var1'}),ttRef(:,{'Var1'})); % locate the ones that are present in reference
ttRef.Var2(ib)=tt.Var2; % and replace with start
The above for the example leads to a fully-augmented timetable of:
ttRef =
12×2 timetable
Time Var1 Var2
___________ ____ ____
01-Jan-2020 a 2
01-Jan-2020 b 0
01-Jan-2020 c 1
02-Jan-2020 a 3
02-Jan-2020 b 2
02-Jan-2020 c 0
03-Jan-2020 a 2
03-Jan-2020 b 0
03-Jan-2020 c 0
04-Jan-2020 a 0
04-Jan-2020 b 0
04-Jan-2020 c 1
K>>
for which the previous rowfun solution will work
  댓글 수: 1
dpb
dpb 2021년 5월 15일
NB: The above construction left out the insertion of the duplicated initial rows...remember to reinsert them as well.

댓글을 달려면 로그인하십시오.


dpb
dpb 2021년 5월 14일
K>> rowfun(@sum,t,"InputVariables",'Var2','GroupingVariables',{'Time','Var1'})
ans =
6×3 timetable
Time Var1 GroupCount Var3
___________ ____ __________ ____
01-Jan-2020 "a" 1.00 2.00
01-Jan-2020 "c" 1.00 1.00
02-Jan-2020 "a" 1.00 3.00
02-Jan-2020 "b" 1.00 2.00
03-Jan-2020 "a" 2.00 3.00
04-Jan-2020 "c" 1.00 1.00
K>>
I'd just fill in the missing dates using ismember or setdiff to create the vector of missing dates as compared to a full vector from the first to last date in the timetable. That's still a set of vector operations on the resulting table.
Is there to be a "0" entry for each value of Var1, too, I suppose?
SOMEWHAT LATER ADDENDUM
I didn't try to write it, but seems you could use retime and custom function to fill the missing entries in the original timetable with zeros for each missing date/category; perhaps in conjunction with rowfun to using the grouping variable to cover the categories.
All of these, of course, have the looping construct in them, just at a lower level...

카테고리

Help CenterFile Exchange에서 Tables에 대해 자세히 알아보기

제품


릴리스

R2021a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by