Filter or delete data from (time)table depending of missing values

조회 수: 8 (최근 30일)
Hello dear community (from which I already read a lot - thank you).
I've got datasets (solar irradiation - GHI) for approx. 4 years with a 10mn timestep ; and it's full of holes. I need to get a hourly average of GHI and I want to exclude/delete any "hour" that hasn't 6 GHI values (for 0, 10, 20, 30, 40 and 50 minutes within eah hour, of each day, of each month, of each year).
I joined a sample of one dataset. I already know how to correct some flaws of the file, like to round some datetime value that are for exemple (HH:MM:SS) 10:09:59 instead of 10:10:10 (with dateshift 'start','minute','nearest').
I already tried "retime" by creating a timetable, filled with NaN values for missing GHI but this is not convenient as I also need access to the datetime vector (once filtered) for comparisons with other datasets. Is there anyway to avoid a loop (datasets sizes vary from 200k to 300k rows) ? If not, a hint would be appreciated.
Thank you very much
  댓글 수: 10
Paul MAUGER
Paul MAUGER 2021년 4월 12일
Thanks for the advice. I indeed thought I would gain some speed not keeping tables and array I don't need anymore.
Walter Roberson
Walter Roberson 2021년 4월 13일
Consider for example
bar = [1 2 3];
bar(2)
clear bar
bar(2)
if rand() < 0.5; bar = [4 5 6]; end
bar(2)
The first bar(2), bar is a variable and indexing is to be used.
The second bar(2), bar has been cleared. But look -- bar() is the name of a function, so the second bar(2) invokes the graphics function!
The third bar(2)... bar might have been assigned a value, or it might not have been, so the third bar(2) might be indexing or it might be a function call...
If there had been no clear of bar, then it would not be necessary to re-resolve the name and no need to have to insert the hooks to deal with the possibility that you are dealing with function instead of variable.
In the case where a variable is never going to be referred to again in the same function, then it is not clear whether there is a performance penalty for clearing the variable as opposed to assigning empty to it, but Mathworks has specifically guided to avoid clearing variables inside a loop.

댓글을 달려면 로그인하십시오.

채택된 답변

Walter Roberson
Walter Roberson 2021년 4월 10일
[Copying answer down from comments]
Snan = @(v)sum(v) + 0/(numel(v)==6);
datasum = rmmissing(retime(data, 'hourly', Snan));
(This was a bit more difficult to develop than might appear from the code.)
Note that this will treat partial hours (beginning / ending of data) as if the rest of the data is missing... which is true from one perspective.
Also note that this will always treat an hour entry as being the beginning of a new group of data. For example, 01:10, 01:20, 01:30, 01:40, 01:50, 02:00 would be treated as 5 entries for 01:xx and 1 entry for 02:xx and both hours would be declared as incomplete. If you want the exact hour to be treated with the previous hour, subtract 1 from the minutes (or seconds)
The 0/(numel(v)==6) is a hack to introduce a nan in the case where there are not exactly 6 elements, without coding a literal nan. Literal nans "pollute" the computation: you cannot, for example, eliminate a nan by saying NaN*0 .
Another way to write the code without that hack would be something like
Select = @(v, cond) v(cond+1);
Snan = @(v) Select([nan, sum(v)], numel(v) == 6);

추가 답변 (0개)

카테고리

Help CenterFile Exchange에서 Timetables에 대해 자세히 알아보기

태그

제품


릴리스

R2020a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by