Hello, I have multiple rain stations in a catchment. I chose 3 of them that are close to each other. I have 10 years of data from each station, but all 3 time series have data gaps. I chose 3 years of data without gaps (apart from minor gaps of 1 or 2 days) within the time series to find the correlation, as below. Then I did multiple linear regression, as shown below.

Questions:

1) Is this the right procedure to fill data gaps in time series using multiple linear regression (MLR)?
2) stats has 6 values. Kindly help me understand what each value means. (If it were 1x4, then I know it is R-squared, F-statistic, p-value, significance.)
3) When I used the equation with a randomly picked observation, the actual value of y (which I know from the time series) is 0.4, but the predicted value is 5.6.
4) If I have a really strong correlation coefficient, why is the predicted value not close to the actual one?

Any help would be appreciated. Thank you in advance.

filling data gaps in time series using multiple linear regression

Star Strider on 13 Nov 2021

I am not exactly certain what the code is doing; however, it would likely be worth reading the data using readtimetable and using the retime function to interpolate the missing values.

Timetable arrays are best for this sort of work. All the necessary functions already exist, and they are relatively easy to learn to use effectively.
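For readers unfamiliar with that route, here is a minimal sketch of what retime does to a gap. The dates and rainfall values are invented for illustration, and the variable names are hypothetical; only timetable and retime are taken from the suggestion above.

```matlab
% Five daily readings with a two-day gap (4 and 5 Jan are missing).
t    = datetime(2021,1,1) + days([0 1 2 5 6])';
rain = [0; 2.5; 1.0; 0; 4.0];            % made-up rainfall values
TT   = timetable(t, rain);

% Resample to a regular daily grid; the missing days are filled by
% linear interpolation in time between the neighbouring readings.
TTfilled = retime(TT, 'daily', 'linear');
disp(TTfilled)
```

Note that the two filled days receive values lying on the straight line between the readings on 3 Jan and 6 Jan, whether or not any rain fell on those days.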


Muhammad Haris Siddiqui on 13 Nov 2021

Thanks for the reply.

I don't want to interpolate. I want to do Multiple Linear regression to fill the gaps in time series.

Yes, you are right that timetable arrays are best for time-stamped data, but only for interpolation, not for regression.

Dave B on 13 Nov 2021

Regarding question 2: this is not stats; it's what the documentation calls bint (the lower and upper confidence bounds for what the documentation calls b, which you have named R). For instance, the 95% CI for the offset estimate (.0685) is -.0432 to .1803.

To get stats, you want the 5th output. If you don't care about the others you can use ~:

[R,~,~,~,stats]=regress(...)
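To make the outputs concrete, here is a minimal sketch on synthetic data. The variable names and the simulated coefficients are made up for illustration, and regress needs the Statistics and Machine Learning Toolbox. With an intercept and two predictors, bint is 3x2, which would account for six values, while stats is a 1x4 vector holding R-squared, the F statistic, its p-value, and an estimate of the error variance.

```matlab
rng(0);                                   % reproducible synthetic data
n  = 200;
x1 = rand(n,1);                           % hypothetical predictor station 1
x2 = rand(n,1);                           % hypothetical predictor station 2
y  = 0.5 + 2*x1 - x2 + 0.1*randn(n,1);    % hypothetical response station

X = [ones(n,1) x1 x2];                    % the column of ones gives the offset
[b, bint, r, rint, stats] = regress(y, X);

disp(size(bint))    % one row per coefficient: lower and upper 95% bounds
disp(size(stats))   % [R-squared, F statistic, p-value, error variance]
```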

Star Strider on 13 Nov 2021

My pleasure!

‘I don't want to interpolate. I want to do Multiple Linear regression to fill the gaps in time series.’

That’s interpolation, even if you don’t want to call it that!

That technique uses linear interpolation, although much less efficiently than the existing functions available to use with timetable arrays.

If the intent is to do a multiple linear regression on the existing data, do the regression on the data interpolated using retime, or simply do the regression on the data with missing values. The regression algorithms have no idea what the data are, don’t care if there are any missing values or anything else, so long as all the values are finite and real (in this instance), and the dimensions match.


dpb on 13 Nov 2021

Edited: dpb on 15 Nov 2021

If you want to try to interpolate a missing station reading for a given instance at times for which the alternate stations are available, then perhaps the MLR option would make some sense. This would not be interpolation with time, however; your model has no time component.

For precipitation data, that's probably at least as good as, and perhaps better than, using interpolation over time for the given station, since precipitation events occur stochastically, not with any functional form.

However, any technique such as this is highly fraught with danger, particularly when applied globally without serious model checking including visualization.

I would venture your replacement of NaN values with zeros and doing the regression with those is less reliable than if you were to remove those observations entirely from the data set.

To then use this model, you would have to apply it individually to the observations where there is data for the predictor variables and the response station is missing, not globally. It would also this way need a model for each missing station using some other set of predictors for each; it would not be reasonable to assume the same model would be at all meaningful for all missing locations.
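A minimal sketch of that workflow, on synthetic data since the real data set is not available: fit only on rows where all three stations have readings, then predict only at rows where the response is missing and both predictors are present. The station names A, B, C, the simulated rainfall, and the clamp to non-negative values are all assumptions made for illustration.

```matlab
rng(1);
n    = 500;
base = max(0, 4*randn(n,1));                 % synthetic shared rainfall signal
A = max(0, base + 0.5*randn(n,1));           % hypothetical predictor station A
B = max(0, base + 0.5*randn(n,1));           % hypothetical predictor station B
C = max(0, 0.6*A + 0.4*B + 0.5*randn(n,1));  % hypothetical response station C
C(randsample(n,50)) = NaN;                   % knock out some response readings
A(randsample(n,20)) = NaN;                   % and a few predictor readings

complete = ~any(isnan([A B C]), 2);          % rows with all three stations present
X = [ones(n,1) A B];
b = regress(C(complete), X(complete,:));     % fit on complete rows only

fillable = isnan(C) & ~isnan(A) & ~isnan(B); % gap in C, predictors available
Cfilled  = C;
Cfilled(fillable) = max(0, X(fillable,:)*b); % assumption: clamp negative predictions to 0
```

Rows where C is missing and a predictor is also missing stay NaN in Cfilled; they would need a different model built from other stations.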

It's an interesting concept; it would be far easier to make realistic comments if you were to attach the dataset for folks to poke at.

As @Dave B says, you've misinterpreted the return values from regress() above; read the documentation much more carefully before proceeding.

Dave B on 13 Nov 2021

I was also suspicious of the replacement of NaN's with 0 values. If there are cases where you have NaN for more than one station, then this will artificially raise the correlation coefficients and simultaneously reduce the (true) predictability of the model.

Another thought: whenever I'm dealing with regression problems I like to use a plot for a reality check. Consider: can you make a plot to see what's going on here? This can be a little tricky with multiple dimensions, but you might be able to visualize this with scatter3 for the raw data and either scatter3 or a surface for the predicted results. It will likely be easier to do this if you choose a subset of the data for this approach.

dpb on 13 Nov 2021

I think it will be highly dependent upon the type of rain events prevalent in the given area--if rainfall tends to be widespread, the idea would seem to have merit. If, OTOH, most rainfall is local thunderstorms as it is here where I am located, unless the predictor stations are very close indeed, the chances are they won't be particularly good surrogates.

We can get a nice rain here at the house, but often by the time you get to the east edge of town, only 2.5 miles W, it may have only sprinkled, if it has done anything at all. Of course, it can also be entirely the reverse as well. We sat at the dining room table (the room about 14' across) of the small house I grew up in at noon one day when I was a kid and watched the rain run off the eaves out the west window for almost 30 minutes before the east half of the house got wet. "It has to end somewhere."

This kind of thing is why serious model verification and data exploration would be imperative.

Muhammad Haris Siddiqui on 13 Nov 2021

@Dave B Thank you for the answer to Q2. I got it; I should have read the documentation carefully.

@dpb you understand the problem correctly.

"your model has no time component"

I synchronized all the time series by time step (start date and end date) and then removed the time component to make the data easier to handle.

The stochastic nature of rainfall is what urged me to use MLR instead of vertical interpolation (along time) of each individual time series (station).

That is why I first checked the correlation coefficients of these 3 time series, which appear to be very strong.

@Dave B @dpb Yes, that is my mistake. I should have removed the NaN rows instead of converting them to zeros.

Sorry, I can't attach the data set, as it is confidential and protected by government laws.

dpb on 13 Nov 2021

As @Dave B and I have both noted, simply a correlation coefficient isn't nearly enough to give me enough assurance to blindly apply the technique proposed...

Muhammad Haris Siddiqui on 13 Nov 2021

Anyway, thank you so much for the guidance.

If you have any other ideas on how to fill gaps in time series using MLR, kindly post them here.

Dave B on 14 Nov 2021

A challenge here might be a preponderance of zero data. Suppose I took two weather stations that are far apart but in the same general region. We might expect that they have rain in the same season and that on days when it rains at one it's more likely to rain at the other, but the amount of rain on those days might be totally uncorrelated.

This would produce a very high correlation coefficient, because those days where it doesn't rain have identical values (0) and those days where it does rain it's more likely to rain in both. Here's a very very reduced example:

x=zeros(10000,1);
y=zeros(10000,1);
n=500;
r=randsample(10000,n);
% when it rains, it typically rains in both, but an uncorrelated amount
x(r)=(rand(n,1)>.05).*(3*randn(n,1)+20);
y(r)=(rand(n,1)>.05).*(3*randn(n,1)+20);
scatter(x,y,'.')
rho=corr(x,y);
[b,~,~,~,stats]=regress(y,[x ones(size(x)) ]);
xi=xlim;
yi=polyval(b,xi);
hold on
plot(xi,yi);
[stats(1) rho^2] % just a reality check

ans = 1×2
    0.8567    0.8567
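As a follow-up check in the same spirit, the sketch below regenerates comparable synthetic data and compares the correlation over all days with the correlation over only the days on which both stations recorded rain. The names rhoAll, rhoWet, and wet are introduced here for illustration; the data-generating lines are the ones from the example above.

```matlab
rng(2);                                   % reproducible synthetic data
N = 10000; n = 500;
x = zeros(N,1); y = zeros(N,1);
r = randsample(N,n);                      % the rainy days, shared by both stations
x(r) = (rand(n,1)>.05).*(3*randn(n,1)+20);
y(r) = (rand(n,1)>.05).*(3*randn(n,1)+20);

rhoAll = corr(x,y);                       % dominated by the shared zeros
wet    = x>0 & y>0;                       % days with rain at both stations
rhoWet = corr(x(wet), y(wet));            % amounts were generated independently
fprintf('all days: %.3f, wet days only: %.3f\n', rhoAll, rhoWet);
```

If the real stations show a similar collapse on wet days, the high overall coefficient says little about how well the rainfall amounts can be predicted.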

dpb on 14 Nov 2021

I would think one would want to build the model including data observations only where

1. there are observations for all the proposed predictor locations and the response location, and
2. there is measurable accumulation at at least one of the predictor locations or the response location.

I would investigate very thoroughly the preponderance of instances of the latter in 2. above -- there being observed accumulation at the response station with no accumulation at any of the proposed predictors. That, of course, is certainly not the only diagnostic/model verification step that should be taken, but it's one very obvious one.
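A minimal sketch of both the row selection and that diagnostic, again on synthetic data. The stations A, B, C, the wet-day probability, and the "local shower" mechanism at C are all invented for illustration; the point is only the logical masks.

```matlab
rng(3);
n = 2000;
wetDay    = rand(n,1) < 0.2;                     % synthetic widespread rain days
localOnly = rand(n,1) < 0.02;                    % synthetic isolated showers at C
A = wetDay .* max(0, 8 + 3*randn(n,1));          % hypothetical predictor A
B = wetDay .* max(0, 8 + 3*randn(n,1));          % hypothetical predictor B
C = wetDay .* max(0, 8 + 3*randn(n,1)) + localOnly .* (5*rand(n,1));
C(randsample(n,100)) = NaN;                      % gaps in the response station

present = ~any(isnan([A B C]), 2);               % condition 1: all stations observed
anyRain = any([A B C] > 0, 2);                   % condition 2: rain at some station
useRows = present & anyRain;                     % rows to build the model from

% Diagnostic: rain at the response with none at any predictor.
missedByPredictors = present & C > 0 & A == 0 & B == 0;
fprintf('%d of %d model rows have rain at C only\n', ...
    nnz(missedByPredictors), nnz(useRows));
```

A large share of such rows would suggest the predictors are poor surrogates for the response station, however high the overall correlation.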

Muhammad Haris Siddiqui on 14 Nov 2021

@Dave B Thank you for your reply.

I understand your concern about the preponderance of zero data.

Let me be clear here that the stations I selected are not far apart (in a relatively huge catchment). The measured distances are 1.7 km and 5.5 km.

So if it rains, it rains at all 3 rain gauges; if not, then we have zeros at all three.

I think that's why I have such a high correlation (0.9) and also an R-squared value of 0.9.

" but the amount of rain on those days might be totally uncorrelated. " => I doubt that. Some fraction of it may be uncorrelated, which is a natural phenomenon, but not totally.

We can also consider instrumentation error and the type of rain gauge (tipping bucket, standard, etc.),

but in the end this is what hydrologists usually have to deal with.

@dpb Thank you for your reply.

" It would also this way need a model for each missing station using some other set of predictors for each; it would not be reasonable to assume the same model would be at all meaningful for all missing locations."

Indeed, a single model cannot be applied to every station with missing data.

I will select the closest stations in the whole catchment to perform gap filling using different models once I know the procedure.

@Dave B @dpb I am obliged for this informative discussion.

Dave B on 14 Nov 2021

@Muhammad Haris Siddiqui glad it's helpful.

I'm not sure if you saw this above, but I really do think that plotting is a good place to start. This is really where MATLAB shines, because you can take a problem and have a sort of interactive data interrogation instead of just trying something and seeing whether it works or not.

I often start with the simplest case I can and go from there. Suppose you start with just using one station to predict another: it's really easy to plot (raw data, fit line, predictions). You can see how well it works and where it's failing, and a plot will tell you why it's doing what it's doing.
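A minimal sketch of that single-station starting point, on synthetic data. The stations A and C and their simulated relationship are assumptions for illustration; the left panel shows the raw data with the fitted line, and the right panel shows the residuals, which is often where a poor fit becomes obvious.

```matlab
rng(4);
n = 300;
A = max(0, 6*randn(n,1));                 % hypothetical predictor station
C = max(0, 0.8*A + randn(n,1));           % hypothetical response station

X    = [ones(n,1) A];
b    = regress(C, X);                     % b(1) = offset, b(2) = slope
Chat = X*b;                               % predictions at the observed A values

figure
subplot(1,2,1)
scatter(A, C, '.')                        % raw data
hold on
xi = [min(A); max(A)];
plot(xi, [ones(2,1) xi]*b)                % fitted line
hold off
xlabel('station A'); ylabel('station C'); title('data and fit')

subplot(1,2,2)
scatter(A, C - Chat, '.')                 % residuals against the predictor
xlabel('station A'); ylabel('residual'); title('residuals')
```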

Then expand to multiple stations. You'll need to adjust your plot: you might consider looking at the prediction of each station independently as above, and then the combined prediction using some combination of tools like scatter3 or surface.

dpb on 14 Nov 2021

Again, the effectiveness of this model will be highly dependent upon the type of rainfall event your particular catchment sees. If it is indeed true that "when it rains, it rains" over a large area, then it will likely be a fairly representative surrogate. If it's AZ or SW KS, I'd venture "not so much".

I wholeheartedly agree with @Dave B that visualization should be a key component of exploratory model-building and verification; simply relying on blind correlation is not science.

Muhammad Haris Siddiqui on 15 Nov 2021

@Dave B @dpb I agree with both of you. Thank you for the time and effort you put into this problem.

Indeed, it is a valuable discussion.
