MATLAB Answers

Matlab parfor saves and loads temporary variables during execution!

SK 21 Jul 2018
Edited: Matt J 29 Jul 2018
The parallel pool implementation appears to use what are effectively the normal "save" and "load" channels when it makes copies of variables to pass to the workers. This is really bad: if a class implements saveobj(), the state of the variable can be modified during execution, which, depending on the circumstances, can lead to "unexplained" crashes.
Let's say I have a class object that needs a large amount of temporary data during execution. When the object is saved, I would naturally want to get rid of this temporary data. A convenient way to do this is to write a saveobj() that clears the temporary data within the class. If, however, the object is saved and reloaded during execution, as happens with parfor, there are problems.
Why is the copy not made internally?
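The scenario described above can be sketched with a small value class. This is only an illustration: the class name TempHolder and its properties result and scratch are hypothetical and do not come from the original code. An ordinary save/load round trip is enough to show the temporary data being dropped.

```matlab
% File: TempHolder.m  (hypothetical class, for illustration only)
classdef TempHolder
    properties
        result  = [];   % data worth keeping
        scratch = [];   % large temporary data used during computation
    end
    methods
        function obj = TempHolder(n)
            % Fill both properties so the effect of saveobj is visible.
            if nargin > 0
                obj.scratch = rand(n, 1);
                obj.result  = sum(obj.scratch);
            end
        end
        function s = saveobj(obj)
            % Called by save(); drops the temporary data before serialization.
            obj.scratch = [];
            s = obj;
        end
    end
end
```

A possible usage script, run from a folder where TempHolder.m is on the path:

```matlab
a = TempHolder(1000);
f = [tempname '.mat'];
save(f, 'a');                 % triggers TempHolder.saveobj
b = load(f);                  % b.a is the reloaded copy
disp(isempty(b.a.scratch))    % the scratch data did not survive the round trip
delete(f);
```

If parfor transfers objects through the same mechanism, a copy that reaches a worker would be missing scratch in the same way.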

  1 Comment

Matt J 29 Jul 2018
Another situation where this will be an issue is when calling PACK. The documentation doesn't mention that saveobj is called when pack() is invoked. But it is.


Accepted Answer

Matt J 21 Jul 2018
Edited: Matt J 21 Jul 2018
I don't know why authoritatively, but it seems to me that this is only a danger if you are unaware of this behavior on the part of parfor. If you are aware of it, you would surely write a loadobj() method to restore the temporary data when reloaded. And, this would be better than broadcasting copies of the temporary data to all the workers. The latter is done serially and would take a lot more time.
Incidentally, have you verified that saveobj() is triggered in this scenario just as in an ordinary call to save()?
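One way to check this, and to see the loadobj() approach at the same time, is sketched below. The class name LoggedHolder and its properties are hypothetical. The sketch assumes the temporary data can be recomputed from the saved state (here with cumsum) and that load() hands loadobj() an object rather than a struct.

```matlab
% File: LoggedHolder.m  (hypothetical class, for illustration only)
classdef LoggedHolder
    properties
        base    = [];   % persistent state
        scratch = [];   % temporary data derived from base
    end
    methods
        function obj = LoggedHolder(base)
            if nargin > 0
                obj.base    = base;
                obj.scratch = cumsum(base);
            end
        end
        function s = saveobj(obj)
            fprintf('saveobj called\n');   % makes the call visible
            obj.scratch = [];              % drop the derived data
            s = obj;
        end
    end
    methods (Static)
        function obj = loadobj(s)
            % Rebuild the derived data so that load() inverts save().
            obj = s;
            obj.scratch = cumsum(obj.base);
        end
    end
end
```

A possible check, assuming Parallel Computing Toolbox is available and a pool is open:

```matlab
h = LoggedHolder(1:5);
out = zeros(1, 4);
parfor k = 1:4
    out(k) = numel(h.scratch) + k;   % h is broadcast to the workers
end
```

If the message from saveobj() is printed when the loop starts, the method is being triggered by parfor. Because loadobj() rebuilds scratch, the copies on the workers should still be usable.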

  7 Comments (4 older comments not shown)
SK 28 Jul 2018
If you read the response I gave to Edric, you will see that it is not just one type of problem that I'm talking about. In some types of class design, save() throws away state information and it can't be recovered. Actually, that type of design is forced by the Matlab ClassificationXXX classes, and if one wants to wrap them there is no alternative to throwing away unrecoverable state information from the class object. The choice then is between having a specially named save method (since the generic save() becomes fundamentally unusable for ensembles of these classes) or fixing up the save via saveobj() to do the best it can, as I have done. I'm not sure that either option is better than the other. (If I were to design the Matlab classification classes I would do them differently, but that is a different topic.) The original post was motivated by a problem I encountered with these classes.
In all other cases the state information thrown away is the result of computation.
The parallel pool problem affects both types of cases.
> When you overload builtin methods that other users or other MATLAB mechanisms might call (e.g., the parallel pool), it is best not to stray too far from the purpose of the original method, and the purpose of save() is to store the object in a fully recoverable form.
I don't think throwing away temporary information is straying too far from the original purpose of the save() method, as long as it is documented in the class. These special purpose computation classes are not typically generic ones that are designed to be "plug and play".
> But in your current code, any function that saves this object has to know about save().
Any code that saves my class object just has to know that the temporary state is not saved - and it can decide whether or not to react in some way to that information. Very often not saving the temporary state is acceptable and code using my class need not do anything special. If the code needs the temporary state, it just calls appropriate class methods that are conveniently provided to get the computed temporary information prior to saving my class object using exactly the same function call as any other object - that is a big difference.
Matt J 28 Jul 2018
> I don't think throwing away temporary information is straying too far from the original purpose of the save() method, as long as it is documented in the class.
Except that you want your class to work with parallel pools. Parallel pools, and maybe various other things in Matlab, expect load() to be the 1-to-1 inverse of save(). The burden is on you to meet those interface requirements. They will not conform to your class documentation.
> In some types of class design, save() throws away state information and it can't be recovered. Actually that type of design is forced by the Matlab ClassificationXXX classes and if one wants to wrap them there is no alternative to throwing away unrecoverable state information from the class object.
Well, no, the alternative is to not call compact() in your saveobj method and to endure the additional consumption of disk space that this will bring about. Then you will have no conflict with parallel pool or other Matlab toolboxes.
I can appreciate, however, that storing multiple copies of the same data to disk is unappealing. But there are still options if you want to avoid that, and which don't involve breaking the 1-1 correspondence of save/load. One way is to store objects sharing common data together in arrays. Then your saveobj/loadobj pair can do things like this,
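classdef myclass
    properties
        data   % data shared by all objects in the array
    end
    methods
        function s = saveobj(objArray)
            % Keep one copy of the shared data and strip it from every element.
            s.shareddata = objArray(1).data;
            [objArray.data] = deal([]);
            s.objArray = objArray;
        end
    end
    methods (Static)
        function objArray = loadobj(s)
            % Restore the shared data to every element on load.
            objArray = s.objArray;
            [objArray.data] = deal(s.shareddata);
        end
    end
end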
Unfortunately, it doesn't look like ClassificationXXX classes let you do this, but that might be what ClassificationEnsemble classes are intended for.
SK 29 Jul 2018
> Unfortunately, it doesn't look like ClassificationXXX classes let you do this, but that might be what ClassificationEnsemble classes are intended for.
Yes, but you may want to roll your own ensemble classes (as in my case) in a way that doesn't fit into the ClassificationEnsemble mould.
Regarding your other remarks, you are essentially asking me to "put up or shut up". I've been doing that for a while with Matlab but I do make the occasional post with my criticisms which I think are valid ones.
In my opinion, if save() can be customized via saveobj(), then the parallel pool should not be using it as a means of passing data between processes. Data transfer can be done in some other way, avoiding the custom saveobj(). The effect should be to make a deep copy of the object and save its bits and nothing else. Of course it would involve a little more effort on the part of implementors.
Anyway, I don't want to sound like I'm giving advice to those who write the internals of Matlab since I have no idea of the problems they face in the context of the overall design of Matlab. Moreover, even if they are doing something wrong, the software is theirs to mess up and not mine. So I would prefer to end the discussion here.
Thank you for your comments.

로그인 to comment.

More Answers (1)

Edric Ellis 23 Jul 2018
The workers executing the body of any parfor loop are separate MATLAB processes, so the only reliable way that variables existing on the client can be sent to those workers is by doing something equivalent to calling save on the client, and then load on the workers. (The same procedure is used, but no files are created on disk).
I must admit, it's not clear to me what the benefit is of writing a saveobj that cannot be reversed by a loadobj. (Also, have you considered using Transient fields in your class?)
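For reference, a Transient property can be sketched as follows. The class name TransientHolder and its properties are hypothetical. The sketch relies on the documented behavior that Transient properties are not written by save() and take their default value when the object is loaded.

```matlab
% File: TransientHolder.m  (hypothetical class, for illustration only)
classdef TransientHolder
    properties
        result = [];    % saved normally
    end
    properties (Transient)
        scratch = [];   % not written by save(); reverts to this default on load
    end
end
```

A possible usage script:

```matlab
t = TransientHolder;
t.result  = 42;
t.scratch = rand(100, 1);
f = [tempname '.mat'];
save(f, 't');
s = load(f);      % s.t.result is kept; s.t.scratch comes back as []
delete(f);
```

No saveobj() is needed in this variant, but the temporary data is still absent after any save/load round trip.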

  1 Comment

SK 28 Jul 2018
Sorry, I didn't see this post earlier.
Let's say a class needs to do some long computation where its (typically private) methods need to share a number of variables that are not needed after the long computation. Instead of passing these variables back and forth between methods, these common variables are stored in the class object. This issue is particularly relevant for a language designed for computation: It is not unusual for a class to be designed to do a few complex computations on any given piece of data - the results of interest are often best stored in a struct so that they can be easily shared with other people or with other classes that do not need to know about how the results were created. The actual class object is only used during the course of the computation which can be long and complex.
Another concrete example, different in nature from the previous one, involves one of the Matlab toolboxes: look at any of the ClassificationXXX classes in Matlab. They have a method called compact(). This method empties the data from the variable X in the class object. If multiple such objects are created with the same data (this happens often), then a natural thing to do before saving an object of this class would be to save one instance of X somewhere and call compact() on all the objects. However, you don't want this to happen in the middle of a computation.
'Transient' is convenient: Thanks for the tip (I had seen it before but had forgotten about it). However it doesn't help in this context, because the object state needs to be restored.
For the moment my gut_on_save flag seems to do the job.
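The thread does not show how the gut_on_save flag is implemented. One possible reading, given purely as an assumption, is a property that saveobj() consults before discarding the temporary data, so that implicit saves (such as those made by parfor) leave the object intact. The class name FlaggedHolder and the other property names are hypothetical.

```matlab
% File: FlaggedHolder.m  (hypothetical; not SK's actual code)
classdef FlaggedHolder
    properties
        result      = [];
        scratch     = [];      % temporary data
        gut_on_save = false;   % strip scratch only when explicitly requested
    end
    methods
        function s = saveobj(obj)
            % Discard the temporary data only if the caller opted in.
            if obj.gut_on_save
                obj.scratch = [];
            end
            s = obj;
        end
    end
    methods (Static)
        function obj = loadobj(s)
            obj = s;
            obj.gut_on_save = false;   % loaded copies start with the safe setting
        end
    end
end
```

With such a design, code that wants a compact file sets obj.gut_on_save = true just before calling save(), while any save performed without setting the flag keeps scratch.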

