Find and reduce a numeric array with identical columns

Dear Sir/Madam,
I would like to ask you the following question:
I have a data file like this
tmp = [...
121 12 6914 0.5625
122 -48 6853 0.29688
119 48 6914 0.17188
125 -12 6853 0.078125
125 4 6853 0.4375
119 5 6832 0.20313
119 4 6832 0.039063
119 -4 6832 0.023438]
I would like re-group (or reduce) it with following conditions:
For any row, if column 1 AND column 3 of this row is identical with any column 1 AND column 3 of any other row. Then reduce to one new row with new value of column 2, this new value of column 2 is the sum of original values of column 2. Column 1 is kept the same, Column 4 is not important.
So, for above data, I expect to have the answer:
119 5 6832 0.20313 % 5+4-4=5
122 -48 6853 0.29688
125 -8 6853 0.4375 % -12+4=-8
121 12 6914 0.5625
119 48 6914 0.17188
What Matlab command to use? I would greatly appreciate it if you left your code and running output.
I am using MATLAB R2014a.
Thank you very much

댓글 수: 3

Stephen23
Stephen23 2018년 12월 30일
편집: Stephen23 2018년 12월 30일
The order of the rows in your output is not clearly defined. What is the rule to get that order?
For example, both 121 and 122 each only occur once in the first column, but in the output matrix are listed neither in the sequence that they occur in the input matrix, nor in numeric order. How is this order supposed to be determined?
I was wondering the same thing. Hopefully the order doesn't matter. I'm sure you could write the code afterwards in such a ways that it didn't matter.
John Smith
John Smith 2018년 12월 30일
편집: John Smith 2018년 12월 30일
In tmp (data file ), the order of the rows were randomly inputed by hand, no order at all.

댓글을 달려면 로그인하십시오.

 채택된 답변

Stephen23
Stephen23 2018년 12월 30일

0 개 추천

>> [~,X,Y] = unique(tmp(:,[1,3]),'rows');
>> out = tmp(X,:);
>> out(:,2) = accumarray(Y,tmp(:,2),[],@sum)
out =
119.000000 5.000000 6832.000000 0.023438
119.000000 48.000000 6914.000000 0.171880
121.000000 12.000000 6914.000000 0.562500
122.000000 -48.000000 6853.000000 0.296880
125.000000 -8.000000 6853.000000 0.437500

댓글 수: 7

John Smith
John Smith 2018년 12월 30일
편집: John Smith 2018년 12월 30일
Thank you very much Stephen,
I need further the output file "out" to be re-rewriten as classified by the value of first column and column 3 of "out":
  1. If the values of the third column are not the same. then separated them: according to the value of third column:
out_new(:,:,1) =
119 5 6832 0.20313
0 0 0 0
out_new(:,:,2) =
122 -48 6853 0.29688
125 -8 6853 0.078125
out_new(:,:,3) =
119 48 6914 0.17188
121 12 6914 0.5625
2 then the output rewritten according to the value of first column and aligned all group according to the value of the first column.
3. The value of column 1 should be decreased order in value.
I expect the output file as the following format:
125 -8 6853 0.078125
122 -48 6853 0.29688
121 12 6914 0.5625
119 5 6832 0.20313 48 6914 0.17188
By the way, my code to obtain out_new is (may be not so good:)
length(out(:,1));
[C, ia, ic] = unique(out(:,3),'sorted');
L = length(C)
for j=1 : L
k=0;
for i=1 : length(out(:,1))
if C(j) == out(i,3)
k=k+1;
out_new(k,1,j) = out(i,1) ;
out_new(k,2,j) = out(i,2) ;
out_new(k,3,j) = out(i,3) ;
out_new(k,4,j) = out(i,4) ;
end
end
end
format shortG
out_new
Thank you very much
Wow, this is starting to get very quirky and complicated. Can't you take it from here? Why do you need this super specialized format anyway? What's the use case?
John Smith
John Smith 2018년 12월 31일
편집: John Smith 2019년 1월 1일
Dear Sir/Madam,
Actually, I also need column 3 to be converted to date, then the requried output format is more meaningful: (column 4 is the costs of options)
out(:,3)
6832
6914
6914
6853
6853
datestr(out(:,3))
ans =
14-Sep-0018
05-Dec-0018
05-Dec-0018
05-Oct-0018
05-Oct-0018
then the requried output format should be:
125 -8 05-Oct-0018 0.078125
122 -48 05-Oct-0018 0.29688
121 12 05-Dec-0018 0.5625
119 5 14-Sep_0018 0.20313 48 05-Dec-0018 0.17188
Now you can see, the horizontal axis is time, the vertical axis (column 1) is strik-price of option.
In this way my option trading positions are:
e. g. for 119 strike price, I have two kinds of options: bought 5 options with expiration date 14-September, and bought 48 options with expirarion date 05-Dec.
for 122 strike price option, I have only one kind option: sold 48 options with expiration date 05-Oct.
That is why this kind of output format in important.
Thank you very much for your help.
[~,X,Y] = unique(tmp(:,[1,3]),'rows');
out = tmp(X,:);
out(:,2) = accumarray(Y,tmp(:,2),[],@sum)
% Convert to character:
baz = num2cell(out(:,2:4)).';
baz(2,:) = cellstr(datestr(out(:,3)));
baz = regexp(sprintf('%3d %s %g\v',baz{:}),'[^\v]+','match');
% Arrange into matrix:
[U,~,X1] = unique(out(:,1));
[V,~,X3] = unique(out(:,3));
C = accumarray([X1,X3],baz,[],@(s)s);
L = cellfun('length',C(:));
C(L==0) = {''};
C = flipud([num2cell(U),C]).'
% Print:
fmt = sprintf('%%-%ds',2+max(L));
fmt = repmat(fmt,1,numel(V));
fmt = ['%4d',fmt,'\n'];
fprintf(fmt,C{:})
Prints this:
125 -8 05-Oct-0018 0.4375
122 -48 05-Oct-0018 0.29688
121 12 05-Dec-0018 0.5625
119 5 14-Sep-0018 0.023438 48 05-Dec-0018 0.17188
Dear Stephen,
Thank you very much for your help, and Happy NEW YEAR 2019. Running this:
tmp = [...
121 12 6914 0.5625
122 -48 6853 0.29688
119 48 6914 0.17188
125 -12 6853 0.078125
125 4 6853 0.4375
119 5 6832 0.20313
119 4 6832 0.039063
119 -4 6832 0.023438]
[~,X,Y] = unique(tmp(:,[1,3]),'rows');
out = tmp(X,:);
out(:,2) = accumarray(Y,tmp(:,2),[],@sum)
% Convert to character:
baz = num2cell(out(:,2:4)).';
baz(2,:) = cellstr(datestr(out(:,3)));
baz = regexp(sprintf('%3d %s %g\v',baz{:}),'[^\v]+','match');
% Arrange into matrix:
[U,~,X1] = unique(out(:,1));
[V,~,X3] = unique(out(:,3));
C = accumarray([X1,X3],baz,[],@(s)s);
L = cellfun('length',C(:));
C(L==0) = {''};
C = flipud([num2cell(U),C]).'
% Print:
fmt = sprintf('%%-%ds',2+max(L));
fmt = repmat(fmt,1,numel(V));
fmt = ['%4d',fmt,'\n'];
fprintf(fmt,C{:})
I got this error:
Error using accumarray
Second input VAL must be a full numeric, logical, or char vector or scalar.
Error in main_ZN_Dec_7_2018_trade_4 (line 364)
C = accumarray([X1,X3],baz,[],@(s)s)
Stephen23
Stephen23 2019년 1월 1일
편집: Stephen23 2019년 1월 1일
Replace the line with accumarray with these three lines:
S = max([X1,X3]);
C = cell(S);
C(sub2ind(S,X1,X3)) = baz;
and a Happy New Year!
John Smith
John Smith 2019년 1월 2일
편집: John Smith 2019년 1월 3일
Dear Stephen,
By changing those three lines, you code works.
Back to the question in the very beginning, I said " Column 4 is not important". but now, I need to treat column 4 the same way as column 2:
tmp =
121 12 6914 0.5625
122 -48 6853 0.29688
119 48 6914 0.17188
125 -12 6853 0.078125
125 4 6853 0.4375
119 5 6832 0.20313
119 4 6832 0.039063
119 -4 6832 0.023438
out =
119 5 6832 0.265631 (%=0.20313+0.039063+0.023438)
119 48 6914 0.17188
121 12 6914 0.5625
122 -48 6853 0.29688
125 -8 6853 0.515625 (%=0.078125+0.4375)
how do you change your three line code:
[~,X,Y] = unique(tmp(:,[1,3]),'rows');
out = tmp(X,:);
out(:,2) = accumarray(Y,tmp(:,2),[],@sum)
I tried to modify the following way, it did not work:
out(:,2) = accumarray(Y,tmp(:,2), tmp(:,4), [],@sum)
However, when I use two lines (two time) then it worked:
out(:,2) = accumarray(Y,tmp(:,2),[],@sum)
out(:,4) = accumarray(Y,tmp(:,4),[],@sum)
Thank you very much.

댓글을 달려면 로그인하십시오.

추가 답변 (1개)

Image Analyst
Image Analyst 2018년 12월 30일
편집: Image Analyst 2018년 12월 30일
What about using grpstats(), if you have the Statistics and Machine Learning Toolbox.
tmp = [...
121 12 6914 0.5625
122 -48 6853 0.29688
119 48 6914 0.17188
125 -12 6853 0.078125
125 4 6853 0.4375
119 5 6832 0.20313
119 4 6832 0.039063
119 -4 6832 0.023438]
col5 = 10000*tmp(:, 1) + tmp(:, 3)
tmp = [tmp, col5];
% No sum in grpstats, so have to do it twice.
% Once to get the mean and once to get the count.
outputMean = grpstats(tmp, tmp(:, 5), 'mean')
outputNumel = grpstats(tmp, tmp(:, 5), 'numel')
% Crop off temporary 5th column
output = outputMean(:, 1:4) % Initialize
% Column 2 is the sum = mean * count
output(:, 2) = outputMean(:, 2) .* outputNumel(:, 2)
The output seems to be sorted by the first column though:
output =
119 5 6832 0.088544
119 48 6914 0.17188
121 12 6914 0.5625
122 -48 6853 0.29688
125 -8 6853 0.25781
That might be a problem for you. I'm not sure. Of course column 4 can be cropped off or ignored since you say it's not important.

카테고리

도움말 센터File Exchange에서 Repeated Measures and MANOVA에 대해 자세히 알아보기

제품

릴리스

R2014a

질문:

2018년 12월 30일

편집:

2019년 1월 3일

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by