Calculating group statistics using frequency weights
조회 수: 5 (최근 30일)
이전 댓글 표시
I have a large table of survey data with about 3 Mio. observations and 120 variables. The survey also contains a variable called "weights" with integers between 250 and 250'000. These weights are intended to render the sample representative. Hence I have to weight every observation for every calculation. I need a number of (weighted) group means, medians and percentiles.
How can I calculate summary statistics and group statistics, weighting the observations with the frequency weights?
Considering the size of the dataset, I am looking for a solution that works for a large dataset of dimensions I mentioned above. (If I needed the unweighted statistics, it can easily and efficiently be done with the groupsummary() function, I have however not found an option allowing for the weights.)
Thanks for your help
Chris
댓글 수: 5
Adam Danz
2019년 12월 18일
I added an answer that shows how to scale your inputs according to your weights.
답변 (1개)
Adam Danz
2019년 12월 17일
Since you integer weights are much too large to merely replicate values based on the weights, you can scale your data according to your weights. This is only 1 of many interpretations of applying weights.
There are several ways around this and the best method depends on how you're using the weights, what those weights mean, and the bounds of those weights. That's something you'll need to think about.
Here's my proposal.
% Create demo data
x = randi(10,1,20); % Main data: 20 random integers
w = randi(24750,size(x))+250; % random weights between 250 and 250000
This is the part you'll need to consider. The idea is to normalize your weights between [0,1] but keep in mind that a weight of 0 will completely eliminate a value.
% Normalize the weights
% If you know the upper and lower limits of the weights (Safer than alternative)
knownWeightBounds = [250,250000];
% Or maybe use
knownWeightBounds = [0,250000];
% or maybe
knownWeightBounds = [min(w), max(w)];
% Scale the weights to 0:1
wNorm = (w-knownWeightBounds(1))/range(knownWeightBounds);
Now scale your data according to the weights.
% Scale the data by normalized weights.
xScaled = x .* wNorm;
% compute whatever stats you want on xScaled
mu = mean(xScaled)
댓글 수: 0
참고 항목
카테고리
Help Center 및 File Exchange에서 Time Series에 대해 자세히 알아보기
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!