Calculating group statistics using frequency weights

clauper, 16 Dec 2019
Commented: Adam Danz, 18 Dec 2019
I have a large table of survey data with about 3 million observations and 120 variables. The survey also contains a variable called "weights" with integers between 250 and 250,000. These weights are intended to make the sample representative, so I have to weight every observation in every calculation. I need a number of (weighted) group means, medians and percentiles.
How can I calculate summary statistics and group statistics, weighting the observations with the frequency weights?
Considering the size of the dataset, I am looking for a solution that works for a dataset of the dimensions mentioned above. (If I needed the unweighted statistics, this could be done easily and efficiently with the groupsummary() function; however, I have not found an option that allows for weights.)
Thanks for your help
Chris
Comments: 5
clauper, 17 Dec 2019
Thanks for your comment. I have updated the question to give more context.
Adam Danz, 18 Dec 2019
I added an answer that shows how to scale your inputs according to your weights.

Answers (1)

Adam Danz, 17 Dec 2019
Since your integer weights are much too large to simply replicate values according to the weights, you can scale your data by the weights instead. This is only one of many ways to interpret applying weights.
There are several ways around this and the best method depends on how you're using the weights, what those weights mean, and the bounds of those weights. That's something you'll need to think about.
Here's my proposal.
% Create demo data
x = randi(10,1,20); % Main data: 20 random integers
w = randi([250, 250000], size(x)); % Random integer weights between 250 and 250000
This is the part you'll need to consider. The idea is to normalize your weights to the interval [0,1], but keep in mind that a weight of 0 will completely eliminate a value.
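% Normalize the weights
% Option 1: use the known lower and upper limits of the weights (safer than the alternatives)
knownWeightBounds = [250, 250000];
% Option 2: use 0 as the lower bound
% knownWeightBounds = [0, 250000];
% Option 3: use the observed range of the weights
% knownWeightBounds = [min(w), max(w)];
% Scale the weights to the interval [0,1]
% (diff of the two bounds equals range(knownWeightBounds) and needs no toolbox)
wNorm = (w - knownWeightBounds(1)) / diff(knownWeightBounds);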
Now scale your data according to the weights.
% Scale the data by normalized weights.
xScaled = x .* wNorm;
% Compute whatever stats you want on xScaled
mu = mean(xScaled) % no semicolon, so the result is displayed
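If the weights are read strictly as frequency weights (each observation stands for w identical observations), a different interpretation from the scaling above is to compute the statistics directly from the weights without replicating any rows. The following is only a minimal sketch under that assumption, not a tested solution for the full survey table. The data, the variable names and the helper weightedQuantile are made up for illustration, and the script assumes R2017a or later (string arrays and local functions in scripts).

```matlab
% Sketch: frequency-weighted group mean and median (illustrative data and names)
x   = [1; 2; 3; 10; 20; 30];          % observations
w   = [1; 1; 2;  3;  1;  1];          % frequency weights (hypothetical values)
grp = ["a"; "a"; "a"; "b"; "b"; "b"]; % grouping variable

[G, groupNames] = findgroups(grp);    % integer group index for each row

% Weighted mean per group: sum(w.*x) / sum(w), accumulated by group index
wMean = accumarray(G, w .* x) ./ accumarray(G, w);

% Weighted median per group (p = 0.5); use another p for other percentiles
wMedian = splitapply(@(xv, wv) weightedQuantile(xv, wv, 0.5), x, w, G);

disp(table(groupNames, wMean, wMedian))

function q = weightedQuantile(x, w, p)
% Hypothetical helper: smallest x whose cumulative weight share reaches p
    [xs, idx] = sort(x(:));           % sort the observations
    w  = w(:);
    cw = cumsum(w(idx));              % cumulative weight in sorted order
    cw = cw / cw(end);                % convert to cumulative share in (0,1]
    q  = xs(find(cw >= p, 1, 'first'));
end
```

With a survey table, the same calls could be fed the corresponding columns (for example a value column, the "weights" column and a grouping column). The quantile here is the inverse of the weighted empirical distribution, so for an even total weight it can differ slightly from median() applied to replicated data, which averages the two middle values. For small weights, the result can be cross-checked against repelem(x, w).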
