Sampling from a population so that the sample has a target mean/median.

Hi all,
Is there a direct way to randomly draw samples from an existing population multiple times so that each sample has a target mean or median for a given metric?
Example:
Dataset = 400 observations X 10 metrics
I want to draw 20 observations N times (with replacement) so that the mean or median of metric 1 in each 20-observation sample equals a target value, and then calculate the mean of the other 9 metrics for each of these 20-observation samples.
Is there a way to do such sampling in Matlab?
Thank you!
  Comments: 3
Antonios Asiminas on 18 Apr 2022
Apologies for the confusion.
What I mean is: sample from Dataset(:,1) and get the indexes of the elements of each sample (idx) so that each sample has mean(Dataset(idx, 1)) == target, then calculate mean(Dataset(idx, 2)), mean(Dataset(idx, 3)),... mean(Dataset(idx, 10)). Repeat that N times.
The "or median" was meant to cover the possibility that there is a solution to this problem with a median target rather than a mean target.
I thought of trying a while loop: draw a sample, check whether its mean is the target (or close enough), and then calculate the means of the other metrics. This is not a nice and certainly not a fast solution, though...
I hope this makes more sense now.
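The while-loop idea described in this comment could be sketched as follows. This is only a minimal rejection-sampling sketch, not a recommended solution: the dummy Dataset, the values of target, tol, N and k, and the names otherMeans, sampleIdx and maxTries are all assumptions made for illustration.

```matlab
% Rejection-sampling sketch: redraw until the mean of metric 1 in the
% sample is within tol of the target. All values below are assumed.
rng(0);                        % fixed seed so the run is repeatable
Dataset  = rand(400, 10);      % dummy data: 400 observations x 10 metrics
target   = 0.55;               % desired mean of metric 1
tol      = 0.01;               % what counts as "close enough"
N        = 50;                 % number of accepted samples to collect
k        = 20;                 % observations per sample
maxTries = 1e6;                % safety cap on the total number of draws

otherMeans = zeros(N, 9);      % means of metrics 2..10, one row per sample
sampleIdx  = zeros(N, k);      % row indexes of each accepted sample
accepted   = 0;
tries      = 0;
while accepted < N && tries < maxTries
    tries = tries + 1;
    idx = randi(size(Dataset, 1), 1, k);            % draw with replacement
    if abs(mean(Dataset(idx, 1)) - target) <= tol
        accepted = accepted + 1;
        sampleIdx(accepted, :)  = idx;
        otherMeans(accepted, :) = mean(Dataset(idx, 2:10), 1);
    end
end
```

The further the target is from the population mean, and the smaller tol is, the more draws are rejected, which is why this approach can become slow.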
Image Analyst on 18 Apr 2022
Not really. How would you compute idx? And I'm not sure what N is when you say that you need to get the 10 means N times.
I'm thinking what you really want is like what Bruno said where you'd get a list of indexes that match a target or are in a target range, and then get the means for the other columns, so like
rowsOfInterest = Dataset(:, 1) == target;
theMeans = mean(Dataset(rowsOfInterest, :), 1);
That would give you the mean of all columns but only for those rows where column 1 is your target value.
Of course you could also use grpstats() or groupsummary() or splitapply() to do the same thing.
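For floating-point data, an exact == comparison rarely matches any row, so the "target range" variant mentioned above may be more practical. A minimal sketch, assuming a dummy Dataset and an arbitrarily chosen tolerance tol (neither is from the original post):

```matlab
% Select rows whose column-1 value lies within tol of the target,
% then average every column over those rows. Values are assumed.
rng(1);
Dataset = rand(400, 10);       % dummy data: 400 observations x 10 metrics
target  = 0.5;                 % target value for column 1
tol     = 0.05;                % half-width of the accepted range

rowsOfInterest = abs(Dataset(:, 1) - target) <= tol;   % logical row mask
if any(rowsOfInterest)
    theMeans = mean(Dataset(rowsOfInterest, :), 1);    % 1 x 10 row vector
else
    theMeans = nan(1, size(Dataset, 2));               % no row in range
end
```

Note that this selects rows individually near the target rather than drawing random samples whose mean hits the target, so it answers a slightly different question than the one originally asked.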


Accepted Answer

Bruno Luong on 18 Apr 2022
Edited: Bruno Luong on 18 Apr 2022
This is for the mean (EDIT: code updated to fix a bug)
% Generate 1e6 dummy test data points
A=10+sum(30*rand(1e6,3),2);
% number of samples to draw from A
m = 1000;
targetmean=35; % target mean
meanA = mean(A);
% sortDir is used instead of dir to avoid shadowing the built-in dir function
if targetmean > meanA
    sortDir = 'ascend';
else
    sortDir = 'descend';
end
As = sort(A, sortDir);
A1 = As(1);
Aend = As(end);
t = (As-A1)/(Aend-A1);
nA = length(A);
% If targetmean is not attainable, fzero errors here or returns NaN for p
EspFun = @(p) sum(As.*t.^p) / sum(t.^p);
p = fzero(@(p) EspFun(p)-targetmean, [0 1000]);
if isnan(p)
    error('target mean not possible with this formulation')
end
idx = ceil(nA*rand(1,m).^(1/(p+1)));
Asubsample = As(idx);
% Check
mean(Asubsample)
ans = 34.5036
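To connect this back to the original question (means of the other metrics for each sample), one possible adaptation keeps the sort permutation so that the drawn indexes can be mapped back to rows of the full dataset. This is a sketch under assumptions, not part of the answer above: the dummy Dataset, the names order, sortDir, col1Means and otherMeans, and the values of N and m are all made up for illustration.

```matlab
% Apply the weighted index draw to a multi-column dataset (sketch).
rng(2);
Dataset    = 10 + 30*rand(5000, 10);  % dummy data, metric 1 in column 1
targetmean = 30;                      % target mean for metric 1 (assumed)
m = 20;                               % observations per sample
N = 200;                              % number of samples to draw

if targetmean > mean(Dataset(:, 1))
    sortDir = 'ascend';
else
    sortDir = 'descend';
end
[As, order] = sort(Dataset(:, 1), sortDir);   % order maps sorted -> original rows
t  = (As - As(1)) / (As(end) - As(1));
nA = numel(As);
EspFun = @(p) sum(As .* t.^p) / sum(t.^p);
p = fzero(@(p) EspFun(p) - targetmean, [0 1000]);

col1Means  = zeros(N, 1);             % achieved mean of metric 1 per sample
otherMeans = zeros(N, 9);             % means of metrics 2..10 per sample
for k = 1:N
    idx  = ceil(nA * rand(1, m).^(1/(p+1)));  % weighted draw, with replacement
    rows = order(idx);                        % rows of the original Dataset
    col1Means(k)     = mean(Dataset(rows, 1));
    otherMeans(k, :) = mean(Dataset(rows, 2:end), 1);
end
```

Each individual sample mean in col1Means scatters around the target rather than hitting it exactly; only the average over many samples is expected to be close to targetmean, and only when the population is large enough for the method's assumptions to hold.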
  Comments: 2
Bruno Luong on 18 Apr 2022
Edited: Bruno Luong on 18 Apr 2022
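If you want to draw without replacement, you have to set m larger than the desired subsample size (cardinality):
desiredsubsamplecardinal = 100;
m = ceil(1.1*desiredsubsamplecardinal);
then later do
% 'stable' keeps the random draw order; the default sorted output would
% make the truncation below always keep the lowest indexes
uidx = unique(idx, 'stable');
if length(uidx) >= desiredsubsamplecardinal
    Asubsample = As(uidx(1:desiredsubsamplecardinal));
else
    % not enough unique indexes: redraw idx with a larger m and retry
end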
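The retry step could be written as a bounded loop. The following is a minimal sketch under assumptions: As is a dummy sorted population, p is fixed at an assumed value instead of being computed with fzero, and maxRetries and the growth factor 1.5 are arbitrary choices.

```matlab
% Draw unique indexes, oversampling more on each failed attempt (sketch).
rng(3);
As = sort(10 + 30*rand(1e4, 1));      % dummy sorted population
nA = numel(As);
p  = 1;                               % assumed exponent (normally from fzero)
desiredsubsamplecardinal = 100;
m  = ceil(1.1 * desiredsubsamplecardinal);
maxRetries = 100;                     % safety cap on the number of attempts

Asubsample = [];
for attempt = 1:maxRetries
    idx  = ceil(nA * rand(1, m).^(1/(p+1)));
    uidx = unique(idx, 'stable');     % drop repeats, keep draw order
    if numel(uidx) >= desiredsubsamplecardinal
        Asubsample = As(uidx(1:desiredsubsamplecardinal));
        break
    end
    m = ceil(1.5 * m);                % too many duplicates: oversample more
end
if isempty(Asubsample)
    error('could not collect enough unique indexes');
end
```

Removing duplicates slightly changes the effective weighting, so the achieved mean may drift from the target more than in the with-replacement case, especially when the subsample is large relative to the population.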
Bruno Luong on 18 Apr 2022
Edited: Bruno Luong on 19 Apr 2022
This is for the median (EDIT: code updated to fix a bug)
% Generate 1e6 dummy test data points
A=10+sum(30*rand(1e6,3),2);
% number of samples to draw from A
m = 1000;
targetmedian=35; % target median
medianA = median(A);
nA = length(A);
% sortDir is used instead of dir to avoid shadowing the built-in dir function
if medianA > targetmedian
    sortDir = 'ascend';
    thalf = sum(A <= targetmedian) / nA;
else
    sortDir = 'descend';
    thalf = sum(A >= targetmedian) / nA;
end
As = sort(A, sortDir);
p = max(log(thalf)/log(0.5),0);
idx = ceil(nA*rand(1,m).^p);
Asubsample = As(idx);
% Check
median(Asubsample)
ans = 35.5359
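The choice of exponent in the median code can be read as follows (my reading of the code, not an explanation given by the author). With U uniform on (0,1), the drawn index divided by nA is approximately U^p, and thalf is the fraction of the sorted data lying on the near side of the target, so the median index should land at thalf:

```latex
% U is uniform on (0,1); idx / nA is approximately U^p for large nA
\operatorname{median}\!\left(U^{p}\right) = \left(\tfrac{1}{2}\right)^{p}
\quad \text{(since } x \mapsto x^{p} \text{ is increasing for } p > 0\text{)}

\left(\tfrac{1}{2}\right)^{p} = t_{\text{half}}
\;\Longrightarrow\;
p = \frac{\log t_{\text{half}}}{\log 0.5}
```

The max(...,0) in the code only guards against a negative exponent; the sample median still fluctuates around the target for finite m, as the check output shows.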


More Answers (1)

Image Analyst on 18 Apr 2022
No, not in general. The target mean may not exist in the population. If my target mean weight is 10 kg, and my population is the weights of elephants, I cannot sample the weights of the elephants and ever get a weight of 10 kg.
Even if the target values are definitely in your population, you can't guarantee any particular mean if you pick samples at random. You may, however, get close to the population mean. If you want a specific mean, I think you'd have to sort your population and then sample around the target mean, but then you're no longer sampling randomly.
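The first point can be checked before any sampling is attempted: a sample mean always lies between the smallest and largest values in the population. A minimal sketch, where the weights vector is hypothetical data invented for the elephant example:

```matlab
% Feasibility check: a sample mean cannot fall outside [min, max].
weights = [2700 3100 4500 5200 6000];   % hypothetical elephant weights, kg
target  = 10;                           % target mean weight, kg
isAttainable = target >= min(weights) && target <= max(weights);
if ~isAttainable
    disp('Target mean lies outside the population range.');
end
```

Passing this check is necessary but not sufficient; as noted above, random sampling still does not guarantee the target mean.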
  Comments: 9
Beorn Nijenhuis on 9 Aug 2022
@Bruno Luong This was a helpful script for me. I had a cohort of n=280 samples of people's ages (15 < age < 75) and needed a subsample of 60 with a mean of 50. It worked. I have a question though: when I run this code I notice a tolerance in the output. With n=280 the tolerance was approximately ±7 years. How is this tolerance calculated, so that I can write it into the methods section of my paper?
Bruno Luong on 9 Aug 2022
I don't know how it is calculated, but it's due to the fact that the sample is too sparse. My method assumes the data are well approximated by a uniform distribution, so it works well when one has a large sample of data.
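Since no closed-form tolerance is given in this thread, one option is to estimate the spread empirically by repeating the draw many times and summarising the resulting sample means. This is a hedged sketch, not a method endorsed by the answer's author: the ages vector is dummy data, and nRep, spreadSD, loBound and hiBound are names and values chosen for illustration.

```matlab
% Monte Carlo estimate of how much the sample mean varies (sketch).
rng(4);
ages       = 15 + 60*rand(280, 1);    % dummy cohort: 280 ages in (15, 75)
targetmean = 50;                      % target mean age
m    = 60;                            % subsample size
nRep = 2000;                          % number of repeated draws (assumed)

if targetmean > mean(ages)
    sortDir = 'ascend';
else
    sortDir = 'descend';
end
As = sort(ages, sortDir);
t  = (As - As(1)) / (As(end) - As(1));
nA = numel(As);
EspFun = @(p) sum(As .* t.^p) / sum(t.^p);
p = fzero(@(p) EspFun(p) - targetmean, [0 1000]);

sampleMeans = zeros(nRep, 1);
for r = 1:nRep
    idx = ceil(nA * rand(1, m).^(1/(p+1)));   % weighted draw, with replacement
    sampleMeans(r) = mean(As(idx));
end

spreadSD = std(sampleMeans);                  % standard deviation of the means
sortedMeans = sort(sampleMeans);
loBound = sortedMeans(max(1, round(0.025*nRep)));     % approx. 2.5th percentile
hiBound = sortedMeans(min(nRep, round(0.975*nRep)));  % approx. 97.5th percentile
fprintf('mean of means %.2f, SD %.2f, 95%% range [%.2f, %.2f]\n', ...
        mean(sampleMeans), spreadSD, loBound, hiBound);
```

The reported standard deviation or percentile range describes the variability for this particular population, subsample size and target only; it would need to be recomputed on the real cohort.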


Release

R2021a
