Sampling from a population so that the sample has a target mean/median.

Question

Antonios Asiminas on 18 Apr 2022

https://kr.mathworks.com/matlabcentral/answers/1699215-sampling-from-a-population-so-that-the-sample-has-a-target-mean-median

Commented: Bruno Luong on 9 Aug 2022

Hi all,

Is there a direct way to randomly draw samples from an existing population multiple times so that each sample has a target mean or median for a given metric?

Example:

Dataset = 400 observations X 10 metrics

I want to sample 20 observations N times (with replacement) so that the mean or median of metric 1 in each 20-observation sample equals a target value, and then calculate the mean of the other 9 metrics for each of these 20-observation samples.

Is there a way to do such sampling in Matlab?

Thank you!

3 Comments

Antonios Asiminas on 18 Apr 2022

Apologies for the confusion.

What I mean is: sample from Dataset(:,1) and get the indices of the elements of each sample (idx) so that mean(Dataset(idx, 1)) == target, then calculate mean(Dataset(idx, 2)), mean(Dataset(idx, 3)), ..., mean(Dataset(idx, 10)). Repeat that N times.

The "or median" was there to cover the possibility that this problem has a solution for a median target rather than a mean target.

I thought of trying a while loop: draw a sample, check whether its mean is the target (or close enough), and then calculate the means of the other metrics. This is not a nice and certainly not a fast solution though...

I hope this makes more sense now.
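
For reference, the while-loop idea described above could look roughly like the sketch below. This is only an illustration under stated assumptions: the Dataset here is random stand-in data, and target, tol, k, N and maxTries are hypothetical values chosen for the example, not values from this thread. Note that accepting only draws whose mean lands near the target means the accepted samples are no longer plain random samples of the population.

```matlab
% Hypothetical stand-in data: 400 observations x 10 metrics
rng(0);                          % fixed seed so the sketch is repeatable
Dataset = randn(400, 10) + 5;

target   = 5.1;   % target mean for metric 1 (assumed value)
tol      = 0.02;  % accept a sample if its mean is within +/- tol (assumed)
k        = 20;    % observations per sample
N        = 50;    % number of accepted samples wanted
maxTries = 1e6;   % safety cap so the loop cannot run forever

otherMeans = nan(N, 9);   % means of metrics 2..10, one row per accepted sample
nAccepted  = 0;
nTries     = 0;
while nAccepted < N && nTries < maxTries
    nTries = nTries + 1;
    idx = randi(size(Dataset, 1), k, 1);              % draw with replacement
    if abs(mean(Dataset(idx, 1)) - target) <= tol     % close enough to target?
        nAccepted = nAccepted + 1;
        otherMeans(nAccepted, :) = mean(Dataset(idx, 2:end), 1);
    end
end
fprintf('Accepted %d samples in %d draws\n', nAccepted, nTries);
```

The further the target is from the population mean, or the smaller tol is, the more draws are rejected, which is why this approach can be slow.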

Image Analyst on 18 Apr 2022

Not really. How would you compute idx? And I'm not sure what N is when you say that you need to get the 10 means N times.

I'm thinking what you really want is something like what Bruno said, where you'd get a list of indexes that match a target or are in a target range, and then get the means for the other columns, like this:

rowsOfInterest = Dataset(:, 1) == target;
theMeans = mean(Dataset(rowsOfInterest, :), 1);

That would give you the mean of all columns but only for those rows where column 1 is your target value.

Of course you could also use grpstats() or groupsummary() or splitapply() to do the same thing.
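
A minimal sketch of the "target range" variant mentioned above, since an exact == comparison rarely matches floating-point data. The Dataset, target and tol values are assumptions made up for illustration.

```matlab
% Hypothetical stand-in data: 400 observations x 10 metrics
rng(1);
Dataset = rand(400, 10);

target = 0.5;    % assumed target value for column 1
tol    = 0.05;   % assumed half-width of the accepted range

% Keep rows whose column-1 value lies within target +/- tol
rowsOfInterest = abs(Dataset(:, 1) - target) <= tol;

% Column means over the selected rows only (1-by-10 row vector)
theMeans = mean(Dataset(rowsOfInterest, :), 1);
```

If no row falls inside the range, theMeans comes back as NaN, so it is worth checking any(rowsOfInterest) first.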


Answer 1

Bruno Luong on 18 Apr 2022

https://kr.mathworks.com/matlabcentral/answers/1699215-sampling-from-a-population-so-that-the-sample-has-a-target-mean-median#answer_945290

Edited: Bruno Luong on 18 Apr 2022

This is for the "mean" case (EDIT: code updated to fix a bug)

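% Generate dummy test data with 1e6 values
A = 10 + sum(30*rand(1e6,3), 2);
% Number of subsamples to be drawn from A
m = 1000;
targetmean = 35; % target mean
meanA = mean(A);
% Sort so that larger indices move the mean toward the target
% (named sortDir to avoid shadowing the built-in dir function)
if targetmean > meanA
    sortDir = 'ascend';
else
    sortDir = 'descend';
end
As = sort(A, sortDir);
A1 = As(1);
Aend = As(end);
t = (As-A1)/(Aend-A1); % sorted values rescaled to [0,1]
nA = length(A);
% Find the exponent p whose weights t.^p give the target weighted mean.
% If targetmean is not attainable, fzero errors here or returns NaN for p
EspFun = @(p) sum(As.*t.^p) / sum(t.^p);
p = fzero(@(p) EspFun(p)-targetmean, [0 1000]);
if isnan(p)
    error('target mean not possible with this formulation')
end
% Draw m indices with replacement; larger p favors larger indices
idx = ceil(nA*rand(1,m).^(1/(p+1)));
Asubsample = As(idx);
% Check
mean(Asubsample)
ans = 34.5036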
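
To connect this with the original question (a 400 x 10 Dataset, samples of 20, means of the other 9 metrics), one possible adaptation is sketched below. It is an assumption-laden illustration, not part of the answer: the Dataset is random stand-in data, targetmean, k and N are made-up values, and the only new step is keeping the sort permutation (order) so that sampled positions can be mapped back to rows of Dataset.

```matlab
% Hypothetical stand-in data: 400 observations x 10 metrics
rng(2);
Dataset = 10 + 30*rand(400, 10);

k = 20;            % observations per sample
N = 1000;          % number of samples
targetmean = 28;   % assumed target mean for metric 1

x = Dataset(:, 1);
if targetmean > mean(x)
    sortDir = 'ascend';
else
    sortDir = 'descend';
end
[xs, order] = sort(x, sortDir);            % order maps sorted positions to rows
t  = (xs - xs(1)) / (xs(end) - xs(1));
nX = numel(xs);
EspFun = @(p) sum(xs .* t.^p) / sum(t.^p);
p = fzero(@(p) EspFun(p) - targetmean, [0 1000]);

idx  = ceil(nX * rand(k, N).^(1/(p+1)));   % k-by-N positions in sorted order
rows = order(idx);                         % k-by-N row numbers in Dataset

metric1Means = mean(x(rows), 1);           % 1-by-N means of metric 1
otherMeans = zeros(N, 9);                  % means of metrics 2..10 per sample
for j = 1:N
    otherMeans(j, :) = mean(Dataset(rows(:, j), 2:end), 1);
end
```

With only 20 observations per sample, individual values in metric1Means scatter around the target; it is their average over many samples that should sit close to it.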

2 Comments

Bruno Luong on 18 Apr 2022

Edited: Bruno Luong on 18 Apr 2022

If you want to draw without replacement, you have to set m larger than the desired subsample size (cardinality):

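desiredsubsamplecardinal = 100;
m = ceil(1.1*desiredsubsamplecardinal);

then later do

% 'stable' keeps the random draw order; plain unique() sorts the indices,
% so truncating would keep only the smallest ones and bias the subsample
Asubsample = As(unique(idx, 'stable'));
if length(Asubsample) >= desiredsubsamplecardinal
    Asubsample = Asubsample(1:desiredsubsamplecardinal);
else
    % too few unique indices: redraw idx and retry
end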

Bruno Luong on 18 Apr 2022

Edited: Bruno Luong on 19 Apr 2022

This is for the "median" case (EDIT: code updated to fix a bug)

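% Generate dummy test data with 1e6 values
A = 10 + sum(30*rand(1e6,3), 2);
% Number of subsamples to be drawn from A
m = 1000;
targetmedian = 35; % target median
medianA = median(A);
nA = length(A);
% thalf is the fraction of the sorted data lying before the target value
% (named sortDir to avoid shadowing the built-in dir function)
if medianA > targetmedian
    sortDir = 'ascend';
    thalf = sum(A <= targetmedian) / nA;
else
    sortDir = 'descend';
    thalf = sum(A >= targetmedian) / nA;
end
As = sort(A, sortDir);
% Exponent chosen so that half of the drawn indices fall before thalf*nA
p = max(log(thalf)/log(0.5),0);
idx = ceil(nA*rand(1,m).^p);
Asubsample = As(idx);
% Check
median(Asubsample)
ans = 35.5359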
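
A short derivation of where the exponent p in the median code comes from, as I read the code (this is my reconstruction, not something stated in the answer). It treats the drawn index as a continuous fraction of nA and ignores the rounding done by ceil. Here U stands for one value of rand and t_half for thalf.

```latex
% U ~ Uniform(0,1); the sampled position, as a fraction of nA, is F = U^p.
\begin{align*}
  P(F \le \theta) &= P\!\left(U^{p} \le \theta\right)
                   = P\!\left(U \le \theta^{1/p}\right)
                   = \theta^{1/p}, \qquad 0 < \theta \le 1,\; p > 0. \\
\intertext{The sample median equals the target when half of the draws fall
before the position of the target in the sorted data, i.e.\ at
$\theta = t_{\mathrm{half}}$:}
  t_{\mathrm{half}}^{\,1/p} &= \tfrac{1}{2}
  \quad\Longrightarrow\quad
  p = \frac{\log t_{\mathrm{half}}}{\log 0.5}.
\end{align*}
```

Because the sort direction is chosen so that thalf is at most 0.5, this gives p of at least 1, and the max(..., 0) in the code only guards against a negative exponent.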


Answer 2

Image Analyst on 18 Apr 2022

https://kr.mathworks.com/matlabcentral/answers/1699215-sampling-from-a-population-so-that-the-sample-has-a-target-mean-median#answer_945255

No, not in general. The target mean may not exist in the population. If my target mean weight is 10 kg, and my population is the weights of elephants, I cannot sample the weights of the elephants and ever get a weight of 10 kg.

Even if the targets are definitely in your population, you can't guarantee any particular mean if you randomly pick some samples. You may however get close to the population mean. If you want a specific mean, I think you'd have to sort your population and then sample around the target mean, but then you're no longer doing it randomly.

9 Comments

Antonios Asiminas on 19 Apr 2022

Thank you again for the suggestions, discussion and code.

The use case is the following: I have several observations (100s) of a population for two time points. I have calculated X metrics for each observation. When I calculate the statistics for each metric/time point, I get a change (subtle but robust) in a few metrics between timepoints.

Now the question I have is: could the changes in the majority of metrics be secondary to the change in one target metric?

To address this question, I am planning to take the mean of the target metric at my first timepoint, then repeatedly draw a smaller sample (20-30 observations) from the second timepoint that has the same mean as the baseline for the target metric, and then calculate the means of the remaining metrics of interest.

If the changes in the target metric are primary, then the distribution of means for the other metrics is not going to differ between timepoints.

I suspect I can reach a similar conclusion by just correlating the target metric and each of the other metrics?

A

Beorn Nijenhuis on 9 Aug 2022

@Bruno Luong This was a helpful script for me. I had a cohort of n=280 samples of people's ages (15 < age < 75) and needed a subsample of 60 with a mean of 50. It worked. I have a question though: when I run this code I notice a tolerance in the output. With n=280 the tolerance was approximately ±7 years. How is this tolerance calculated, so that I can write it into the methods section of my paper?

Bruno Luong on 9 Aug 2022

I don't know how it is calculated, but it's due to the fact that the sample is too sparse. My method assumes the data are well approximated by a uniform distribution, so it works well when one has a big sample of data.
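
One empirical way to put a number on that spread is to repeat the draw many times and look at how the subsample means scatter. The sketch below is an assumption, not a formula from this thread: the ages vector is uniform random stand-in data for the n=280 cohort, nRep is an arbitrary choice, and the draws are with replacement as in the answer's code.

```matlab
% Hypothetical stand-in cohort: 280 ages between 15 and 75
rng(3);
ages = 15 + 60*rand(280, 1);

m = 60;            % subsample size
targetmean = 50;   % target mean age
nRep = 2000;       % number of repeated draws (assumed)

if targetmean > mean(ages)
    sortDir = 'ascend';
else
    sortDir = 'descend';
end
As = sort(ages, sortDir);
t  = (As - As(1)) / (As(end) - As(1));
nA = numel(As);
EspFun = @(p) sum(As .* t.^p) / sum(t.^p);
p = fzero(@(p) EspFun(p) - targetmean, [0 1000]);

idx = ceil(nA * rand(m, nRep).^(1/(p+1)));   % one column per repeated draw
subMeans = mean(As(idx), 1);                 % mean age of each subsample

fprintf('mean of subsample means: %.2f\n', mean(subMeans));
fprintf('std of subsample means:  %.2f\n', std(subMeans));
fprintf('range of subsample means: [%.2f, %.2f]\n', min(subMeans), max(subMeans));
```

The standard deviation (or the range) of subMeans is then a directly reportable description of the run-to-run variation for that cohort and subsample size.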
