Which ANOVA test and how to use it?
Good afternoon everyone,
I would like to use an ANOVA test but unfortunately I do not know which one to use.
I have attached an Excel file with the data.
For instance,
I would like to know the relevance when thickness and orientation are involved. These are the data of 9 individuals with 5 repetitions.
The correct/not column represents whether the participants found the correct answer or not: correct = 1 and not = 0.
28 Comments
Adam Danz
12 Jul 2022
> I would like to know the relevance when Thickness and orientation are involved.
What is relevance? How is that measured? If you are looking for accuracy or precision, computed from the "correct/not" column, then you don't need an ANOVA for that.
More importantly, what is the falsifiable question you're asking? Or, what is your null hypothesis?
Franck paulin Ludovig pehn Mayo
12 Jul 2022
Edited: Franck paulin Ludovig pehn Mayo
12 Jul 2022
Adam Danz
12 Jul 2022
I assume the "recognition rate" is the same as accuracy which is indicated by a "1" in the correct/not column. Let me rephrase your goal with how I interpret it and you can let me know if my interpretation is incorrect.
You've got two independent variables (thickness and orientation). Thickness is on a continuous scale and has 6 levels; orientation is categorical and has 3 levels (horz, vert, control). You've got 1 dependent variable which is binary, true/false, that describes some kind of decision, so true (1) means correct and false (0) means not-correct.
There are 9 participants with 5 reps and 18 conditions (6*3) which would result in 810 data points (rows of table) if all participants repeated all conditions 5 times but I only see 721 rows of data.
I still don't know the research question that motivated this design so I can only guess at the null hypothesis. In general, the order of events in a research project is
- define the question (sometimes the hardest part)
- define the null hypothesis given the question
- decide on methodology given the question and null hypothesis
- collect data
- analyze and interpret
For example, perhaps thickness or orientation is the main variable under question while the other one is a control condition that is not expected to have an effect. Or perhaps you're wondering whether the horizontal and vertical orientations statistically differ from the control orientation condition. Another question might involve individual differences between participants. Each of those may require completely different statistical tests.
It sounds like you want to know if there is a statistical difference between some groups and, given the groups are similar, the difference might be small, but you need to find out whether the small differences are significant. If you provide more detail on the question you're asking (and the null hypothesis would be nice to know, too), I can help further.
Franck paulin Ludovig pehn Mayo
12 Jul 2022
Edited: Franck paulin Ludovig pehn Mayo
12 Jul 2022
Adam Danz
12 Jul 2022
Oh, I see. I just looked at the number of unique values and assumed they were fully nested. My bad.
Could I convince you to use bootstrapped confidence intervals instead of a parametric test such as a t-test or ANOVA? The benefits are that non-parametric tests do not carry the same assumptions that parametric tests do, they are easier to read, and they rely less on subjective thresholds such as p-values. With confidence intervals, you can directly see whether they overlap or not.
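A minimal sketch of that comparison, using simulated 0/1 accuracy data for two made-up conditions (bootci requires the Statistics and Machine Learning Toolbox):

```matlab
% Simulated 0/1 accuracy data for two hypothetical conditions
rng(0)
condA = double(rand(40,1) < 0.80);   % roughly 80% correct
condB = double(rand(40,1) < 0.55);   % roughly 55% correct

% 95% percentile-bootstrap CIs of the mean accuracy for each condition
ciA = bootci(1000, {@mean, condA}, 'Type', 'per');
ciB = bootci(1000, {@mean, condB}, 'Type', 'per');

% Non-overlapping intervals suggest the means come from different distributions
overlap = ciA(1) <= ciB(2) && ciB(1) <= ciA(2)
```

If `overlap` is false, the two conditions' mean accuracies are credibly different; no p-value threshold is needed.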
Franck paulin Ludovig pehn Mayo
12 Jul 2022
Adam Danz
12 Jul 2022
What conditions are you comparing? You mentioned that you made a bar chart. If it's labeled, you could just share a screenshot of it, and that would probably answer the question.
Franck paulin Ludovig pehn Mayo
12 Jul 2022
I don't know how you made this bar plot or how you computed the means. The only thickness that nests with both horizontal and vertical orientations is 0.04. You mentioned in a previous comment that some thickness values could be combined, but that explanation was confusing. For example, you mentioned 0.02=0.04, but those are treated as separate conditions, and the data in your plot show that they have different summary values.
T = readtable('https://www.mathworks.com/matlabcentral/answers/uploaded_files/1062725/Anovan_Ptestdata.xlsx','VariableNamingRule','preserve');
Tc = groupcounts(T,["thicknesss","orientation"])
Franck paulin Ludovig pehn Mayo
12 Jul 2022
Franck paulin Ludovig pehn Mayo
13 Jul 2022
Adam Danz
13 Jul 2022
I have some ideas on how to proceed but am short on free time. My suggestion is to compute bootstrapped confidence intervals using bootci (I recommend setting "type" to "per"). This will be performed for each condition and will provide the confidence bounds. If the bounds do not overlap, you can conclude that the means (or whatever statistic you choose) come from different distributions. I demo'd this approach in this comment.
Franck paulin Ludovig pehn Mayo
13 Jul 2022
Adam Danz
15 Jul 2022
See my answer below.
Scott MacKenzie
17 Jul 2022
Edited: Scott MacKenzie
17 Jul 2022
@Franck paulin Ludovig pehn Mayo, oops, I meant to post my comment here, not as a comment to Adam's answer. In any event, thanks for your response, which is below.
But, could I ask that you re-post the modified Excel file and include the chart (generated from the data)? I've fiddled with the data but cannot seem to duplicate the chart you posted. Before doing the ANOVA, it's important that we are on the same page (i.e., that my grouped bar chart looks like yours).
Franck paulin Ludovig pehn Mayo
17 Jul 2022
Scott MacKenzie
17 Jul 2022
@Franck paulin Ludovig pehn Mayo, thanks for re-posting the spreadsheet. However, there is no chart in the spreadsheet. What is needed -- to ensure we have the same interpretation of the data -- is a spreadsheet with the data and the chart. The chart must be generated from the data in the 1st worksheet, not from data that are manually entered and are separate from the data in the 1st worksheet. It's important to understand how the chart is created from the data for which you are interested in doing an ANOVA.
BTW, I assume "recognition rate" is the mean for each condition of the 1s and 0s in the "correct/not correct" column, expressed as a percent (i.e., x100). Correct?
Also, my initial comment referred to the grouped bar chart you posted called "Exa.JPG". Seems you posted an additional bar chart called "a1.JPG", which is completely different. I'm focusing on Exa.JPG at the moment.
Franck paulin Ludovig pehn Mayo
17 Jul 2022
Scott MacKenzie
17 Jul 2022
@Franck paulin Ludovig pehn Mayo, sorry, but trying to figure out how your data were obtained and organized is just too much work. Bottom line is I can't recreate your chart and I can't figure out how you created it from the raw data. The problem (or part of the problem) is that your chart is based on data that were manually transcribed:

I don't know where the numbers in the formula came from, since they were manually entered. The first number is 90, which I assume is for the first participant (P1), but I'm just guessing. Elsewhere in this worksheet or on the worksheet for P1, I don't see this number calculated anywhere, so it's a bit of a dead end.
Franck paulin Ludovig pehn Mayo
17 Jul 2022
Scott MacKenzie
17 Jul 2022
@Franck paulin Ludovig pehn Mayo, I did look at the P7 worksheet, but I still can't sort things out. For example, the first manually-entered value in the formula for cell L17 in the MEAN worksheet is 90. Where does this number come from? The only 90 on the P7 worksheet is also manually entered and it is for a different orientation/thickness condition.
Franck paulin Ludovig pehn Mayo
17 Jul 2022
Edited: Franck paulin Ludovig pehn Mayo
17 Jul 2022
Scott MacKenzie
18 Jul 2022
@Franck paulin Ludovig pehn Mayo, I'm not sure how to move forward with your data for the purpose of an analysis of variance. Perhaps a different approach is appropriate. A new issue I just noticed is that some data are missing. From the bar chart you posted (copied below), I had the impression your design was 4x3:

But, it's not. There were measurements on participants for only 6 of these 12 conditions. The six conditions yielding a recognition rate of 100% are just made up, or placeholders, or something. There's likely an explanation and it probably makes sense. But these bars do not reflect measurements on participants, as there are no corresponding data in the table. So perhaps the design (for a possible analysis of variance) is 3x2, but I'm not sure.
BTW, regarding a comment you made earlier -- that the participants' information was not needed -- knowing which data correspond to which participant is important and a central part of an analysis of variance.
Perhaps Adam's answer is useful to you. Good luck.
Franck paulin Ludovig pehn Mayo
18 Jul 2022
Edited: Franck paulin Ludovig pehn Mayo
18 Jul 2022
Adam Danz
18 Jul 2022
@Franck paulin Ludovig pehn Mayo, your question is about applying a statistic to the data, but the majority of this thread is back-and-forth questions trying to understand your data. It's really confusing to say that some conditions are actually other conditions. This thread currently has 267 views and almost 30 comments since it was posted 6 days ago, which suggests a lot of time has been put into this. It shouldn't be this difficult to explain 12 data points (the number of bars in your figure).
I want to see you succeed in this goal, so please let me give some advice.
In the future, it would benefit you to spend some time cleaning up the data so it's very easy to explain and understand before you ask the question. Also, whenever you generate a plot, provide the code so we don't have to figure out what you're doing; that adds additional tasks we must work through before we even get to your question. It looks like those bar plots were made outside of MATLAB, but taking the time to figure out how to do it in MATLAB so you can ask a clearer question would help a lot.
Adam Danz
20 Jul 2022
Just FYI, I deleted Carlos' answer because it was spam. He merely copied content from this Investopedia article and embedded a spam link at the end. Since the entirety of his content is available in the link above and was not authored by him, his content was removed and his profile has been flagged as spam.
I saw you voted for his answer so I wanted to explain why it's no longer here.
Franck paulin Ludovig pehn Mayo
20 Jul 2022
I came across these statistical methods 15 years ago and am still trying to understand which ones suit different sets of data and questions. It wasn't until about 5 years ago that I realized my long-term confusion wasn't a problem with my understanding -- it's a problem in the field of statistics in general. So many peer-reviewed articles apply statistics incorrectly or do not show that the data are fit for the selected statistics. Worse yet, some people keep applying different statistics until they get the results they want, which is p-hacking. Three years ago, hundreds of scientists and statisticians around the globe supported a movement to change how we think about and practice statistics (see the list of articles at the bottom of this answer). What's nice about bootstrapped CIs is that they can be used to visualize how closely related two distributions are, rather than just providing a number such as p<0.005.
I'm not swaying you away from using an ANOVA method - but I am arguing that the movement mentioned is a big step forward in statistics.
Answers (1)
I recommend using bootstrapped confidence intervals. The idea is to resample your accuracy data with replacement and compute the mean on the sample for each condition. If you repeat this many times (1000, for example), you'll have a distribution of means which can be used to compute the middle 95% interval. Fortunately, MATLAB has a function that does most of the work: bootci, which is demo'd in this comment. After you have the CIs for each condition, you can plot them using errorbar. If the CIs of two conditions do not overlap, it is likely that the data from those conditions come from different distributions.
Here's a demo that performs bootstrapped CIs for a single condition in your data. I would set up a loop to compute CIs for all conditions, but I still do not understand which conditions to compare since the data do not appear to be nested. Perhaps if the 'thickness' values were corrected in some way, it would be clearer. But first, give it a shot.
T = readtable('https://www.mathworks.com/matlabcentral/answers/uploaded_files/1062725/Anovan_Ptestdata.xlsx','VariableNamingRule','preserve');
thickIdx = T.thicknesss == 0.04;
orientIdx = strcmp(T.orientation, 'vertical');
CI = bootci(1000, {@mean, T.("correct/not")(thickIdx & orientIdx)}, 'Type', 'per')
mu = mean(T.("correct/not")(thickIdx & orientIdx));
bar(mu)
hold on
errorbar(1, mu, mu-CI(1), CI(2)-mu, 'k-','LineWidth',1) % neg/pos inputs are non-negative error-bar lengths
14 Comments
Franck paulin Ludovig pehn Mayo
16 Jul 2022
Edited: Franck paulin Ludovig pehn Mayo
16 Jul 2022
Scott MacKenzie
17 Jul 2022
@Franck paulin Ludovig pehn Mayo, I'm just seeing your question now. There are comments and an answer from @Adam Danz, so perhaps we're done here. However, let me add a comment.
To me, the most informative part of your question is the grouped bar chart. It shows the relationship between two independent variables (x-axis) and a dependent variable (y-axis). The independent variables are orientation with 3 levels (horizontal, vertical, control) and thickness with 4 levels (0.02, 0.03, 0.04, and control). The dependent variable is recognition rate (%). This looks appropriate for an analysis of variance. And you are not alone in wondering how to do this in MATLAB: there are at least 5 MATLAB ANOVA functions! The ANOVA will help answer three questions:
- Is there a significant effect of orientation on recognition rate?
- Is there a significant effect of thickness on recognition rate?
- Is there a significant Orientation x Thickness interaction effect on recognition rate?
This can be set up fairly easily in MATLAB, but first, there are some issues that need to be clarified. The experiment engaged nine participants ("individuals" in the question) with five repetitions of the measurements for each participant on each condition. But there is no column in the data set indicating which rows correspond to which participants. Ditto for repetition. Can you add columns for the participant codes and repetition numbers?
Also, I assume "0" in the thickness column corresponds to the "control" level for thickness, but please confirm.
Finally, note that there is a small labelling error in the bar chart. The x-axis label corresponds to the bar groups. This should be "Thickness", not "Orientation". Orientation appears via the bars within groups. So, if you wish to include "Orientation" in the chart, it should appear as the title for the legend entries.
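As a rough sketch of how those three questions could be tested with anovan, here is a minimal two-way layout. The column names ('thicknesss', 'orientation', 'correct/not') are taken from the file posted earlier in the thread, and this ignores the participant/repetition structure discussed above:

```matlab
% Load the posted data; 'preserve' keeps the original column names
T = readtable('https://www.mathworks.com/matlabcentral/answers/uploaded_files/1062725/Anovan_Ptestdata.xlsx', ...
    'VariableNamingRule', 'preserve');
y = T.("correct/not");              % dependent variable: 1 = correct, 0 = not
g = {T.thicknesss, T.orientation};  % two independent (grouping) variables
[p, tbl, stats] = anovan(y, g, 'model', 'interaction', ...
    'varnames', {'Thickness','Orientation'});
% p(1): Thickness main effect, p(2): Orientation main effect, p(3): interaction
```

Once a participant column exists, it could be added to g as a random factor via anovan's 'random' option.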
Franck paulin Ludovig pehn Mayo
17 Jul 2022
Adam Danz
18 Jul 2022
> I have attached the new data sheet...
If you'd like to apply the confidence intervals I suggested in my answer, you can get that started and share the code and I can help get you un-stuck if needed.
Franck paulin Ludovig pehn Mayo
19 Jul 2022
Edited: Franck paulin Ludovig pehn Mayo
19 Jul 2022
I'll break down your code below.
Here, you're looking at data in column "Var5" of your table, from rows that come from the conditions Var2==0.04 and Var4=="vertical".
T = readtable('https://www.mathworks.com/matlabcentral/answers/uploaded_files/1068365/Newfile.xlsx');
thickIdx = T.Var2 == 0.04;
orientIdx = strcmp(T.Var4, 'vertical');
data = T.("Var5")(thickIdx & orientIdx)
Then you're bootstrapping the mean from that selection of data. "bci" is the 95% confidence interval (CI) of the mean and "bmeans" are the 1000 bootstrapped means. See bootci for details.
% number of bootstraps
nBoot = 1000;
[bci,bmeans] = bootci(nBoot, {@mean,data}, 'Type', 'per')
I don't know why you want the mean of the bootstrapped means. Maybe you have good reason for this. The line you commented out computes the mean of the raw data.
% bootstrap sample mean
bmu = mean(bmeans);
%mu = mean(data);
I'm not sure what "lower level bootstrapping" is. Is that a term I used somewhere in another thread? The for-loop merely implements the same type of bootstrapping that the bootci function does above. I was probably comparing the bootci functionality to another method of directly implementing bootstrapping (still not sure where you saw this but it does look like mine). The randi function resamples the data with replacement which is important to do in bootstrapping. Then the prctile line computes the CIs with the percentile method in the same way bootci does when type='per'.
% Now repeat that process with lower-level bootstrapping
% using the same sampling procedure and the same data.
bootMeans = nan(1,nBoot);
for i = 1:nBoot
bootMeans(i) = mean(data(randi(numel(data),size(data))));
end
CI = prctile(bootMeans,[5,95]);
mu = mean(bootMeans);
This part plots the distribution of bootstrapped means from bootci
% Plot
figure()
ax1 = subplot(2,1,1);
histogram(bmeans);
This adds the mean of the bootstrap means. Maybe you want to show the mean of the data instead.
hold on
xline(bmu, 'k-', sprintf('mu = %.2f',bmu),'LineWidth',2)
Here you add the bootci CIs
xline(bci(1),'k-',sprintf('%.1f',bci(1)),'LineWidth',2)
xline(bci(2),'k-',sprintf('%.1f',bci(2)),'LineWidth',2)
title('bootci()')
Then you repeat with the lower level bootstrapping method which unsurprisingly has the same results.
% plot the lower-level, direct computation results
ax2 = subplot(2,1,2);
histogram(bootMeans);
hold on
xline(mu, 'k-', sprintf('mu = %.2f',mu),'LineWidth',2)
xline(CI(1),'k-',sprintf('%.1f',CI(1)),'LineWidth',2)
xline(CI(2),'k-',sprintf('%.1f',CI(2)),'LineWidth',2)
title('Lower level')
linkaxes([ax1,ax2], 'xy')
% bar(bmu)
% hold on
% errorbar(1, mu, mu-CI(1), mu-CI(2), 'k-','LineWidth',1)
But this isn't your initial goal. This is useful for computing the CIs (use one method or the other; no need to do both), but your initial goal is to compute the CI, not to plot the distributions and such.
Once you have the CIs for each condition, you can add them to your bar plot using the errorbar function.
Franck paulin Ludovig pehn Mayo
19 Jul 2022
Edited: Franck paulin Ludovig pehn Mayo
19 Jul 2022
Franck paulin Ludovig pehn Mayo
20 Jul 2022
Edited: Franck paulin Ludovig pehn Mayo
20 Jul 2022
You're plotting the bars separately. Instead, plot them all together: bar([m1 m2 m3]). Then apply the errorbars in the same way, so you are creating 1 errorbar object that has 3 error bars.
It should look like this,
bar([1 2 3])
hold on
errorbar([1 2 3], 1:3, rand(1,3), rand(1,3),'k-','LineStyle','none','LineWidth',1)
Franck paulin Ludovig pehn Mayo
20 Jul 2022
Adam Danz
20 Jul 2022
"Fixing it" can be hairy.
Sometimes an insufficient amount of data is collected, such that the sample does not reflect the unobservable full population. For example, if I'm calling people randomly to ask what their favorite ice cream is, maybe I accidentally called a disproportionately high number of lactose-intolerant people. In that case, then yes, collecting more data can reveal a more accurate picture of the population.
But if your sample of data already reflects the population, collecting more data will not change the outcome.
Most importantly, the amount of data you collect should not be decided from the resultant statistic. In other words, you should decide how much data to collect independently from the results. Otherwise, that's p-hacking, and it's really bad science.
If your data reflect the underlying population, and if your bars overlap, then that's the result, that's the answer to your question, that's reality. In that case, you cannot conclude that the two populations of means come from different distributions.
I did a study for 4 years and had those unexpected results - that two groups did not differ even though everyone expected them to differ. This is an opportunity to investigate why. Maybe previous studies had a different methodology or maybe the model should be viewed differently.
About comparing different conditions, all you have to do is change your indexing.
BTW I just noticed that your variables orientIdx2 orientIdx3 orientIdx4 are all the same thing. You only need one of those. Take some time to understand what these lines are doing.
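For instance, instead of duplicated index variables, the per-condition CIs could be computed in a loop along these lines (a sketch; the column names follow the file posted earlier, and bootci is from the Statistics and Machine Learning Toolbox):

```matlab
% Load the posted data; 'preserve' keeps the original column names
T = readtable('https://www.mathworks.com/matlabcentral/answers/uploaded_files/1062725/Anovan_Ptestdata.xlsx', ...
    'VariableNamingRule', 'preserve');

thickVals  = unique(T.thicknesss);
orientVals = unique(T.orientation);
CIs = nan(numel(thickVals)*numel(orientVals), 2);   % one [lower upper] row per condition
row = 0;
for i = 1:numel(thickVals)
    for j = 1:numel(orientVals)
        row = row + 1;
        idx = T.thicknesss == thickVals(i) & strcmp(T.orientation, orientVals{j});
        if nnz(idx) > 1   % skip empty or singleton conditions
            CIs(row,:) = bootci(1000, {@(x)mean(x,'omitnan'), T.("correct/not")(idx)}, ...
                'Type', 'per').';
        end
    end
end
```

Rows of CIs that stay NaN flag thickness/orientation pairs with no data, which also makes the non-nested design visible.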
Franck paulin Ludovig pehn Mayo
20 Jul 2022
Adam Danz
20 Jul 2022
- This is more of an art form than a science. There are lots of bits of advice out there for knowing when enough is enough. It's been obvious to me when I don't have enough data, but less obvious when I've collected enough. I have used cross-validation to help make that decision. The main idea is: if I remove something like 10-20% of my data and get approximately the same results, then I have enough data.
- It wouldn't be surprising if the CIs differ by a very small amount between runs. bootci uses a random selection of your data, so the results can differ by a very small amount. If you're getting noticeably different results between runs, something is wrong: either you're not running enough bootstraps (1000 should be enough, but you could try more) or you're not providing the exact same input data between runs. This is definitely something you want to investigate.
- I still don't understand your dataset enough to imagine this comparison. If any given data point has a thickness property and an orientation property and you want to know whether thickness or orientation has a stronger effect, then I don't think you can do that with this bootstrapping method which makes me fear that this entire multiple-day thread has nothing to do with your actual goals. The main lesson, if this is the case, is that the data and the goals must be crystal clear to you and to the readers before a useful answer can be written.
I realized you previously asked about NaNs in your bootci results, but I forgot to address that question. By default, mean does not ignore NaNs, and if there is a NaN in the data, the mean will be NaN. You can omit NaNs using
___ = bootci(nBoot, {@(x)mean(x,'omitnan'),data}, 'Type', 'per')
That's all the time I have for this thread @Franck paulin Ludovig pehn Mayo. I hope these ideas will be helpful to you even if you don't end up needing them.
Franck paulin Ludovig pehn Mayo
20 Jul 2022