How to identify data set characteristics which influence the success of a model using those data sets as input.

Question

Wayne Martin on 22 Apr 2024

Commented: Wayne Martin on 3 May 2024

I am studying the effect of hurricanes on coral reefs and have developed a damage prediction model which uses as inputs the fragility and distribution of different coral species at 150 post-storm survey sites. I can also create multiple simulated reefs by randomly assigning species, colonies and damage from the measured probability distribution functions of those parameters for each species. When I run 1000 simulated reef experiments, the results of my damage prediction are widely distributed, from terrible to great. I need to mine the 1000 simulated reefs to identify patterns which are influencing the success of the model. I expect this is a common scenario and would appreciate any guidance on which tools to use and how to proceed. I have the Statistics and Machine Learning Toolbox.

Answer 1

Yatharth on 3 May 2024

Hello Wayne,

Here are some steps you can take to identify the data characteristics that influence the success of your model.

Start with some basic exploratory data analysis (EDA) to understand the distributions of your parameters and outcomes, identify outliers, and check for obvious patterns or correlations.

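Use "histogram", "boxplot", or "scatter" to visualize the distributions of your parameters and outcomes.
Use "corrplot" to visualize correlations among the parameters and between the parameters and the outcomes. Note that "corrplot" is part of Econometrics Toolbox; with Statistics and Machine Learning Toolbox alone, "corr" combined with "heatmap" or "plotmatrix" gives a similar view.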
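
As a rough illustration of these exploratory steps, here is a minimal sketch. It assumes each simulated reef has been summarized as one row of reef-level features plus one success score. The feature names, the random stand-in data, and the score itself are all hypothetical placeholders, not your actual variables. It uses "corr" and "heatmap" rather than "corrplot", since "corrplot" requires Econometrics Toolbox.

```matlab
% Hypothetical stand-in data: 1000 simulated reefs, 4 made-up reef-level features.
rng(1);                                    % reproducible random numbers
nReefs = 1000;
featNames = {'meanFragility','speciesRichness','colonyDensity','branchingFraction'};
X = rand(nReefs, numel(featNames));        % replace with your real reef summaries

% Hypothetical success score per reef. Here it is built to depend on the
% first feature only, so that the analysis has something to find.
success = 2*X(:,1) + 0.3*randn(nReefs,1);

% Distribution of the outcome across the simulated reefs
figure;
histogram(success);
xlabel('Model success score'); ylabel('Number of simulated reefs');

% Each feature plotted against the outcome
figure;
for k = 1:numel(featNames)
    subplot(2,2,k);
    scatter(X(:,k), success, 8, 'filled');
    xlabel(featNames{k}); ylabel('success');
end

% Rank correlation between each feature and the outcome
[rho, pval] = corr(X, success, 'Type', 'Spearman');
disp(table(featNames', rho, pval, ...
    'VariableNames', {'Feature','SpearmanRho','pValue'}));

% Feature-to-feature correlations shown as a heat map
R = corr(X, 'Type', 'Spearman');
figure;
heatmap(featNames, featNames, R);
```

Spearman correlation is used here because it does not assume a linear relationship; 'Pearson' is the default if you omit the 'Type' option.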

With many input parameters, it's crucial to identify which ones significantly impact the model's outcome. Feature selection techniques can help reduce dimensionality and focus on the most influential variables.

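Use "sequentialfs" (sequential feature selection) to find a subset of the input variables that best predicts the outcome. It adds or removes features one at a time and keeps a change only if it improves a criterion you supply, typically a cross-validated prediction error.
Consider principal component analysis with "pca" to reduce dimensionality and possibly uncover underlying patterns in your data. Keep in mind that PCA summarizes the inputs without reference to the outcome, so the leading components are not necessarily the ones that drive model success.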
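
The two techniques above could be tried along the following lines. This is only a sketch under stated assumptions: the six features and the success score are synthetic placeholders, and the criterion (squared error of a plain linear fit) is one possible choice, not necessarily the right one for your damage model.

```matlab
% Hypothetical stand-in data: 1000 simulated reefs, 6 made-up features.
rng(2);                                    % reproducible random numbers
nReefs = 1000;
X = rand(nReefs, 6);                       % replace with your real reef summaries

% Hypothetical success score, built to depend on features 1 and 4 only.
success = 2*X(:,1) - 1.5*X(:,4) + 0.3*randn(nReefs,1);

% Criterion for sequentialfs: fit a linear model (with intercept) on the
% training fold and return the SUM of squared errors on the test fold.
% sequentialfs divides the summed criterion by the number of test points.
addOnes = @(A) [ones(size(A,1),1) A];
critFun = @(Xtr, ytr, Xte, yte) ...
    sum((yte - addOnes(Xte) * (addOnes(Xtr) \ ytr)).^2);

% Forward selection with 5-fold cross-validation
cvp = cvpartition(nReefs, 'KFold', 5);
selected = sequentialfs(critFun, X, success, 'cv', cvp);
disp('Selected feature columns:');
disp(find(selected));

% PCA on standardized features; "explained" holds the percentage of
% variance captured by each principal component.
[coeff, pcScores, ~, ~, explained] = pca(zscore(X));
disp('Percent variance explained by each component:');
disp(explained');

% Check whether any leading component is related to the outcome
rhoPC = corr(pcScores(:,1:3), success, 'Type', 'Spearman');
disp('Spearman correlation of first three components with success:');
disp(rhoPC');
```

If the relationship between reef characteristics and model success is strongly nonlinear, the linear fit inside the criterion can be swapped for another learner, at the cost of longer run time.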

Here are the links for some of the mentioned functions: