I have a data set that is x,y,v. Each vector is quite long, ~2 million rows. Also, the data is scattered, i.e. not at regular x,y intervals/spacings. An x-y plot of a small section is included here. When plotting x-y you can see groupings of 9 data points each. There will be ~200K of these groupings. I need to cluster the data set and get an average v value for each cluster of 9 points.
I have used kmeans clustering:
[idx,CC] = kmeans([x,y],round(length(x)/9),'Replicates',10,'Options',statset('UseParallel',1));
This works for maybe 66% of the clusters. In other words, about 66% of the clusters it finds are composed of 9 data points as desired. The other clusters contain somewhat fewer or more points. Also attached is a histogram of the number of data points in the clusters that kmeans returns. The large peak at 9 is what I want and is 66% of the total values...
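For reference, the cluster-size check described above can be sketched as follows. This is a minimal illustration, not the poster's actual code: the `idx` vector here is a tiny made-up stand-in for the label vector that `kmeans` returns, and `targetSize` is set to 3 instead of 9 so the example stays short.

```matlab
% idx stands in for the label vector returned by kmeans (values 1..k).
% Tiny made-up example so the snippet runs on its own.
idx = [1 1 1 2 2 3 3 3 3]';

% Number of points assigned to each cluster label.
counts = accumarray(idx, 1);

% Fraction of clusters that have the target size (9 in the question, 3 here).
targetSize = 3;
fracOK = mean(counts == targetSize);

% Labels of the clusters with the wrong size, for closer inspection.
badClusters = find(counts ~= targetSize);
```

With the real data, `histogram(counts)` would reproduce the kind of histogram mentioned above.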
If I know how many points should be in each cluster (9) and I know the intra-point distance within the cluster (because the spacing of the 9 data points within a cluster is constant), is there a way to improve upon these results of kmeans? Can I stipulate that a cluster must have 9 points within it? Can I stipulate that a cluster can have a distance between its members that is no more than a prescribed value?
Another option I'm considering is looping over the kmeans calculation, each time keeping only the clusters that have 9 points and repeating kmeans on the data set with those entries removed. This may work, but it seems there should be a better way, considering I know a decent amount about the data structure.
Thanks for any suggestions!
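The looping idea could look roughly like the sketch below. It is only an illustration under stated assumptions: the data are synthetic 3x3 groups generated here, `maxPasses` is a hypothetical safety limit, and a kmeans cluster that happens to contain 9 points is not guaranteed to be one of the true groups, so the result would still need checking (for example against the known spacing within a group).

```matlab
% Synthetic stand-in data: six 3x3 groups, spacing 1 inside a group,
% groups 10 apart. Requires Statistics and Machine Learning Toolbox.
rng(0);
[gx, gy] = meshgrid(0:10:20, 0:10:10);     % 6 group centres
[ox, oy] = meshgrid(-1:1, -1:1);           % 9 offsets per group
x = reshape(gx(:)' + ox(:), [], 1);
y = reshape(gy(:)' + oy(:), [], 1);
xy = [x, y];

label = zeros(size(xy, 1), 1);             % 0 = not yet assigned
nextLabel = 0;
maxPasses = 20;                            % hypothetical safety limit
for pass = 1:maxPasses
    todo = find(label == 0);               % points still unassigned
    if isempty(todo), break, end
    k = round(numel(todo) / 9);            % expected number of groups left
    idx = kmeans(xy(todo, :), k);
    counts = accumarray(idx, 1, [k 1]);
    keep = find(counts == 9);              % clusters with the right size
    [tf, loc] = ismember(idx, keep);
    label(todo(tf)) = nextLabel + loc(tf); % give kept clusters new labels
    nextLabel = nextLabel + numel(keep);
end
```

Any points still labelled 0 after the loop were never placed in a 9-point cluster.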

Comments: 5

the cyclist on 1 Sep 2022
Can you upload the data in a MAT file?
Paul Safier on 1 Sep 2022
@the cyclist I will take a look at the DBSCAN documentation, thanks.
Some test data is attached.
Paul Safier on 2 Sep 2022
@the cyclist I was able to get the kmeans method to work in a loop such that it iterates until all the clusters are properly captured. It is, however, relatively slow and takes a lot of memory.
The dbscan method you suggested worked as well and isn't too time or memory expensive. Thanks for the suggestion!
the cyclist on 3 Sep 2022
Edited: the cyclist on 3 Sep 2022
Glad it worked out for you.
I'm curious what inputs worked for you, using dbscan (and if success seemed dependent on getting them right).
Paul Safier on 3 Sep 2022
@the cyclist. I used 9 for minpts since I know the clusters should all have 9 points in them. For epsilon, I iterated until the output contained the correct number of clusters. I used a smaller test clip for this, and I knew from inspecting a plot how many clusters I needed to get. The value was 2.4276e-5. My clusters are all the same size, so I may have an easier-than-normal problem. Thanks again.
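The epsilon search described in this comment might be sketched like this. It is an assumption-laden illustration rather than the code actually used: the test clip is synthetic, `expectedClusters` is the count one would read off a plot, and the `candidates` range is hypothetical and would need to match the scale of the real coordinates (the value reported above was 2.4276e-5).

```matlab
% Synthetic test clip: six 3x3 groups, spacing 1 inside a group,
% groups 10 apart, with a little jitter.
% Requires Statistics and Machine Learning Toolbox (dbscan).
rng(0);
[gx, gy] = meshgrid(0:10:20, 0:10:10);     % 6 group centres
[ox, oy] = meshgrid(-1:1, -1:1);           % 9 offsets per group
x = reshape(gx(:)' + ox(:), [], 1) + 0.01*randn(54, 1);
y = reshape(gy(:)' + oy(:), [], 1) + 0.01*randn(54, 1);

expectedClusters = 6;                      % known from inspecting a plot
minpts = 9;                                % every group has 9 points
candidates = logspace(-2, 1, 30);          % hypothetical search range

epsilon = NaN;                             % stays NaN if nothing works
for e = candidates
    idx = dbscan([x, y], e, minpts);       % idx == -1 marks noise
    nFound = max([idx; 0]);                % 0 when everything is noise
    if nFound == expectedClusters && ...
            all(accumarray(idx(idx > 0), 1) == 9)
        epsilon = e;                       % smallest candidate that works
        break
    end
end
```

The loop stops at the smallest candidate that yields the expected number of clusters with 9 points each; that value can then be reused on the full data set, provided the spacing is the same everywhere.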


Accepted Answer

the cyclist on 1 Sep 2022


I think you might have success if you try the DBSCAN algorithm instead.
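As a rough illustration of this suggestion, the sketch below runs `dbscan` on synthetic data and then averages v per cluster, which is what the question asks for. The data, the `epsilon` value, and the variable names are assumptions made for the example; on the real data `epsilon` would have to be chosen from the known spacing within a group.

```matlab
% Synthetic stand-in data: twenty 3x3 groups, spacing 1 inside a group,
% groups 10 apart. v is roughly constant within each group.
% Requires Statistics and Machine Learning Toolbox (dbscan).
rng(0);
[gx, gy] = meshgrid(0:10:40, 0:10:30);     % 20 group centres
[ox, oy] = meshgrid(-1:1, -1:1);           % 9 offsets per group
x = reshape(gx(:)' + ox(:), [], 1) + 0.01*randn(180, 1);
y = reshape(gy(:)' + oy(:), [], 1) + 0.01*randn(180, 1);
v = repelem((1:20)', 9) + 0.1*randn(180, 1);

epsilon = 3;     % assumed: wider than one group, narrower than the gap
minpts  = 9;     % every group should contain 9 points

idx = dbscan([x, y], epsilon, minpts);     % idx == -1 marks noise points

inCluster = idx > 0;                       % ignore noise when averaging
nClusters = max(idx);
counts = accumarray(idx(inCluster), 1, [nClusters 1]);
meanV  = accumarray(idx(inCluster), v(inCluster), [nClusters 1], @mean);
```

Unlike kmeans, the number of clusters is not specified up front; checking that `counts` is 9 everywhere is a quick sanity test on the chosen `epsilon`.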

More Answers (1)

Image Analyst on 3 Sep 2022


See my attached dbscan demo.

Comments: 1

Paul Safier on 3 Sep 2022
@Image Analyst. I will look this over. Many years back I found another of your demos quite useful. Thanks for making them!


Release: R2022a

Asked: 1 Sep 2022

Commented: 3 Sep 2022
