@aamijar
Member

@aamijar aamijar commented Oct 14, 2025

Resolves #1201

This PR introduces the Spectral Clustering algorithm, which builds on top of Spectral Embedding.

This PR will enable us to remove the legacy and inaccurate spectral clustering algorithm called partition.

This PR also adds double precision support for spectral embedding and spectral clustering. The new templates require refactoring the source files into .cuh headers, with the template instantiations in .cu files.

Gtests with synthetic data have been added to this PR, in a similar fashion to the kmeans gtests.
I have tested this PR through the Python bindings and by comparing the plots with sklearn's SpectralClustering.

Here is a plot of sklearn vs. ours.
[Image: Screenshot 2025-10-14 at 2 53 01 PM]

@aamijar aamijar self-assigned this Oct 14, 2025
@aamijar aamijar added the non-breaking Introduces a non-breaking change label Oct 14, 2025
@aamijar aamijar added the feature request New feature or request label Oct 14, 2025
@aamijar aamijar requested a review from jnke2016 October 14, 2025 15:54
@aamijar
Member Author

aamijar commented Oct 15, 2025

Update:
I have added the current version of the gtests, which are structured similarly to the tests in kmeans.cu.
I had to reduce the ARI threshold score.
I was also able to set up the same test in Python to check what happens with sklearn's SpectralClustering when you pass in this synthetic data. However, sklearn gives a perfect ARI of 1.0, so something must be going wrong in our implementation. I am digging in to see what the issue could be.

Update:
It turns out that the solver is working fine; we just need to increase its precision by using a smaller tolerance. I also confirmed that the output matrix of the Laplacian step is the same between sklearn and ours.
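
For reference, the adjusted Rand index (ARI) used as the quality score above can be sketched on the host as follows. This is a minimal sketch from the standard contingency-table definition, not the code used by the gtests or by sklearn; the names `pairs` and `adjusted_rand_index` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <utility>
#include <vector>

// Number of unordered pairs that can be drawn from n items (n choose 2).
static double pairs(std::int64_t n)
{
  return 0.5 * static_cast<double>(n) * static_cast<double>(n - 1);
}

// Hypothetical helper: ARI between two labelings of the same samples.
double adjusted_rand_index(const std::vector<int>& truth, const std::vector<int>& pred)
{
  if (truth.size() != pred.size()) { throw std::invalid_argument("label vectors differ in length"); }
  const auto n = static_cast<std::int64_t>(truth.size());
  if (n < 2) { return 1.0; }

  // Contingency table and its marginals, stored sparsely.
  std::map<std::pair<int, int>, std::int64_t> cells;
  std::map<int, std::int64_t> rows, cols;
  for (std::size_t i = 0; i < truth.size(); ++i) {
    ++cells[{truth[i], pred[i]}];
    ++rows[truth[i]];
    ++cols[pred[i]];
  }

  double sum_cells = 0.0, sum_rows = 0.0, sum_cols = 0.0;
  for (const auto& kv : cells) { sum_cells += pairs(kv.second); }
  for (const auto& kv : rows) { sum_rows += pairs(kv.second); }
  for (const auto& kv : cols) { sum_cols += pairs(kv.second); }

  const double expected  = sum_rows * sum_cols / pairs(n);
  const double max_index = 0.5 * (sum_rows + sum_cols);
  // Both labelings are trivial (one cluster, or all singletons): treat as perfect agreement.
  if (max_index == expected) { return 1.0; }
  return (sum_cells - expected) / (max_index - expected);
}
```

The score is invariant to a permutation of the label values, which is why a clustering can be compared against the ground truth without first matching cluster ids.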

@aamijar
Member Author

aamijar commented Oct 16, 2025

Update:
I'm running into some problems with GPU KMeans potentially not being consistent with CPU KMeans. Is that expected? For example, if I replace the KMeans in sklearn's spectral clustering with cuml.cluster KMeans, it gives a much lower ARI.
This was solved by setting oversampling_factor=0.

Update:
Key findings. The following changes improve the clustering quality:
- oversampling_factor=0 (k-means++ init method)
- increasing the solver precision, tol=1e-5 -> 0
- a fix to remove duplicate entries in the Laplacian
- max_iterations = 10 * n_samples instead of a fixed 1000

With these changes we achieve near-perfect scores against the ground truth in the gtests: a mean ARI of 0.97 across 20 runs.
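
As an illustration of the duplicate-entry fix, the idea of dropping repeated (row, col) coordinates from a COO matrix can be sketched on the host as below. This is only a sketch under assumptions: the type `coo_entry` and the function `remove_duplicate_entries` are hypothetical, and the choice to keep the first occurrence (rather than summing duplicates) is an assumption, since the comment does not say how the PR resolves them.

```cpp
#include <algorithm>
#include <tuple>
#include <vector>

// Hypothetical host-side COO entry; the PR itself works on device COO matrices.
struct coo_entry {
  int row;
  int col;
  float value;
};

// Returns the entries sorted by (row, col) with duplicate coordinates removed.
// Assumption: the first occurrence of each coordinate is the one that is kept.
std::vector<coo_entry> remove_duplicate_entries(std::vector<coo_entry> entries)
{
  // A stable sort keeps the original relative order inside each (row, col) group.
  std::stable_sort(entries.begin(), entries.end(), [](const coo_entry& a, const coo_entry& b) {
    return std::tie(a.row, a.col) < std::tie(b.row, b.col);
  });
  // std::unique keeps the first element of every run of equal coordinates.
  auto last = std::unique(entries.begin(), entries.end(), [](const coo_entry& a, const coo_entry& b) {
    return a.row == b.row && a.col == b.col;
  });
  entries.erase(last, entries.end());
  return entries;
}
```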

@aamijar aamijar marked this pull request as ready for review October 18, 2025 01:09
@aamijar aamijar requested review from a team as code owners October 18, 2025 01:09
@@ -0,0 +1,38 @@
/*
* Copyright (c) 2025, NVIDIA CORPORATION.
Member

I propose we name this file spectral.hpp instead of spectral_clustering. The clustering is already in the namespace, so it's redundant.

Member Author

Addressed in f1ab56a

* @}
*/

void create_connectivity_graph(raft::resources const& handle,
Member

@cjnolet cjnolet Oct 21, 2025

This is not part of the standard "fit()" "predict()" APIs that one would normally use day to day, so please add it to the cuvs::preprocessing::spectral_embedding::helpers namespace.

Member Author

Addressed in f1ab56a

@cjnolet cjnolet mentioned this pull request Oct 21, 2025
@aamijar aamijar force-pushed the spectral-clustering branch from 2bbd1e5 to 2d90bb5 Compare October 21, 2025 21:50
@aamijar aamijar force-pushed the spectral-clustering branch from f364560 to f377524 Compare October 22, 2025 23:01
Contributor

@viclafargue viclafargue left a comment

Great work, LGTM! It looks like the connectivity graph uses an int as its nnz type. I guess this is intentional and temporary?

Comment on lines 54 to 59
raft::linalg::transpose(handle,
                        embedding_col_major.data_handle(),
                        embedding_row_major.data_handle(),
                        n_samples,
                        config.n_components,
                        raft::resource::get_cuda_stream(handle));
Contributor

I guess it is necessary to double the VRAM usage and perform a transposition here, but it would be great if we could find a workaround in the long term.

Member Author

Thanks for the review! Yes, we have to switch from column-major to row-major, since spectral embedding needs column-major and kmeans needs row-major. And yes, we are currently using int as the nnz type until we resolve the scale-up work in spectral embedding.
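
To make the layout switch concrete, here is a minimal host-side sketch of an out-of-place column-major to row-major conversion. It only illustrates the index mapping and why a second buffer of the same size is needed; it is not the `raft::linalg::transpose` implementation, and the function name `col_major_to_row_major` is hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: copy an n_rows x n_cols matrix from column-major to row-major storage.
template <typename T>
std::vector<T> col_major_to_row_major(const std::vector<T>& in, std::size_t n_rows, std::size_t n_cols)
{
  if (in.size() != n_rows * n_cols) { throw std::invalid_argument("size does not match shape"); }
  // Out-of-place: the output buffer is as large as the input, hence the doubled memory.
  std::vector<T> out(in.size());
  for (std::size_t r = 0; r < n_rows; ++r) {
    for (std::size_t c = 0; c < n_cols; ++c) {
      // Column-major element (r, c) lives at c * n_rows + r; row-major at r * n_cols + c.
      out[r * n_cols + c] = in[c * n_rows + r];
    }
  }
  return out;
}
```

For the embedding discussed here, `n_rows` would correspond to n_samples and `n_cols` to n_components.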

int n_components;
int n_init;
int n_neighbors;
uint64_t seed;


Can we expose the rng_state instead of the seed?

Member Author

Addressed in d9d4bc8


void fit_predict(raft::resources const& handle,
params config,
raft::device_coo_matrix_view<float, int, int, int> connectivity_graph,
@jnke2016 jnke2016 Oct 26, 2025

raft::device_coo_matrix_view<double, int, int, int> is not supported, while cuGraph has templates for both float and double. Can we add support for the latter as well? Otherwise, I get the error below:

error: no suitable user-defined conversion from "raft::device_coo_matrix_view<double, int, int, int>" to "raft::device_coo_matrix_view<float, int, int, int>" exists

Member Author

I have added double support in 5c608f3
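
The general pattern behind supporting both value types is a function template plus one explicit instantiation per type. The sketch below shows that pattern with standard C++ only; `total_edge_weight` is a hypothetical stand-in and not the actual `fit_predict` signature or the code added in the commit.

```cpp
#include <vector>

// Would live in a header (.cuh in this PR's layout): the templated definition.
// Hypothetical stand-in for an API that is generic over the edge weight type.
template <typename value_t>
value_t total_edge_weight(const std::vector<value_t>& weights)
{
  value_t total = value_t{0};
  for (const value_t& w : weights) { total += w; }
  return total;
}

// Would live in a source file (.cu in this PR's layout): one explicit
// instantiation per supported type, so callers link against compiled symbols.
template float total_edge_weight<float>(const std::vector<float>&);
template double total_edge_weight<double>(const std::vector<double>&);
```

With both instantiations present, a caller holding double-valued data no longer needs a conversion to the float variant, which is the situation the compiler error above describes.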

@jnke2016

I tested this PR in cuGraph and provided a few suggestions. A cuGraph PR incorporating the cuVS Spectral Clustering API is ready for review and will be merged as soon as this PR is. @aamijar

Labels

feature request New feature or request non-breaking Introduces a non-breaking change

Projects

Development

Successfully merging this pull request may close these issues.

Spectral Clustering

4 participants