
Conversation

@divyegala (Member) commented Oct 18, 2025

Building on top of the idea in #1418.

This PR reduces the binary size of CUDA 12.9 libcuvs.so from 1175.32 MB to 1127.43 MB (a ~48 MB, ~4% reduction).
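A sketch of the mechanism, as I understand it from the review thread below (illustrative names, not the PR's exact code): the interleaved-scan kernel that was previously instantiated for both int8_t and uint8_t is compiled once against a type-erased byte pointer carrying a signedness flag, removing one set of template instantiations from the binary.

```cuda
#include <cstdint>

// Illustrative only: one type-erased argument replaces separate
// int8_t / uint8_t template instantiations of the scan kernel.
struct byte_data {
  void* data     = nullptr;
  bool is_signed = false;  // true when the bytes are really int8_t
};

// A single compiled kernel serves both element types: when in.is_signed,
// int8 lanes are remapped to uint8 on the fly instead of compiling a
// second instantiation into the binary.
__global__ void interleaved_scan_sketch(byte_data in, uint32_t n)
{
  // ... load bytes from in.data, normalizing lanes when in.is_signed ...
}
```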

Benchmark Parameter Sweep

The benchmarks were run across the following parameter combinations:

  • Data types: int8, uint8
  • Distance metrics: inner_product, sqeuclidean
  • Dataset sizes (n_rows): 25,000, 100,000, 250,000, 500,000, 1,000,000
  • Dimensions (n_cols): 64, 128, 256, 512
  • Query batch sizes (n_queries): 10, 100, 1,000, 10,000
  • k values: 32, 64, 128, 256

Total configurations tested: 1,280 per implementation (2 dtypes × 2 metrics × 5 dataset sizes × 4 dimensions × 4 batch sizes × 4 k values).

Overall Performance Summary

| dtype | metric | tests | avg speedup | median speedup | avg overhead | baseline QPS | hybrid QPS |
|-------|--------|-------|-------------|----------------|--------------|--------------|------------|
| int8  | sqeuclidean   | 320 | 0.9579x | 0.9661x | +4.94% | 556.8K | 549.6K |
| int8  | inner_product | 320 | 1.0118x | 1.0097x | -1.05% | 586.9K | 589.3K |
| uint8 | sqeuclidean   | 320 | 0.9988x | 0.9808x | +0.62% | 545.4K | 557.2K |
| uint8 | inner_product | 320 | 0.9745x | 0.9910x | +3.11% | 170.9K | 167.9K |

Performance Distribution

| dtype | metric | min speedup | 25th % | 50th % | 75th % | max speedup |
|-------|--------|-------------|--------|--------|--------|-------------|
| int8  | sqeuclidean   | 0.593x | 0.926x | 0.966x | 1.000x | 1.087x |
| int8  | inner_product | 0.878x | 1.000x | 1.010x | 1.029x | 1.174x |
| uint8 | sqeuclidean   | 0.829x | 0.957x | 0.981x | 1.046x | 1.272x |
| uint8 | inner_product | 0.755x | 0.954x | 0.991x | 1.007x | 1.164x |

@copy-pr-bot (bot) commented Oct 18, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@divyegala divyegala marked this pull request as ready for review October 18, 2025 05:43
@divyegala divyegala requested review from a team as code owners October 18, 2025 05:43
@divyegala divyegala requested review from a team as code owners October 20, 2025 03:27
@divyegala divyegala requested a review from bdice October 20, 2025 03:27
Comment on lines +24 to +25
```cpp
void* data = nullptr;
bool is_signed = false;
```
Contributor:

Passing this on the GPU will probably require three words, right? That could have register-usage consequences in some edge cases. If only we could rely on the data being aligned, then we could use the lowest bit of the pointer as the signedness tag and pass this struct everywhere in place of the raw pointers.
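For a rough host-side illustration of the footprint concern (hypothetical code, not from the PR, assuming an LP64 target with 8-byte pointers):

```cpp
#include <cstdint>

struct ptr_and_flag {
  void* data;
  bool is_signed;
};

// The bool pads the struct out to two 8-byte words, versus one word for a
// pointer that hides the flag in its low bit.
static_assert(sizeof(ptr_and_flag) == 16, "pointer + flag: two words");
static_assert(sizeof(std::uintptr_t) == 8, "tagged pointer: one word");
```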

Member Author (@divyegala) commented Oct 20, 2025:

The data being aligned doesn't guarantee that the lowest bit will be set in the int8_t case, right? If the first int8 value is positive, then the sign bit won't be set.

Unless I misunderstood what you are saying?

We could always try to pass this as a reference to device functions instead, if that's helpful.

Contributor:

The pointers will almost surely be aligned, but technically one can pass an unaligned pointer, since the data has one-byte granularity. My question is: would it be acceptable for us to always assume it is at least two-byte aligned (or make an alignment check somewhere and throw an error otherwise)?
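A minimal sketch of such a check (a hypothetical helper, assuming the error is reported by throwing; the repository's actual error machinery may differ):

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical guard: reject int8 data pointers whose low bit is set, since
// that bit would be claimed as the signedness tag.
inline void check_two_byte_aligned(const void* ptr)
{
  if (reinterpret_cast<std::uintptr_t>(ptr) & 0x1) {
    throw std::invalid_argument("int8 data must be at least 2-byte aligned");
  }
}
```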

Member Author (@divyegala):

I think it should be acceptable for us to check that. In the case of alignment, how do you propose we check for signedness?

Contributor:

I'd do something along the lines of this:

```cpp
struct byte_arithmetic_ptr {
 private:
  constexpr static uintptr_t kSignMask = 0x1;
  uintptr_t value_;

 public:
  // uint8_t data: store the pointer as-is (the tag bit stays clear).
  byte_arithmetic_ptr(uint8_t* ptr) : value_(reinterpret_cast<uintptr_t>(ptr)) {}
  // int8_t data: tag the (assumed 2-byte-aligned) pointer with the sign bit.
  byte_arithmetic_ptr(int8_t* ptr) : value_(reinterpret_cast<uintptr_t>(ptr) | kSignMask) {}

  // Note: get_data() cannot be constexpr, since reinterpret_cast is not
  // permitted in constant expressions.
  void* get_data() const { return reinterpret_cast<void*>(value_ & ~kSignMask); }
  constexpr bool is_signed() const { return (value_ & kSignMask) == kSignMask; }
};
```
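(A hypothetical call site for the struct sketched above, just to show the intent, again assuming 2-byte-aligned int8 inputs:)

```cpp
#include <cassert>
#include <cstdint>

int main()
{
  alignas(2) static int8_t queries[256] = {};
  byte_arithmetic_ptr ptr{queries};                     // tags the pointer
  assert(ptr.is_signed());                              // element type recovered
  auto* bytes = static_cast<uint8_t*>(ptr.get_data());  // untagged data pointer
  (void)bytes;
}
```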

Member Author (@divyegala):

Unfortunately, it looks like with that assumption even our gtests don't pass. Here's an int8 address: 139761184407553, and it is not 2-byte aligned.

Contributor:

Do we have these pointers passed down from the public API anywhere?

Member Author (@divyegala):

Sorry, what do you mean? The ivf_flat::search public API just passes these pointers along to the ivf_flat_interleaved_scan function.

Contributor:

Ah, so it's the query pointer that causes the problem (the IVF lists are aligned). Then it's debatable whether it makes sense to require it to be aligned. It could be good for performance, but (1) loading queries isn't a bottleneck, and (2) it would require changes to the code where we increment by a potentially odd offset:

```cpp
query = query + query_id * dim;
```
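(A small illustration of point (2), with hypothetical values: even a 2-byte-aligned base pointer ends up unaligned after an odd element offset, so the low bit is no longer free for tagging.)

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
  alignas(2) static int8_t queries[4096] = {};
  const int dim = 65, query_id = 3;                // 3 * 65 = 195, an odd offset
  const int8_t* query = queries + query_id * dim;  // odd address
  std::printf("2-byte aligned: %d\n",
              (reinterpret_cast<std::uintptr_t>(query) & 0x1) == 0);  // prints 0
}
```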

Member Author (@divyegala):

I think we can keep this in mind for later, in case performance becomes an issue? I just posted some performance numbers in the description body, and IMO they look pretty good :)

@divyegala divyegala changed the title Using byte_array in interleaved_scan_kernel Using byte_arithmetic_ptr in interleaved_scan_kernel Oct 20, 2025
@achirkin (Contributor):

@divyegala Could you please look into the worst-case scenario where you reported an almost two-fold slowdown (int8, euclidean distance)?

  • Is it one or a few consistent cases, or random fluctuation?
  • Is it one of the small-data-size cases that are latency-bound, or a more relevant big-data-size case where throughput is impacted due to, e.g., accidentally broken vectorized loads?

@divyegala (Member Author):

@achirkin here are the top 10 worst cases. To me, the first one looks like a possible one-off; I'm not sure what happened there exactly. For the other cases, I think 20% is an acceptable perf drop, keeping in mind that the raw numbers are a few hundred microseconds and that the average drop is only ~5%, meaning we make up the numbers elsewhere.

Top 10 Worst Cases

(Overhead is the (hybrid - baseline) time divided by n_queries.)

| Rank | n_rows | n_cols | n_queries | k | Baseline (ms) | Hybrid (ms) | Speedup | Overhead (μs/query) |
|------|--------|--------|-----------|---|---------------|-------------|---------|---------------------|
| 1  | 1,000,000 | 64  | 100 | 128 | 0.786 | 1.324 | 0.593x | 5.4  |
| 2  | 100,000   | 512 | 10  | 32  | 0.447 | 0.590 | 0.758x | 14.3 |
| 3  | 250,000   | 64  | 10  | 32  | 0.419 | 0.535 | 0.783x | 11.6 |
| 4  | 25,000    | 512 | 10  | 128 | 0.444 | 0.564 | 0.786x | 12.1 |
| 5  | 100,000   | 128 | 100 | 128 | 0.449 | 0.571 | 0.787x | 1.2  |
| 6  | 500,000   | 128 | 10  | 128 | 0.432 | 0.546 | 0.790x | 11.5 |
| 7  | 25,000    | 512 | 10  | 32  | 0.437 | 0.550 | 0.794x | 11.3 |
| 8  | 25,000    | 512 | 100 | 64  | 0.465 | 0.579 | 0.803x | 1.1  |
| 9  | 25,000    | 512 | 10  | 64  | 0.430 | 0.532 | 0.807x | 10.3 |
| 10 | 250,000   | 64  | 10  | 64  | 0.420 | 0.520 | 0.808x | 10.0 |

@achirkin (Contributor):

Are you doing micro-benchmarks or a full ann-bench search? Don't these translate to QPS 1:1? If so, 20% is not so little, tbh.

@divyegala (Member Author):

@achirkin microbenchmarks via the Python API. I think the understanding is that we are okay with trading a less performant int8 for the binary-size savings. I think the overall impact of ~5% for int8 + sqeuclidean is not bad at all, given that this is also a temporary measure. As mentioned offline:

  1. CUDA 13 will support the full type matrix via JIT/LTO.
  2. We will revert these changes for CUDA 12 when we upgrade to driver r580.
  3. int8 is a less commonly used type than uint8.

cc @cjnolet

@achirkin (Contributor):

Ok, thanks for the clarification. Could you please still run the micro-benchmarks with more iterations/warmup, or full benchmarks, for the worst-case scenario found (0.593x speedup)? Just to be on the safe side.

@divyegala (Member Author):

Yes. Can you suggest which dataset you would like to see? I'll run cuvs-bench for it.

@divyegala (Member Author):

@achirkin I also ran the worst-case benchmark a few more times and found the median time to go from 1.324 ms to 1.006 ms, which brings it to a speedup of about 0.781x.

```cuda
encV[k] = normalize_int8_packed(encV[k]);
queryRegs[k] = normalize_int8_packed(queryRegs[k]);
}
compute_dist(dist, queryRegs[k], encV[k]);
```
Contributor:

Aren't we supposed to normalize only when the metric is L2Expanded?

Member Author (@divyegala):

Yes, this code path is only instantiated for int8 when the metric is euclidean-based: https://github.com/rapidsai/cuvs/pull/1437/files#diff-403c2fcd55246356fc13c2f369a109027c0983f6fafa31109d56f6f9cb273439R978-R980
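For context, a plausible shape for such a packed normalization (a sketch only, assuming it flips each byte's sign bit to remap int8 [-128, 127] onto uint8 [0, 255]; the PR's actual definition may differ):

```cuda
#include <cstdint>

// Flip the sign bit of each of the four int8 lanes packed in a 32-bit word,
// i.e. map each lane x -> x + 128 (mod 256). The shift preserves differences,
// so squared euclidean distances computed on the uint8 lanes match the int8
// ones.
__device__ inline uint32_t normalize_int8_packed(uint32_t packed)
{
  return packed ^ 0x80808080u;
}
```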

Contributor (@lowener) left a comment:

The changes look good!
Modify generate_ivf_flat.py too to reflect this change.

```cuda
}

// Specialization for byte_arithmetic_ptr -> uint8_t* (for int8_t normalization)
__device__ inline void copy_vectorized(uint8_t* out,
```
Contributor:

Can those functions be in another file that can be included? That way, the other algorithms switching to byte arrays can use them.

Member Author (@divyegala):

I believe @tarang-jain did a scan and these functions are unused elsewhere.

Contributor:

That's correct! copy_vectorized is not being used elsewhere.

@divyegala (Member Author):

> Modify generate_ivf_flat.py too to reflect this change.

@lowener we don't use generate_ivf_flat.py to generate the TUs changed in this PR.


Labels

feature request (New feature or request), non-breaking (Introduces a non-breaking change)
