
Conversation

@divyegala (Member) commented Oct 18, 2025

Building on top of the idea in #1418.

This PR reduces the binary size of CUDA 12.9 libcuvs.so from 1175.32 MB to 1127.43 MB (a ~48 MB, ~4% reduction).
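A sketch of the mechanism, as I understand it from the review thread below (illustrative names, not the PR's exact code): the interleaved-scan kernel that was previously instantiated for both int8_t and uint8_t is compiled once against a type-erased byte pointer carrying a signedness flag, removing one set of template instantiations from the binary.

```cuda
#include <cstdint>

// Illustrative only: one type-erased argument replaces separate
// int8_t / uint8_t template instantiations of the scan kernel.
struct byte_data {
  void* data     = nullptr;
  bool is_signed = false;  // true when the bytes are really int8_t
};

// A single compiled kernel serves both element types: when in.is_signed,
// int8 lanes are remapped to uint8 on the fly instead of compiling a
// second instantiation into the binary.
__global__ void interleaved_scan_sketch(byte_data in, uint32_t n)
{
  // ... load bytes from in.data, normalizing lanes when in.is_signed ...
}
```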

Benchmark Parameter Sweep

The benchmarks were run across the following parameter combinations:

  • Data types: int8, uint8
  • Distance metrics: inner_product, sqeuclidean
  • Dataset sizes (n_rows): 25,000, 100,000, 250,000, 500,000, 1,000,000
  • Dimensions (n_cols): 64, 128, 256, 512
  • Query batch sizes (n_queries): 10, 100, 1,000, 10,000
  • k values: 32, 64, 128, 256

Total configurations tested: 1,280 per implementation (2 dtypes × 2 metrics × 5 dataset sizes × 4 dimensions × 4 batch sizes × 4 k values).

Overall Performance Summary

| dtype | metric | tests | avg speedup | median speedup | avg overhead | baseline QPS | hybrid QPS |
|-------|--------|-------|-------------|----------------|--------------|--------------|------------|
| int8  | sqeuclidean   | 320 | 0.9579x | 0.9661x | +4.94% | 556.8K | 549.6K |
| int8  | inner_product | 320 | 1.0118x | 1.0097x | -1.05% | 586.9K | 589.3K |
| uint8 | sqeuclidean   | 320 | 0.9988x | 0.9808x | +0.62% | 545.4K | 557.2K |
| uint8 | inner_product | 320 | 0.9745x | 0.9910x | +3.11% | 170.9K | 167.9K |

Performance Distribution

| dtype | metric | min speedup | 25th % | 50th % | 75th % | max speedup |
|-------|--------|-------------|--------|--------|--------|-------------|
| int8  | sqeuclidean   | 0.593x | 0.926x | 0.966x | 1.000x | 1.087x |
| int8  | inner_product | 0.878x | 1.000x | 1.010x | 1.029x | 1.174x |
| uint8 | sqeuclidean   | 0.829x | 0.957x | 0.981x | 1.046x | 1.272x |
| uint8 | inner_product | 0.755x | 0.954x | 0.991x | 1.007x | 1.164x |

@copy-pr-bot (bot) commented Oct 18, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@divyegala divyegala marked this pull request as ready for review October 18, 2025 05:43
@divyegala divyegala requested review from a team as code owners October 18, 2025 05:43
@divyegala divyegala requested review from a team as code owners October 20, 2025 03:27
@divyegala divyegala requested a review from bdice October 20, 2025 03:27
Comment on lines +24 to +25
```cpp
void* data = nullptr;
bool is_signed = false;
```
Contributor:

Passing this on the GPU will probably require three words, right? That could have register-usage consequences in some edge cases. If only we could rely on the data being aligned, then we could use the lowest bit of the pointer as the signedness tag and pass this struct everywhere in place of the raw pointers.
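For a rough host-side illustration of the footprint concern (hypothetical code, not from the PR, assuming an LP64 target with 8-byte pointers):

```cpp
#include <cstdint>

struct ptr_and_flag {
  void* data;
  bool is_signed;
};

// The bool pads the struct out to two 8-byte words, versus one word for a
// pointer that hides the flag in its low bit.
static_assert(sizeof(ptr_and_flag) == 16, "pointer + flag: two words");
static_assert(sizeof(std::uintptr_t) == 8, "tagged pointer: one word");
```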

Member Author (@divyegala) commented Oct 20, 2025:

The data being aligned doesn't guarantee that the lowest bit will be set in the int8_t case, right? If the first int8 value is positive, then the sign bit won't be set.

Unless I misunderstood what you are saying?

We could always try to pass this as a reference to device functions instead, if that's helpful.

Contributor:

The pointers will almost surely be aligned, but technically one can pass an unaligned pointer, since the data has one-byte granularity. My question is: would it be acceptable for us to always assume it is at least two-byte aligned (or make an alignment check somewhere and throw an error otherwise)?
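A minimal sketch of such a check (a hypothetical helper, assuming the error is reported by throwing; the repository's actual error machinery may differ):

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical guard: reject int8 data pointers whose low bit is set, since
// that bit would be claimed as the signedness tag.
inline void check_two_byte_aligned(const void* ptr)
{
  if (reinterpret_cast<std::uintptr_t>(ptr) & 0x1) {
    throw std::invalid_argument("int8 data must be at least 2-byte aligned");
  }
}
```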

Member Author (@divyegala):

I think it should be acceptable for us to check that. In the case of alignment, how do you propose we check for signedness?

Contributor:

I'd do something along the lines of this:

```cpp
struct byte_arithmetic_ptr {
 private:
  constexpr static uintptr_t kSignMask = 0x1;
  uintptr_t value_;

 public:
  // uint8_t data: store the pointer as-is (the tag bit stays clear).
  byte_arithmetic_ptr(uint8_t* ptr) : value_(reinterpret_cast<uintptr_t>(ptr)) {}
  // int8_t data: tag the (assumed 2-byte-aligned) pointer with the sign bit.
  byte_arithmetic_ptr(int8_t* ptr) : value_(reinterpret_cast<uintptr_t>(ptr) | kSignMask) {}

  // Note: get_data() cannot be constexpr, since reinterpret_cast is not
  // permitted in constant expressions.
  void* get_data() const { return reinterpret_cast<void*>(value_ & ~kSignMask); }
  constexpr bool is_signed() const { return (value_ & kSignMask) == kSignMask; }
};
```
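(A hypothetical call site for the struct sketched above, just to show the intent, again assuming 2-byte-aligned int8 inputs:)

```cpp
#include <cassert>
#include <cstdint>

int main()
{
  alignas(2) static int8_t queries[256] = {};
  byte_arithmetic_ptr ptr{queries};                     // tags the pointer
  assert(ptr.is_signed());                              // element type recovered
  auto* bytes = static_cast<uint8_t*>(ptr.get_data());  // untagged data pointer
  (void)bytes;
}
```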

Member Author (@divyegala):

Unfortunately, it looks like with that assumption even our gtests don't pass. Here's an int8 address: 139761184407553, and it is not 2-byte aligned.

Contributor:

Do we have these pointers passed down from the public API anywhere?

Member Author (@divyegala):

Sorry, what do you mean? The ivf_flat::search public API just passes these pointers along to the ivf_flat_interleaved_scan function.

Contributor:

Ah, so it's the query pointer that causes the problem (the IVF lists are aligned). Then it's debatable whether it makes sense to require it to be aligned. It could be good for performance, but (1) loading queries isn't a bottleneck, and (2) it would require changes to the code where we increment by a potentially odd offset:

```cpp
query = query + query_id * dim;
```
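(A small illustration of point (2), with hypothetical values: even a 2-byte-aligned base pointer ends up unaligned after an odd element offset, so the low bit is no longer free for tagging.)

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
  alignas(2) static int8_t queries[4096] = {};
  const int dim = 65, query_id = 3;                // 3 * 65 = 195, an odd offset
  const int8_t* query = queries + query_id * dim;  // odd address
  std::printf("2-byte aligned: %d\n",
              (reinterpret_cast<std::uintptr_t>(query) & 0x1) == 0);  // prints 0
}
```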

Member Author (@divyegala):

I think we can keep this in mind for later, in case performance becomes an issue? I just posted some performance numbers in the description body, and IMO they look pretty good :)

@divyegala divyegala changed the title Using byte_array in interleaved_scan_kernel Using byte_arithmetic_ptr in interleaved_scan_kernel Oct 20, 2025
@achirkin (Contributor):

@divyegala Could you please look into the worst-case scenario where you reported an almost two-fold slowdown (int8, euclidean distance)?

  • Is it one or a few consistent cases, or random fluctuation?
  • Is it one of the small-data-size cases that are latency-bound, or a more relevant big-data-size case where throughput is impacted due to, e.g., accidentally broken vectorized loads?

@divyegala (Member Author):

@achirkin here are the top 10 worst cases. To me, the first one looks like a possible one-off; I'm not sure what happened there exactly. For the other cases, I think 20% is an acceptable perf drop, keeping in mind that the raw numbers are a few hundred microseconds and that the average drop is only ~5%, meaning we make up the numbers elsewhere.

Top 10 Worst Cases

(Overhead is the (hybrid - baseline) time divided by n_queries.)

| Rank | n_rows | n_cols | n_queries | k | Baseline (ms) | Hybrid (ms) | Speedup | Overhead (μs/query) |
|------|--------|--------|-----------|---|---------------|-------------|---------|---------------------|
| 1  | 1,000,000 | 64  | 100 | 128 | 0.786 | 1.324 | 0.593x | 5.4  |
| 2  | 100,000   | 512 | 10  | 32  | 0.447 | 0.590 | 0.758x | 14.3 |
| 3  | 250,000   | 64  | 10  | 32  | 0.419 | 0.535 | 0.783x | 11.6 |
| 4  | 25,000    | 512 | 10  | 128 | 0.444 | 0.564 | 0.786x | 12.1 |
| 5  | 100,000   | 128 | 100 | 128 | 0.449 | 0.571 | 0.787x | 1.2  |
| 6  | 500,000   | 128 | 10  | 128 | 0.432 | 0.546 | 0.790x | 11.5 |
| 7  | 25,000    | 512 | 10  | 32  | 0.437 | 0.550 | 0.794x | 11.3 |
| 8  | 25,000    | 512 | 100 | 64  | 0.465 | 0.579 | 0.803x | 1.1  |
| 9  | 25,000    | 512 | 10  | 64  | 0.430 | 0.532 | 0.807x | 10.3 |
| 10 | 250,000   | 64  | 10  | 64  | 0.420 | 0.520 | 0.808x | 10.0 |

@achirkin (Contributor):

Are you doing micro-benchmarks or a full ann-bench search? Don't these translate to QPS 1:1? If so, 20% is not so little, tbh.

@divyegala (Member Author):

@achirkin microbenchmarks via the Python API. I think the understanding is that we are okay with trading a less performant int8 for the binary-size savings. I think the overall impact of ~5% for int8 + sqeuclidean is not bad at all, given that this is also a temporary measure. As mentioned offline:

  1. CUDA 13 will support the full type matrix via JIT/LTO.
  2. We will revert these changes for CUDA 12 when we upgrade to driver r580.
  3. int8 is a less commonly used type than uint8.

cc @cjnolet

@achirkin (Contributor):

Ok, thanks for the clarification. Could you please still run the micro-benchmarks with more iterations/warmup, or full benchmarks, for the worst-case scenario found (0.593x speedup)? Just to be on the safe side.

@divyegala (Member Author):

Yes. Can you suggest which dataset you would like to see? I'll run cuvs-bench for it.

@divyegala (Member Author):

@achirkin I also ran the worst-case benchmark a few more times and found the median time to go from 1.324 ms to 1.006 ms, which brings it to a speedup of about 0.781x.

```cuda
encV[k] = normalize_int8_packed(encV[k]);
queryRegs[k] = normalize_int8_packed(queryRegs[k]);
}
compute_dist(dist, queryRegs[k], encV[k]);
```
Contributor:

Aren't we supposed to normalize only when the metric is L2Expanded?

Member Author (@divyegala):

Yes, this code path is only instantiated for int8 when the metric is euclidean-based: https://github.com/rapidsai/cuvs/pull/1437/files#diff-403c2fcd55246356fc13c2f369a109027c0983f6fafa31109d56f6f9cb273439R978-R980
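For context, a plausible shape for such a packed normalization (a sketch only, assuming it flips each byte's sign bit to remap int8 [-128, 127] onto uint8 [0, 255]; the PR's actual definition may differ):

```cuda
#include <cstdint>

// Flip the sign bit of each of the four int8 lanes packed in a 32-bit word,
// i.e. map each lane x -> x + 128 (mod 256). The shift preserves differences,
// so squared euclidean distances computed on the uint8 lanes match the int8
// ones.
__device__ inline uint32_t normalize_int8_packed(uint32_t packed)
{
  return packed ^ 0x80808080u;
}
```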

Contributor (@lowener) left a comment:

The changes look good!
Modify generate_ivf_flat.py too to reflect this change.

```cuda
}

// Specialization for byte_arithmetic_ptr -> uint8_t* (for int8_t normalization)
__device__ inline void copy_vectorized(uint8_t* out,
```
Contributor:

Can those functions be in another file that can be included? That way, the other algorithms switching to byte arrays can use them.

Member Author (@divyegala):

I believe @tarang-jain did a scan and these functions are unused elsewhere.

Contributor:

That's correct! copy_vectorized is not being used elsewhere.

@divyegala (Member Author):

> Modify generate_ivf_flat.py too to reflect this change.

@lowener we don't use generate_ivf_flat.py to generate the TUs changed in this PR.


Labels

feature request (New feature or request), non-breaking (Introduces a non-breaking change)
