Using byte_arithmetic_ptr in interleaved_scan_kernel #1437
Conversation
void* data = nullptr;
bool is_signed = false;
Passing this on the GPU will probably require three words, right? This could have register-usage consequences in some edge cases. If only we could rely on the data being aligned - then we could use the lowest bit of the pointer as the signedness tag and pass this struct everywhere in place of the pointers.
The data being aligned doesn't guarantee that the lowest bit will be set in the int8_t case, right? If the first int8 value is positive then the sign bit won't be set.
Unless I misunderstood what you are saying?
We could always try to pass this as a reference to device functions instead, if that's helpful.
The pointers will almost surely be aligned, but technically one can pass an unaligned pointer, since the data has one-byte granularity. My question is: would it be acceptable for us to always assume it's at least two-byte aligned (or make an alignment check somewhere and throw an error otherwise)?
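For illustration, such a check might look like this (a sketch only; it assumes RAFT_EXPECTS from RAFT's error utilities is available, and where exactly the check should live is the open question):

#include <cstdint>
#include <raft/core/error.hpp>

inline void check_two_byte_alignment(const void* ptr)
{
  // Reject pointers whose lowest bit is set; that bit is needed as the tag.
  RAFT_EXPECTS((reinterpret_cast<uintptr_t>(ptr) & 0x1) == 0,
               "byte_arithmetic_ptr requires at least 2-byte aligned data");
}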
I think it should be acceptable for us to check that. If we rely on alignment for the tag bit, how do you propose we check for signedness?
I'd do something along the lines of this:
#include <cstdint>

struct byte_arithmetic_ptr {
 private:
  // The lowest pointer bit doubles as the signedness tag (hence the
  // 2-byte alignment requirement).
  constexpr static uintptr_t kSignMask = 0x1;
  uintptr_t value_;

 public:
  byte_arithmetic_ptr(uint8_t* ptr) : value_(reinterpret_cast<uintptr_t>(ptr)) {}
  byte_arithmetic_ptr(int8_t* ptr) : value_(reinterpret_cast<uintptr_t>(ptr) | kSignMask) {}

  void* get_data() const { return reinterpret_cast<void*>(value_ & ~kSignMask); }
  bool is_signed() const { return (value_ & kSignMask) == kSignMask; }
};
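A quick usage sketch (hypothetical, assuming the struct above), just to show the intended round-trip:

#include <cassert>
#include <cstdint>

int main()
{
  alignas(2) int8_t buf[4] = {1, -2, 3, -4};
  byte_arithmetic_ptr p(buf);                       // sets the tag bit
  assert(p.is_signed());
  assert(p.get_data() == static_cast<void*>(buf));  // tag stripped on access
}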
Unfortunately, it looks like even our gtests don't pass. Here's an int8 address: 139761184407553 - it's odd, so it is not 2-byte aligned.
Do we have these pointers passed down from the public API anywhere?
Sorry, what do you mean? The ivf_flat::search public API just passes these pointers along to the ivf_flat_interleaved_scan function.
Ah, so it's the query pointer that causes the problem (the IVF lists are aligned). Then it's debatable whether it makes sense to require it to be aligned. It could be good for performance, but (1) loading queries isn't a bottleneck, and (2) it would require changes to the code where we increment by a potentially odd offset:
query = query + query_id * dim;
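To illustrate the odd-offset point (hypothetical values, not taken from the PR): with int8 data and an odd dim, the advanced query pointer loses 2-byte alignment.

#include <cstdint>
#include <cstdio>

int main()
{
  uintptr_t base = 0x1000;  // 2-byte aligned base of the query batch
  int dim = 97, query_id = 1;
  uintptr_t query = base + static_cast<uintptr_t>(query_id) * dim;
  // prints: query = 0x1061, 2-byte aligned = 0
  std::printf("query = 0x%zx, 2-byte aligned = %d\n",
              static_cast<size_t>(query), (query & 1) == 0 ? 1 : 0);
}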
I think we can keep this in mind for later, in case performance becomes an issue? I just posted some performance #s in the description body and IMO they look pretty good :)
Title changed: byte_array in interleaved_scan_kernel -> byte_arithmetic_ptr in interleaved_scan_kernel
@divyegala Could you please look into the worst-case scenario where you reported an almost two-fold slowdown? (int8, euclidean distance)

@achirkin here's the top 10 worst cases. To me, the first one looks like a possible one-off; I'm not sure what happened there exactly. For the other cases, I think 20% is an acceptable perf drop, keeping in mind that the raw numbers are a few hundred microseconds and the average drop is only ~5%, meaning we make up the numbers elsewhere.

Top 10 Worst Cases

Are you doing micro-benchmarks or a full ann-bench search? Don't these translate to QPS 1:1? If so, 20% is not so little, tbh.

@achirkin microbenchmarks via the Python API. I think the understanding is that we are okay with the tradeoff of a less performant … cc @cjnolet

Ok, thanks for the clarification. Could you please still run the micro-benchmarks with more iterations/warmup, or full benchmarks, for the worst-case scenario found (0.593x speedup)? Just to be on the safe side.

Yes. Can you suggest which dataset you would like to see? I'll run cuvs-bench for it.

@achirkin I also ran the worst-case benchmark a few more times and found the median time to go from 1.324 ms to 1.006 ms, which brings it to about a speedup of 0.781x.
encV[k] = normalize_int8_packed(encV[k]);
queryRegs[k] = normalize_int8_packed(queryRegs[k]);
}
compute_dist(dist, queryRegs[k], encV[k]);
Aren't we supposed to normalize only when the metric is L2Expanded?
Yes, this code-path is only instantiated for int8 when the metric is euclidean-based https://github.com/rapidsai/cuvs/pull/1437/files#diff-403c2fcd55246356fc13c2f369a109027c0983f6fafa31109d56f6f9cb273439R978-R980
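For context, here's a plausible sketch of what a packed int8 normalization can look like (hypothetical; not necessarily the PR's normalize_int8_packed). XOR-ing 0x80 into every byte of a 32-bit word adds 128 to each int8 lane, mapping [-128, 127] onto [0, 255] while leaving lane-wise differences - and hence euclidean distances - unchanged:

#include <cstdint>

__device__ inline uint32_t normalize_int8_packed_sketch(uint32_t packed)
{
  // Flip the sign bit of each of the four packed bytes: int8 -> uint8 + 128.
  return packed ^ 0x80808080u;
}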
The changes look good! Please modify generate_ivf_flat.py too to reflect this change.
}

// Specialization for byte_arithmetic_ptr -> uint8_t* (for int8_t normalization)
__device__ inline void copy_vectorized(uint8_t* out,
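A hypothetical sketch of what such a specialization could reduce to (not the PR's actual implementation; the name, signature, and loop shape are illustrative, and byte_arithmetic_ptr is the struct sketched earlier):

__device__ inline void copy_vectorized_sketch(uint8_t* out,
                                              byte_arithmetic_ptr in,
                                              uint32_t n)
{
  const uint8_t* src = static_cast<const uint8_t*>(in.get_data());
  // When the source is int8, flip the sign bit so the output is uint8-shifted.
  const uint8_t flip = in.is_signed() ? 0x80u : 0x00u;
  for (uint32_t i = threadIdx.x; i < n; i += blockDim.x) {
    out[i] = src[i] ^ flip;
  }
}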
Can those functions be in another file that can be included? That way, the other algorithms switching to the byte array can use them.
I believe @tarang-jain did a scan and these functions are unused elsewhere.
That's correct! copy_vectorized is not being used elsewhere.
@lowener we don't use …
Building on top of the idea in #1418.

This PR reduces the binary size of the CUDA 12.9 libcuvs.so from 1175.32 MB to 1127.43 MB.

Benchmark Parameter Sweep

The benchmarks were run across the following parameter combinations: data types int8 and uint8; metrics inner_product and sqeuclidean. Total configurations tested: 1,280 (per implementation).

Overall Performance Summary

Performance Distribution