Decoding Attention is specially optimized for multi-head attention (MHA), multi-query attention (MQA), grouped-query attention (GQA), and multi-head latent attention (MLA), using CUDA cores for the decoding stage of LLM inference.
Topics: gpu, cuda, inference, nvidia, mha, mla, multi-head-attention, gqa, mqa, llm, large-language-model, flash-attention, cuda-core, decoding-attention, flashinfer, flashmla
- Updated Jun 11, 2025
- Language: C++
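
To illustrate what the decoding stage computes, below is a minimal CPU reference sketch in plain C++ of single-token (decode-step) attention over a KV cache with a grouped-query head mapping. It is not Decoding Attention's API; all function names, parameters, and memory layouts here are assumptions made for illustration. MHA and MQA fall out as the special cases `num_kv_heads == num_q_heads` and `num_kv_heads == 1`; MLA additionally compresses the KV cache into a latent representation and is not shown.

```cpp
// Illustrative CPU reference for decode-stage grouped-query attention (GQA).
// This sketches the computation such kernels implement; names and layouts are
// assumptions, not the library's interface.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// q:        [num_q_heads, head_dim]            -- single new query token (decode step)
// k_cache:  [num_kv_heads, seq_len, head_dim]  -- cached keys
// v_cache:  [num_kv_heads, seq_len, head_dim]  -- cached values
// out:      [num_q_heads, head_dim]
void decode_gqa_attention(const std::vector<float>& q,
                          const std::vector<float>& k_cache,
                          const std::vector<float>& v_cache,
                          std::vector<float>& out,
                          int num_q_heads, int num_kv_heads,
                          int seq_len, int head_dim) {
    const int group_size = num_q_heads / num_kv_heads;  // query heads per shared KV head
    const float scale = 1.0f / std::sqrt(static_cast<float>(head_dim));

    for (int h = 0; h < num_q_heads; ++h) {
        const int kv_h = h / group_size;  // GQA: map query head to its KV head
        const float* qh = &q[h * head_dim];
        const float* kh = &k_cache[kv_h * seq_len * head_dim];
        const float* vh = &v_cache[kv_h * seq_len * head_dim];
        float* oh = &out[h * head_dim];

        // Scaled dot-product scores of the single query against all cached keys.
        std::vector<float> scores(seq_len);
        float max_score = -INFINITY;
        for (int t = 0; t < seq_len; ++t) {
            float s = 0.0f;
            for (int d = 0; d < head_dim; ++d) s += qh[d] * kh[t * head_dim + d];
            scores[t] = s * scale;
            max_score = std::max(max_score, scores[t]);
        }

        // Numerically stable softmax over the sequence dimension.
        float denom = 0.0f;
        for (int t = 0; t < seq_len; ++t) {
            scores[t] = std::exp(scores[t] - max_score);
            denom += scores[t];
        }

        // Weighted sum of cached values.
        for (int d = 0; d < head_dim; ++d) oh[d] = 0.0f;
        for (int t = 0; t < seq_len; ++t) {
            const float w = scores[t] / denom;
            for (int d = 0; d < head_dim; ++d) oh[d] += w * vh[t * head_dim + d];
        }
    }
}

int main() {
    const int num_q_heads = 8, num_kv_heads = 2, seq_len = 16, head_dim = 64;
    std::vector<float> q(num_q_heads * head_dim, 0.01f);
    std::vector<float> k(num_kv_heads * seq_len * head_dim, 0.02f);
    std::vector<float> v(num_kv_heads * seq_len * head_dim, 0.03f);
    std::vector<float> out(num_q_heads * head_dim);
    decode_gqa_attention(q, k, v, out, num_q_heads, num_kv_heads, seq_len, head_dim);
    std::printf("out[0] = %f\n", out[0]);
    return 0;
}
```

During decoding each step processes only one query token per sequence, so the attention matmuls reduce to matrix-vector-like operations with low arithmetic intensity; this is the regime where CUDA-core kernels, as targeted by this project, can be competitive with Tensor Core implementations.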